Don't pass on small block ciphers (00f.net)
49 points by jstrieb 22 hours ago | 58 comments



What symmetric cryptography is there that would be reasonable on a small 8-bitter? This means

- As little code as possible;

- As little constant data as possible;

- Little to no shifts by amounts not divisible by 8, as there may not be a barrel shifter even for bytes;

- No shifts by variable amounts, including as a space-saving technique, for the same reason;

- No multiplies beyond 16×16 bits, and preferably none at all, as there may not be a multiplier.

Speck, mentioned in TFA, fits this very well. None of the things that came out of eSTREAM or the NIST lightweight cryptography competition even qualify, as far as I can tell, as the “lightweight” part is very keen on things that are easy in hardware but hard (slow, space-hungry, or both) in software. Gimli exists but is kind of chonky. So is Speck truly it? Is just noöne interested in the problem?
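For a sense of how small Speck is: a Speck32/64 sketch in Python, checked against the test vector from the Simon and Speck paper. (A real 8-bit implementation would of course be in C or assembly; the point is that the whole cipher is a handful of adds, XORs, and fixed rotations on 16-bit words.)

```python
MASK = 0xFFFF  # Speck32/64 operates on 16-bit words

def ror16(x, r): return ((x >> r) | (x << (16 - r))) & MASK
def rol16(x, r): return ((x << r) | (x >> (16 - r))) & MASK

def expand_key(k3, k2, k1, k0, rounds=22):
    # Key written most-significant word first: (l2, l1, l0, k0).
    # The key schedule reuses the round function itself.
    l, k = [k1, k2, k3], [k0]
    for i in range(rounds - 1):
        nl = ((ror16(l[i], 7) + k[i]) & MASK) ^ i
        k.append(rol16(k[i], 2) ^ nl)
        l.append(nl)
    return k

def encrypt(x, y, round_keys):
    for rk in round_keys:
        x = ((ror16(x, 7) + y) & MASK) ^ rk
        y = rol16(y, 2) ^ x
    return x, y

def decrypt(x, y, round_keys):
    for rk in reversed(round_keys):
        y = ror16(y ^ x, 2)
        x = rol16(((x ^ rk) - y) & MASK, 7)
    return x, y
```

The only rotations are by fixed amounts (7 and 2), there are no multiplies, and no tables, which is why it fits the constraints above so well.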


ChaCha20 satisfies your conditions.

The only disadvantage of ChaCha20 vs. Speck is its bigger state: you need 128 bytes for it (64 bytes of state + 64 bytes for the intermediate computations), but that is not likely to be a problem except in the smallest microcontrollers.

The bigger state of ChaCha20 is determined by higher security requirements. The advantage of ChaCha20 is that it is supported by standard protocols, e.g. TLS 1.3 and SSH.

The standard protocols mentioned above include ChaCha20 precisely for the case of communication with smaller or older CPUs, which do not have hardware AES support.
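For reference, the ChaCha20 quarter round really is just 32-bit adds, XORs, and fixed-distance rotations. A sketch, checked against the quarter-round test vector in RFC 8439:

```python
M32 = 0xFFFFFFFF

def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & M32

def quarter_round(a, b, c, d):
    # Only 32-bit addition, XOR, and rotations by fixed amounts
    # (16, 12, 8, 7); no multiplies and no variable-distance shifts.
    a = (a + b) & M32; d = rotl32(d ^ a, 16)
    c = (c + d) & M32; b = rotl32(b ^ c, 12)
    a = (a + b) & M32; d = rotl32(d ^ a, 8)
    c = (c + d) & M32; b = rotl32(b ^ c, 7)
    return a, b, c, d
```

The full cipher applies this quarter round to the 4×4 word state in column and diagonal patterns, 20 rounds total.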


Any reason not to use Ascon, which not only got Official Status™ from NIST:

* https://www.nist.gov/news-events/news/2023/02/nist-selects-l...

* https://csrc.nist.gov/pubs/sp/800/232/final

But was also a lightweight finalist in CAESAR (along with ACORN):

* https://en.wikipedia.org/wiki/CAESAR_Competition

* https://en.wikipedia.org/wiki/Ascon_(cipher)


Ascon is a stream-oriented AEAD, not a block cipher, and it requires a nonce. Because of this, it would not work for the use cases in TFA, not to mention it's also quite a bit slower than Speck.

I agree with the article, but I think it could go further. Instead of having primitives for every 32/48/64/128-bit block, we need good format-preserving encryption. Then all of this advice boils down to "use as many bits as you need" and we can keep using the standard primitives with hardware support. If you need more security in the future, you only need to decrypt and re-encrypt with the new size.

Small sizes have to be used with extra care, so I wouldn't want to make a generic function for all sizes. For bigger sizes we already have nice functions that take care of everything.

The article lays out exactly why you'd want small sizes, even with the risks. The good qualifier just means that it'd have to be no riskier than any other algorithm at the same length.

I agree? That doesn't affect what I said. You shouldn't make a one-size-fits-all function that scales that small. It should have to be a deliberate choice to switch from normal mode to small mode, and anyone that hasn't looked into it deeper shouldn't even know about the small mode.

I suppose I don't understand your point. On one hand, you can have different algorithms for each of 32, 64, etc with potentially different pitfalls and usage requirements. On the other, you can have one algorithm that implements all of them. I wasn't trying to comment on how that should be exposed in the library (because crypto lib design is a whole 'nother topic), but I'm not opposed to it being explicit.

Same as CRCs, really. You can easily write a function that performs CRCs of any size and expose different parameterizations as CRC-8/16/32/64 etc.
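As a sketch, a single bitwise CRC routine parameterized the way the catalogued CRC models are. The CRC-8 and CRC-32 wrappers below use the standard SMBus and zlib/PNG parameters; the check values for the string "123456789" are the well-known ones.

```python
def reflect(x, width):
    # Bit-reverse the low `width` bits of x.
    r = 0
    for i in range(width):
        if x & (1 << i):
            r |= 1 << (width - 1 - i)
    return r

def crc(data, width, poly, init, refin, refout, xorout):
    # Generic bitwise CRC for any width >= 8.
    top = 1 << (width - 1)
    mask = (1 << width) - 1
    reg = init
    for byte in data:
        if refin:
            byte = reflect(byte, 8)
        reg ^= byte << (width - 8)
        for _ in range(8):
            reg = ((reg << 1) ^ poly) & mask if reg & top else (reg << 1) & mask
    if refout:
        reg = reflect(reg, width)
    return reg ^ xorout

def crc8(data):   # CRC-8 (SMBus)
    return crc(data, 8, 0x07, 0x00, False, False, 0x00)

def crc32(data):  # CRC-32 (zlib/PNG)
    return crc(data, 32, 0x04C11DB7, 0xFFFFFFFF, True, True, 0xFFFFFFFF)
```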


I'm responding to the idea of "use as many bits as you need" by saying it could be reasonable for small encryption but it should be kept separate from normal encryption and not made into a general statement.

Purely inside the realm of small lengths with deliberate tradeoffs I have no critique on your original statement, but I wanted to make clear that it should stay within that realm or it needs changes.


Are you suggesting a very large custom blocksize? I don't think this would be feasible beyond a few megabytes.

No, a FPE algorithm is a cryptographic construct that uses an existing block cipher (e.g. AES-256) to construct a cryptographically secure permutation of the input without length extension. That is, input size = output size, for all sizes. Ideally, if input size >= block size of the underlying cipher, the resulting permutation is no weaker than just using the cipher directly.

You could use FPE for multi-megabyte permutations, but I don't know why you would.


Not a cryptographer, but I don't like the "advice" of encrypting the first 64 bits of the UUID.

A user of an open-source application using this known "encryption" will be able to approximate the real UUID values based on the creation time of objects they control, and then would probably be able to approximate keys for 64-bit encryption (although I guess one could design a cipher with a far larger key than block size, but it'd be a NIH design with all its pitfalls).

But looking at it sanely, UUIDv7 isn't perfect, and there's really no reason not to "encrypt" the entire UUID with AES (often built into hardware anyhow) instead of just the first part.


All of these small block ciphers still have regular-sized (large) keys.

Nowadays even many small microcontrollers get AES acceleration, so I don't see much reason for them.

The ch32v003 implements RISC-V without the M extension, meaning there's not even a MUL/DIV instruction.

Out of all the microcontrollers I've worked with, only a single one had AES CPU instructions.


Basically all of the use cases in the article don't make sense with AES. That's not because it's AES. That's because its blocks are significantly larger than the data you want to protect. That's the point the article was making: in very specific circumstances, there is practical value in having the cipher output be small.

In that case just use CTR mode, no?

https://www.cs.ucdavis.edu/~rogaway/papers/thorp.pdf

(Not that this is the only solution but that it motivates the problem of why you can't just naively apply AES to the problem).


Some people just itch to use something custom and then have to think about it. Which can bring amazing results, sure, but it can also bring spectacular disasters, especially when we're talking about crypto.

The article is less about crypto and more about improving UUID (and IDs in general) with small block ciphers. It's a low impact mechanism to avoid leaking data that UUID by design does leak. It also doesn't require a good source of entropy.

In the context of encrypting 32 or 64 bit IDs, where there is no nonce, that'd be equivalent to XOR encryption and much weaker than TFA's small block ciphers.

If you really want to encrypt and decrypt 32-bit numbers without having any nonces available, the fastest way on non-microcontroller CPUs remains using the AES instructions.

You can exploit the fact that the core of AES consists of 32-bit invertible mixing functions. In order to extend AES to 128-bit, a byte permutation is used, which mixes the bytes of the 32-bit words.

The AES instructions are such that you can cancel the byte permutation. In this case, you can use the AES instructions to encrypt four 32-bit words separately, instead of one 128-bit block.

Similarly by canceling the standard byte permutation and replacing it with separate permutations on the 2 halves, you can make the AES instructions independently encrypt two 64-bit words.

These AES modifications remain faster than any software cipher.

How to cancel the internal permutation and replace it with external shuffle instructions was already described in the Intel white paper published in 2010, at the launch of Westmere, the first CPU with AES instructions.


Are you certain using AES is still faster? Let's say for a 32-bit block size and 64-bit key.

From https://en.wikipedia.org/wiki/Speck_(cipher), that Speck combination would use 22 rounds, and using the instruction timings for Zen 5 from https://instlatx64.github.io/InstLatx64/AuthenticAMD/Authent..., it looks like each round would take at most 3 cycles. (Dependency chain for each round is 3 instructions long, ror+add+xor). 22*3 = ~66 cycles.

Using AES with a pshufb to take out the ShiftRows step would be 2 cycles for the pshufb and 4 cycles for each aesenc, and at 10 rounds, you have ~60 cycles.

It's quite close, and to say which one wins, we'd need to actually benchmark it. One is not clearly much faster than the other.


Standard AES-128 has a throughput of around 16 bytes per 8 clock cycles or even less in recent CPUs, because they can do 2 or 4 AES instructions per clock cycle (in the modes of operation that are not limited by latency).

AES-128 can be easily modified to independently encrypt four 32-bit words per execution, instead of one 128-bit block, by cancelling the byte permutation that extends the AES mixing function from 32-bit to 128-bit. This would increase the throughput at least twice, depending on whether PSHUFB is done concurrently or not.

You have given the latencies of the instructions, not their throughput. When you use AES in such a way that you are limited by latency, that is normally wrong. The cryptographic libraries have multi-buffer functions, which compute e.g. 8 AES values, so that they are not limited by latencies.

Regarding the parent article, if you want an unpredictable identifier for a record, you should not do this by encrypting some value with the intent of decrypting it in the future. Instead of this, you should use as identifier an unpredictable random number. Such identifiers can be generated with AES in batches, at maximum throughput, and stored until they are needed for assignment to a record.

If you need in your record some information like time of creation or a monotonically increasing number, which you consider private, such information should be put in distinct fields, that you do not give externally, instead of attempting to encrypt them in a record identifier, which would need to be decrypted to access such information.


Maybe the reason they are so close is that the AES microcode is implementing exactly those operations.

There's nothing similar about AES and Speck, and the "microcode" for AES isn't like what you're thinking of. If you want to learn more about it, you can look up the specifications for AES and Intel's AES instruction set.

Would it, though? Either way you're operating in ECB mode with 2^32 or 2^64 values. Why is one more secure than the other?

EDIT: What I mean is you can do cypher = truncate(plain ^ AES(zero_extend(plain))).


>EDIT: What I mean is you can do cypher = truncate(plain ^ AES(zero_extend(plain))).

How would you decrypt that though? You truncated 3/4ths of the AES output needed to decrypt it.

I thought you were suggesting this:

  ciphertext = truncate(AES(key) ^ plaintext)
And in this case, since AES(key) does not depend on the plaintext, it would just be XOR by a constant.

The first examples in the parent article do not require decryption. They only require unpredictable random numbers that are unique.

If uniqueness is needed for a 32-bit number or a 64-bit number, then in AES-128 the byte permutation can be modified, to reduce the block size accordingly.

For the other examples with record identifiers, I am not sure whether the author meant for them to ever be decrypted. If decryption was intended, I disagree that this is the right solution. If an opaque record identifier is desired, it should be an unpredictable random number, which can be generated at maximum speed with AES. There is no need to ever decrypt such an identifier.

If other private information is needed, like a record counter, it should be put in separate fields, that are not provided to external entities, instead of encrypting it inside the identifier. Encrypting such private information would prevent its use in indexing anyway.


You're right, my bad. I guess if you have strict size requirements it does make sense to use small block sizes.

The block size of a block cipher function like AES is important for its security level, but it is completely independent of the size of the data that you may want to encrypt.

Moreover, cryptography has many applications, but the most important 3 of them are data encryption, data integrity verification and random number generation.

The optimal use of a cryptographic component, like a block cipher, depends on the intended application.

If you want e.g. 32-bit random numbers, the fastest method on either Intel/AMD x86-64 CPUs or Arm AArch64 CPUs is to use the 128-bit AES to encrypt a counter value and then truncate its output to 32 bits. The counter that is the input to AES may be initialized with an arbitrary value, e.g. 0 or the current time, and then you may increment only a 32-bit part of it, if you so desire. Similarly, for other sizes of random numbers less than 128 bits, you just truncate the output to the desired size.

You can also produce random numbers that must take one of a number of values different from a power of two, by combining either multiplication or division of the output value with rejection, done either before or after the operation, to remove the bias.
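A sketch of the rejection step for a non-power-of-two range. The `next_word` callback is assumed to return uniform `bits`-bit words, e.g. truncated AES-CTR output; the function name and signature are illustrative:

```python
def uniform_below(n, next_word, bits=32):
    # Map a uniform `bits`-bit generator to [0, n) without modulo bias:
    # reject any draw at or above the largest multiple of n that fits
    # in 2**bits, then reduce mod n.
    span = 1 << bits
    limit = span - (span % n)  # largest multiple of n <= 2**bits
    while True:
        r = next_word()
        if r < limit:
            return r % n
```

Each accepted draw is exactly uniform; the expected number of rejected draws is below one, since `limit` is always more than half of `span`.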

Similarly, for message authentication, if you have some method that produces an 128-bit MAC, it can be truncated to whatever value you believe to be a good compromise between forgery resistance and message length.

For encryption, short data must be encrypted using either the CTR mode of operation or the OCB mode of operation (where only the last incomplete data block is encrypted using the CTR mode). With these modes of operation, the encrypted data can have any length, even a length that is not an integer number of bytes, without any length expansion of the encrypted message.

The advice given in the parent article is not bad, but it makes sense only in 32-bit microcontrollers, because since 2010 for x86-64 and since 2012 for Aarch64 any decent CPU has AES instructions that are much faster than the implementation in software of any other kind of block cipher.

Moreover, for random number generation or for data integrity verification or for authentication, there are alternative methods that do not use a block cipher but they use a wider invertible function, and which may be more efficient, especially in microcontrollers. For instance, for generating 128-bit unpredictable random numbers, you can use a counter with either an 128-bit block cipher function together with a secret key, or with a 256-bit invertible mixing function, where its 128-bit output value is obtained either by truncation or by summing the 2 halves. In the first case the unpredictability is caused by the secret key, while in the second case the unpredictability is caused by the secret state of the counter, which cannot be recovered by observing the reduced-size output.

For applications where a high level of security is not necessary, e.g. for generating 32-bit random numbers, the already high speed of AES-128 (less than 0.5 clock cycles per output byte on recent CPUs) can be increased by reducing the number of AES rounds, e.g. from 10 to 4, with a proportional increase in throughput.


If you want to encrypt a serial number, you don't want the output to be 256 bits.

The size of encrypted data is completely independent of the block size of a block cipher function that is used for data encryption.

Nowadays, there is almost never any reason to use for encryption any other modes of operation except CTR or OCB, which do not expand the size of encrypted data.

That said, the parent article was less about encryption and more about random number generation, which is done by encrypting a counter value, but you never need to decrypt it again.

In RNGs, the block size again does not matter, as the output can be truncated to any desired size.


AES is most often used in a streaming mode, where it's used to generate a keystream. AES alone is useless; it MUST have a mode of operation to provide any security. A streaming mode can then encrypt any number of bits greater than 0. AES-CTR is one of the more common streaming modes.
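A sketch of the CTR structure, with SHA-256 standing in for the AES block function purely to keep the example dependency-free (this is not a real AES-CTR implementation; the key/nonce values are illustrative):

```python
import hashlib

def keystream(key, nonce, nbytes):
    # CTR-style keystream: block_function(key, nonce, counter), with the
    # counter incremented per block. SHA-256 stands in for AES here; the
    # structure, not the block function, is the point.
    out = bytearray()
    counter = 0
    while len(out) < nbytes:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:nbytes])

def ctr_crypt(key, nonce, data):
    # XOR with the keystream: encryption and decryption are the same
    # operation, and ciphertext length equals plaintext length.
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))
```

Note that the keystream never depends on the plaintext, which is why a unique nonce per message is mandatory in real CTR mode.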

A lot of the lightweight cipher justification in this post seems like it overlaps a lot with Format Preserving Cryptography, which uses (generally) more conventional symmetric primitives (16-byte-block ciphers, for instance) to handle encryption with small domains:

https://eprint.iacr.org/2009/251.pdf


>Small block ciphers are thus generally a bad idea against active adversaries.

>However, they can be very useful against passive adversaries whose capability is limited to observing identifiers, who are then unable to map them to the original value.

Really? Isn’t the Sweet32[0] attack mostly passive? “We show that a network attacker who can monitor a long-lived Triple-DES HTTPS connection between a web browser and a website can recover secure HTTP cookies by capturing around 785 GB of traffic.”

[0] https://sweet32.info/


...a long-lived HTTPS connection that manages to transfer >700 GiB of traffic, with no disconnects, and presumably has re-keying disabled? An interesting theoretical setup, I guess.

Small block ciphers are great for some use-cases!

32-bit block ciphers are a good way to create short opaque IDs because they provide a bijection between two sets of integers. And even if your ID is slightly shorter than 32-bit you can easily shave off a few bits with cycle walking: https://en.wikipedia.org/wiki/Format-preserving_encryption#F... E.g. if you want to make sure your IDs can be mapped into 31/63 bits.
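A minimal sketch of cycle walking, with a toy 10-bit affine permutation standing in for a real 32-bit block cipher (the permutation and its modulus are illustrative only):

```python
def cycle_walk(x, perm, max_value):
    # Apply the permutation repeatedly until the output lands back in
    # [0, max_value]. If x starts in that range, this is a bijection on it.
    y = perm(x)
    while y > max_value:
        y = perm(y)
    return y

# Toy permutation of 0..1023: x -> 5x + 3 (mod 1024) is invertible
# because 5 is odd; 205 = 5^-1 mod 1024.
def toy_perm(x):     return (5 * x + 3) % 1024
def toy_perm_inv(y): return (205 * (y - 3)) % 1024
```

Decryption is cycle walking with the inverse permutation: the intermediate values between plaintext and ciphertext are all out of range, so walking backwards stops at exactly the original input.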

I especially like the RC-5 cipher for these kinds of uses. It can be implemented in just a few lines of code and there are standard test vectors for it.


The RC-5 cipher was very nice for its day, but I am certain that it is much slower than AES on any modern CPU, with the exception of microcontrollers, where nonetheless other solutions, e.g. ChaCha20, may be faster.

AES also needs only a handful of lines of code for its implementation (using assembly). For such an application, you can even reduce the number of rounds of AES-128, e.g. from 10 to 4.

When you want truly uniform random numbers, then encrypting with AES-128, then truncating, is best. If you want invertible encryption, then you should encrypt a counter and either use a 32-bit addition or a 32-bit XOR for encrypting the 32-bit number. With a single AES-128 invocation for generating a random mask, you can encrypt four 32-bit numbers.

Of course, when speed does not matter, you can use pretty much any of the historical block ciphers, because the security requirements for encrypting 32-bit numbers are very low, since they are easier to find by brute force searching than by attempting to break any kind of encryption.


The problem is that AES needs a 128-bit block.

Imagine that you want to obfuscate your order numbers in the database so that customers can't infer the volume of business by checking the order number.

You can use UUIDs, but you also want to keep the numbers short so they can be dictated over the phone. You can use random IDs, but then you need to lock them in the database during the object creation otherwise you might get collisions.

Or perhaps you have a distributed system and want to allocate a bunch of IDs for each node in advance.

RC-5 with its small block size neatly solves this. You can have a strictly increasing database sequence, and then just give out each node a bunch of numbers ("from 100 to 120"), and nodes can then generate IDs based on them. And there can be no collisions with other nodes because symmetric ciphers are obviously bijective.

For these kinds of use-cases performance is not an issue.


Standard AES needs a 128-bit block.

However, the AES mixing function is made from a 32-bit mixing function that is extended to blocks whose lengths are multiples of 32-bit by composing it with a byte permutation.

The standard byte permutation extends the block size from 32-bit to 128-bit, but with the current AES instructions of CPUs you can either cancel the byte permutation to get a 32-bit block size or you can replace it with a non-standard byte permutation, to get any block size that is a multiple of 32-bit.

If you cancel the byte permutation, four 32-bit words are encrypted independently, instead of one 128-bit block. This doubles the number of instructions, but it does not necessarily reduce the throughput, as the instructions may be executed concurrently, so the throughput measured in encrypted numbers will increase by a factor between 2 and 4, depending on the CPU.


Right, but balanced and unbalanced Feistel networks let you turn the 16-byte AES block into an arbitrarily small PRF.
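As a sketch of that idea: a balanced Feistel network over 32 bits, with a keyed hash standing in for AES as the round function (names, round count, and the round function are illustrative; NIST FF1/FF3 specify the real constructions):

```python
import hashlib

def _f(key, i, half):
    # Round function sketch: in a real FPE mode this would be built on
    # AES; a keyed hash stands in here to keep the example stdlib-only.
    d = hashlib.sha256(key + bytes([i]) + half.to_bytes(2, "big")).digest()
    return int.from_bytes(d[:2], "big")

def feistel32_encrypt(x, key, rounds=10):
    l, r = x >> 16, x & 0xFFFF
    for i in range(rounds):
        l, r = r, l ^ _f(key, i, r)
    return (l << 16) | r

def feistel32_decrypt(y, key, rounds=10):
    # Run the rounds in reverse; the Feistel structure is invertible
    # even though the round function itself is not.
    l, r = y >> 16, y & 0xFFFF
    for i in reversed(range(rounds)):
        l, r = r ^ _f(key, i, l), l
    return (l << 16) | r
```

The same skeleton scales to any even split of the target width, which is how a 128-bit primitive gets repurposed for arbitrarily small domains.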

Well, yes. But at this point you're just making a new cipher with AES as the round function. And I think it should be at least as safe as the round function?

I have not checked lately, but is it actually the recommendation for format-preserving encryption?


I don't know about "the" recommendation, but it's how NIST standardized it (FF1 and FF3, both Feistel networks) and in the NIST rubric these aren't "new ciphers"; they're "block cipher modes".

I'll say I'm more comfortable using a straightforward FPE block cipher mode with AES than I am repurposing a weaker lightweight cipher to take advantage of its 32-bit block size.


Thanks! It indeed makes sense.

I used the RC-5 cipher around 2015 to do that ID generation trick (and it's still in place in AWS, as far as I can see), and there was no NIST standard back then. It also was not a really sensitive application; we just wanted to make IDs opaque.


Funny your example is rc5, I wrote exactly what you describe to generate 32-bit cookies in a random prototype a few years ago: https://github.com/jcalvinowens/sdvr/blob/main/rc5.c

It is cute, but surely there's a more efficient way than RC5? There are bijective hash functions which are much cheaper (murmur, at least).


In my case, performance was utterly unimportant.

But is Murmur actually bijective?


Mine too, I was just curious.

I recall empirically determining murmur was bijective across all 32-bit inputs, but I can't find that written down anywhere.


Slightly unrelated, but aren't these AES-specific custom CPU instructions just a way to easily collect the encryption keys? There is a speedup but is it worth the risks?

If I were a nation state actor, I'd just store the encryption keys supplied to the AES CPU instruction somewhere and in case the data needs to be accessed you just read the stored keys.

No need to waste time deploying a backdoored CPU firmware and wait for days or weeks, and then touch the hardware a second time to extract the information.

When all AES encryption keys are already stored somewhere on the CPU, you can easily do a drive-by readout at any point in time.

The Linux kernel has a compile-time flag to disable the use of custom CPU instructions for encryption, but it can't be disabled at runtime. If "software encryption" is used, the nation-state actor needs to physically access the device at least twice, or use a network-based exploit, which could be logged.


There are serious risks about backdoors in CPUs, but they are not about the CPU gathering the AES keys.

The storage required for this would be humongous, and the CPU cannot know for which data the keys have been used. Moreover, this would too easily be defeated: even though the AES instructions allow you to supply derived round keys directly, you can always decline to do this and use a separate XOR instruction for combining the round keys with the intermediate states. Detecting such a use would be far too difficult.

No, there is no basis for fearing that the AES keys can be stored in CPUs (on the other hand, you should fear that if you store keys in a TPM, they might never be erased, even if you demand this).

The greatest possible danger of adversarial behavior of a CPU exists in the laptop CPUs with integrated WiFi interfaces made by Intel. Unless you disconnect the WiFi antennas, it is impossible to be certain that the remote management feature of the WiFi interface is really disabled, preventing an attacker from taking control of the laptop in a manner that cannot be detected by the operating system.

The next danger in importance is in computers that have Ethernet interfaces with the ability to do remote management, where again it is impossible to be certain that this feature is disabled. (A workaround for the case when you connect such a computer to an untrusted network, e.g. directly to the Internet, is to use a USB Ethernet interface.)


I am not a chip designer, but from my limited understanding, this "somewhere" is the problem. You can have secret memory somewhere that isn't noticed by analysts, but can it remain secret if it is as big as half the CPU? A quarter? How much storage can you fit in that die space? How many AES keys do you handle per day? Per hour of browsing HN with AES TLS ciphers? (Literally all ciphers supported by HN involve AES.)

We use memory-hard algorithms for password storage because memory is more expensive than compute. More specifically, it's die area that is costly, but at least the authors of Argon2 seem to equate the two. (If that's not correct, I based a stackoverflow post or two on that paper so please let me know.) It sounds to me like it's easily visible to a microscope when there's another storage area as large as the L1 cache (which can hold a few thousand keys at most... how to decide which ones to keep)

Of course, the cpu is theoretically omnipotent within your hardware. It can read the RAM and see "ah, you're running pgp.exe, let me store this key", but then you could say the same for any key that your cpu handles (also rsa or anything not using special cpu instructions)


Good points, but it might be mitigated by knowing that the first key after boot is for HDD encryption; and if storage is limited, then keep a counter for each key and always overwrite the least frequently observed key.

Could work. How do you know what the least-frequently-used key is if you can't store them, though? You'd need some heuristics. Maybe it could store the first five keys it sees after each power-on, or some other useful heuristic.

Like, I do take your point, but it does seem quite involved for the chance that it'll get them something useful; they still need to gain physical access to the intact device, and trust that it never gets out or the chipmaker's reputation is instantly trash and potentially bankrupt. And we know from the Snowden documents that, at least in ~2013 (when the AES extensions weren't new, afaik), they couldn't decrypt certain ciphers, which is sorta conspicuous if we have these suspicions. It's a legit concern or thing to consider, but perhaps not for the average use-case.

edit: nvm, it was proposed in 2008, so that it didn't show up yet in ~2013 publications is not too surprising. There might still be a general point that 'they' haven't (or hadn't) infiltrated most CPUs in general.


I think Linux/LUKS software encryption was a very big challenge, and they solved it with multiple approaches:

- 2004: Linux LUKS disk encryption [0]

- 2008: ring −3 / intel management engine [1]

- 2009: TPM [3]

- 2010: AES instruction set [2]

[0] https://en.wikipedia.org/wiki/Linux_Unified_Key_Setup

[1] https://en.wikipedia.org/wiki/Intel_Management_Engine

[2] https://en.wikipedia.org/wiki/AES_instruction_set

[3] https://en.wikipedia.org/wiki/Trusted_Platform_Module


I don't imagine it would be too difficult to snoop the instruction stream to identify a software implementation of AES and yoink the keys from it, at least if the implementation isn't obfuscated. If your threat model includes an adversarial CPU then you probably need to at least obfuscate your implementation, if not entirely offload the crypto to somewhere you trust.

Yes but it's much easier to tell devs "put your keys here" and then just take that.

We’re talking about a hidden CPU backdoor that would let you secretly come in and retrieve keys you’ve squirreled away somewhere. I don’t think finding the keys is the hard part.


