These Keccak-derived hashes are my personal favorites. The Keccak sponge function is a more advanced mathematical foundation for hashing than what we had before, and the individual mixing functions were carefully chosen to do very different things. My favorite attribute, however, is that all the cryptanalysis done on SHA3 maps directly to Keccak (unlike BLAKE's follow-ons).
That is less than 4 of the 32 software-visible vector registers of an AMD Zen 4 or Zen 5 CPU, or of the future Intel CPUs that will reintroduce AVX-512.
There is no difficulty in defining AVX-512 instructions that would operate on a hash state of this size.
The actual number of 64-bit registers in a modern CPU is well above one thousand, and the SHA-3 functions are very efficient to implement in hardware, so adding instructions for these hashes would have a very modest cost.
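To make the register arithmetic concrete, here is a tiny sketch. It assumes the state being discussed is the 1600-bit Keccak-f[1600] state (25 lanes of 64 bits) and that the vector registers are 512 bits wide, as with AVX-512; the variable names are only for illustration.

```python
# Assumption: the hash state is the 1600-bit Keccak-f[1600] state.
LANES = 25          # 5 x 5 lanes
LANE_BITS = 64      # each lane is one 64-bit word
STATE_BITS = LANES * LANE_BITS

ZMM_BITS = 512      # width of one AVX-512 vector register
ZMM_COUNT = 32      # software-visible AVX-512 registers

registers_used = STATE_BITS / ZMM_BITS
print(STATE_BITS)                   # 1600
print(registers_used)               # 3.125, i.e. less than 4 registers
print(registers_used / ZMM_COUNT)   # fraction of the register file taken by the state
```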
The forms of Keccak that were initially standardized in SHA-3 were secure but significantly slower than necessary.
As a consequence, many applications where speed is important have preferred the existing, very efficient implementations of BLAKE2b-512 or of the faster but less secure BLAKE3, or SHA-256 or SHA-512 on the CPUs where these are implemented in hardware.
However, it is also possible to use Keccak in modes of operation where it is as fast as or faster than any other comparable hash (e.g. by using parallelizable tree hashing, like the BLAKE derivatives). Previously these modes were less well known, because they were not standardized and because the existing reference implementations were less polished than those of the BLAKE derivatives.
Now that they are included in standards like this RFC, one can hope that these good, secure hashes will become more widely available.
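For anyone wondering what "parallelizable tree hashing" looks like, here is a minimal sketch built on SHAKE128 from Python's `hashlib`. It is not KangarooTwelve and will not produce compatible digests: the chunk size, the chaining-value size, the domain-separation bytes and the names `leaf_cv` and `tree_hash` are all assumptions made up for the illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 8192  # leaf size in bytes (illustrative choice)
CV_SIZE = 32       # bytes of chaining value kept per leaf (illustrative choice)

def leaf_cv(chunk: bytes) -> bytes:
    # Each leaf is hashed on its own, so leaves do not depend on each other.
    # The 0x00 prefix separates leaf inputs from the final node's input.
    return hashlib.shake_128(b"\x00" + chunk).digest(CV_SIZE)

def tree_hash(data: bytes, out_len: int = 32) -> bytes:
    # Split the message into fixed-size leaves (one empty leaf for empty input).
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
    # Hash the leaves through a thread pool; how much actually overlaps
    # depends on the Python build and on the size of the chunks.
    with ThreadPoolExecutor() as pool:
        cvs = list(pool.map(leaf_cv, chunks))
    # The final node binds the leaf count and every chaining value together.
    final_input = b"\x01" + len(chunks).to_bytes(8, "big") + b"".join(cvs)
    return hashlib.shake_128(final_input).digest(out_len)

digest = tree_hash(b"a" * 100_000)
print(digest.hex())
```

The point is only the shape: the expensive work happens in independent leaves, and a single cheap final call combines them, which is what lets SIMD lanes or multiple cores be kept busy.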
Recent ARM-based CPUs have instructions for the core functions of Keccak, while on AMD/Intel CPUs with AVX-512 Keccak is rather fast even without dedicated instructions. Therefore on such CPUs KangarooTwelve and TurboSHAKE can be very fast right now, given an appropriate implementation.
For instance, I use BLAKE2b-512 for file integrity checking, frequently (i.e. at least a few times per day) running it over hundreds of GB or over many TB of data. Now that I have an AVX-512 capable CPU (a Zen 5), I should experiment with implementing an optimized KangarooTwelve, because it should be much faster on such a CPU.
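As a rough idea of what such a file integrity check looks like, here is a minimal sketch using the BLAKE2b implementation in Python's `hashlib`. The function name `blake2b_512_file` and the 1 MiB read size are my own choices for the example, not a description of the setup mentioned above.

```python
import hashlib

def blake2b_512_file(path: str, buf_size: int = 1 << 20) -> str:
    """Return the hex BLAKE2b-512 digest of a file, read in fixed-size blocks."""
    h = hashlib.blake2b(digest_size=64)  # 64 bytes = 512-bit digest
    with open(path, "rb") as f:
        # Stream the file so memory use stays flat even for very large files.
        while block := f.read(buf_size):
            h.update(block)
    return h.hexdigest()
```

Storing the hex digest next to each file and comparing it on later runs is enough to detect silent corruption; a faster hash would plug into the same loop.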
If you want to try an optimized AVX-512 implementation of KangarooTwelve on the command line, you can `cargo install k12sum`. On my machine it's neck-and-neck with `b3sum --no-mmap` (which does not use threads).
Anywhere you need a high-assurance, high-speed hash function. And because of the sponge design, it can be the heart of lots of cryptographic protocols.
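One visible consequence of the sponge design is that the output is a stream of any length, which is part of why it drops into protocols so easily. A small sketch with SHAKE128 from Python's `hashlib` (standing in for TurboSHAKE or KangarooTwelve, which `hashlib` does not provide); the `derive` helper and its labels are hypothetical, just to show one input being stretched into several independent keys.

```python
import hashlib

msg = b"example input"

# Squeezing more output extends the shorter output instead of changing it.
short_out = hashlib.shake_128(msg).digest(16)
long_out = hashlib.shake_128(msg).digest(64)
print(long_out[:16] == short_out)  # True

def derive(label: bytes, secret: bytes, n: int) -> bytes:
    # Hypothetical helper: a length-prefixed label keeps different uses of
    # the same secret from colliding (the label must be under 256 bytes).
    prefix = len(label).to_bytes(1, "big") + label
    return hashlib.shake_128(prefix + secret).digest(n)

enc_key = derive(b"encryption", b"shared secret", 32)
mac_key = derive(b"authentication", b"shared secret", 32)
print(enc_key != mac_key)  # True
```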
Edit: Oh it looks like another option, `KeccakSum`, was released a couple months ago? https://github.com/XKCP/K12/commit/5271b58c990c1ac33c1097b4e...