Pull crypto updates from Herbert Xu:
"API:
- Rewrite memcpy_sglist from scratch
- Add on-stack AEAD request allocation
- Fix partial block processing in ahash
Algorithms:
- Remove ansi_cprng
- Remove tcrypt tests for poly1305
- Fix EINPROGRESS processing in authenc
- Fix double-free in zstd
Drivers:
- Use drbg ctr helper when reseeding xilinx-trng
- Add support for PCI device 0x115A to ccp
- Add support of paes in caam
- Add support for aes-xts in dthev2
Others:
- Use likely in rhashtable lookup
- Fix lockdep false-positive in padata by removing a helper"
* tag 'v6.19-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (71 commits)
crypto: zstd - fix double-free in per-CPU stream cleanup
crypto: ahash - Zero positive err value in ahash_update_finish
crypto: ahash - Fix crypto_ahash_import with partial block data
crypto: lib/mpi - use min() instead of min_t()
crypto: ccp - use min() instead of min_t()
hwrng: core - use min3() instead of nested min_t()
crypto: aesni - ctr_crypt() use min() instead of min_t()
crypto: drbg - Delete unused ctx from struct sdesc
crypto: testmgr - Add missing DES weak and semi-weak key tests
Revert "crypto: scatterwalk - Move skcipher walk and use it for memcpy_sglist"
crypto: scatterwalk - Fix memcpy_sglist() to always succeed
crypto: iaa - Request to add Kanchana P Sridhar to Maintainers.
crypto: tcrypt - Remove unused poly1305 support
crypto: ansi_cprng - Remove unused ansi_cprng algorithm
crypto: asymmetric_keys - fix uninitialized pointers with free attribute
KEYS: Avoid -Wflex-array-member-not-at-end warning
crypto: ccree - Correctly handle return of sg_nents_for_len
crypto: starfive - Correctly handle return of sg_nents_for_len
crypto: iaa - Fix incorrect return value in save_iaa_wq()
crypto: zstd - Remove unnecessary size_t cast
...
Pull arm64 FPSIMD on-stack buffer updates from Eric Biggers:
"This is a core arm64 change. However, I was asked to take this because
most uses of kernel-mode FPSIMD are in crypto or CRC code.
In v6.8, the size of task_struct on arm64 increased by 528 bytes due
to the new 'kernel_fpsimd_state' field. This field was added to allow
kernel-mode FPSIMD code to be preempted.
Unfortunately, 528 bytes is kind of a lot for task_struct. This
regression in the task_struct size was noticed and reported.
Recover that space by making this state be allocated on the stack at
the beginning of each kernel-mode FPSIMD section.
To make it easier for all the users of kernel-mode FPSIMD to do that
correctly, introduce and use a 'scoped_ksimd' abstraction"
* tag 'fpsimd-on-stack-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (23 commits)
lib/crypto: arm64: Move remaining algorithms to scoped ksimd API
lib/crypto: arm/blake2b: Move to scoped ksimd API
arm64/fpsimd: Allocate kernel mode FP/SIMD buffers on the stack
arm64/fpu: Enforce task-context only for generic kernel mode FPU
net/mlx5: Switch to more abstract scoped ksimd guard API on arm64
arm64/xorblocks: Switch to 'ksimd' scoped guard API
crypto/arm64: sm4 - Switch to 'ksimd' scoped guard API
crypto/arm64: sm3 - Switch to 'ksimd' scoped guard API
crypto/arm64: sha3 - Switch to 'ksimd' scoped guard API
crypto/arm64: polyval - Switch to 'ksimd' scoped guard API
crypto/arm64: nhpoly1305 - Switch to 'ksimd' scoped guard API
crypto/arm64: aes-gcm - Switch to 'ksimd' scoped guard API
crypto/arm64: aes-blk - Switch to 'ksimd' scoped guard API
crypto/arm64: aes-ccm - Switch to 'ksimd' scoped guard API
raid6: Move to more abstract 'ksimd' guard API
crypto: aegis128-neon - Move to more abstract 'ksimd' guard API
crypto/arm64: sm4-ce-gcm - Avoid pointless yield of the NEON unit
crypto/arm64: sm4-ce-ccm - Avoid pointless yield of the NEON unit
crypto/arm64: aes-ce-ccm - Avoid pointless yield of the NEON unit
lib/crc: Switch ARM and arm64 to 'ksimd' scoped guard API
...
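The idea of a scoped guard that carries its own on-stack state buffer can be modelled in userspace C. This is a minimal sketch, assuming GCC or Clang (it relies on the `cleanup` variable attribute). Every name in it (`fake_simd_state`, `simd_guard`, `scoped_fake_simd()`, and so on) is hypothetical, and it is not the kernel's `scoped_ksimd()` definition:

```c
/* Hypothetical stand-in for the saved FP/SIMD register state (528 bytes). */
struct fake_simd_state {
	unsigned char regs[528];
};

struct simd_guard {
	struct fake_simd_state state;	/* lives in the caller's stack frame */
	int done;			/* makes the for-loop body run exactly once */
};

static int simd_depth;		/* number of currently open sections */
static int depth_seen_inside;	/* recorded by the demo function below */

static struct simd_guard simd_guard_enter(void)
{
	struct simd_guard g = { .done = 0 };

	simd_depth++;		/* models "begin kernel-mode SIMD" */
	return g;
}

static void simd_guard_exit(struct simd_guard *g)
{
	(void)g;
	simd_depth--;		/* models "end kernel-mode SIMD" */
}

/* The guard is released when the loop variable goes out of scope. */
#define scoped_fake_simd()						\
	for (struct simd_guard guard					\
		     __attribute__((cleanup(simd_guard_exit))) =	\
		     simd_guard_enter();				\
	     !guard.done; guard.done = 1)

static int sum_in_simd_section(const int *v, int n)
{
	int sum = 0;

	scoped_fake_simd() {
		depth_seen_inside = simd_depth;
		if (n <= 0)
			return -1;	/* early return still runs the cleanup */
		for (int i = 0; i < n; i++)
			sum += v[i];
	}
	return sum;
}
```

Because the state buffer is a member of the guard object, it only occupies stack space while a section is open, and the end call cannot be forgotten on an early return.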
Pull 'at_least' array size update from Eric Biggers:
"C supports lower bounds on the sizes of array parameters, using the
static keyword as follows: 'void f(int a[static 32]);'. This allows
the compiler to warn about a too-small array being passed.
As discussed, this reuse of the 'static' keyword, while standard, is a
bit obscure. Therefore, add an alias 'at_least' to compiler_types.h.
Then, add this 'at_least' annotation to the array parameters of
various crypto library functions"
* tag 'libcrypto-at-least-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: sha2: Add at_least decoration to fixed-size array params
lib/crypto: sha1: Add at_least decoration to fixed-size array params
lib/crypto: poly1305: Add at_least decoration to fixed-size array params
lib/crypto: md5: Add at_least decoration to fixed-size array params
lib/crypto: curve25519: Add at_least decoration to fixed-size array params
lib/crypto: chacha: Add at_least decoration to fixed-size array params
lib/crypto: chacha20poly1305: Statically check fixed array lengths
compiler_types: introduce at_least parameter decoration pseudo keyword
wifi: iwlwifi: trans: rename at_least variable to min_mode
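The array-parameter syntax described in the 'at_least' pull message can be tried in standalone C. A minimal sketch, assuming the alias is a plain `#define` of `static` as the message suggests; `KEY_LEN`, `sum_key()` and `demo()` are hypothetical names used only for illustration:

```c
#include <stddef.h>

/* Assumption: a simple alias for the C99 'static' array-size qualifier. */
#define at_least static

#define KEY_LEN 32

/* The callee may rely on key[] having at least KEY_LEN elements. */
static unsigned int sum_key(const unsigned char key[at_least KEY_LEN])
{
	unsigned int sum = 0;

	for (size_t i = 0; i < KEY_LEN; i++)
		sum += key[i];
	return sum;
}

static unsigned int demo(void)
{
	unsigned char key[KEY_LEN];

	for (size_t i = 0; i < KEY_LEN; i++)
		key[i] = 1;

	/*
	 * Passing an array declared as 'unsigned char k[KEY_LEN - 1]' here
	 * instead is the kind of call a compiler can warn about.
	 */
	return sum_key(key);
}
```

This form is valid C only (not C++), and whether a diagnostic is actually emitted depends on the compiler and the enabled warnings.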
Pull crypto library test updates from Eric Biggers:
- Add KUnit test suites for SHA-3, BLAKE2b, and POLYVAL. These are the
algorithms that have new crypto library interfaces this cycle.
- Remove the crypto_shash POLYVAL tests. They're no longer needed
because POLYVAL support was removed from crypto_shash. Better POLYVAL
test coverage is now provided via the KUnit test suite.
* tag 'libcrypto-tests-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
crypto: testmgr - Remove polyval tests
lib/crypto: tests: Add KUnit tests for POLYVAL
lib/crypto: tests: Add additional SHAKE tests
lib/crypto: tests: Add SHA3 kunit tests
lib/crypto: tests: Add KUnit tests for BLAKE2b
Pull crypto library updates from Eric Biggers:
"This is the main crypto library pull request for 6.19. It includes:
- Add SHA-3 support to lib/crypto/, including support for both the
hash functions and the extendable-output functions. Reimplement the
existing SHA-3 crypto_shash support on top of the library.
This is motivated mainly by the upcoming support for the ML-DSA
signature algorithm, which needs the SHAKE128 and SHAKE256
functions. But even on its own it's a useful cleanup.
This also fixes the longstanding issue where the
architecture-optimized SHA-3 code was disabled by default.
- Add BLAKE2b support to lib/crypto/, and reimplement the existing
BLAKE2b crypto_shash support on top of the library.
This is motivated mainly by btrfs, which supports BLAKE2b
checksums. With this change, all btrfs checksum algorithms now have
library APIs. btrfs is planned to start just using the library
directly.
This refactor also improves consistency between the BLAKE2b code
and BLAKE2s code. And as usual, it also fixes the issue where the
architecture-optimized BLAKE2b code was disabled by default.
- Add POLYVAL support to lib/crypto/, replacing the existing POLYVAL
support in crypto_shash. Reimplement HCTR2 on top of the library.
This simplifies the code and improves HCTR2 performance. As usual,
it also makes the architecture-optimized code be enabled by
default. The generic implementation of POLYVAL is greatly improved
as well.
- Clean up the BLAKE2s code
- Add FIPS self-tests for SHA-1, SHA-2, and SHA-3"
* tag 'libcrypto-updates-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (37 commits)
fscrypt: Drop obsolete recommendation to enable optimized POLYVAL
crypto: polyval - Remove the polyval crypto_shash
crypto: hctr2 - Convert to use POLYVAL library
lib/crypto: x86/polyval: Migrate optimized code into library
lib/crypto: arm64/polyval: Migrate optimized code into library
lib/crypto: polyval: Add POLYVAL library
crypto: polyval - Rename conflicting functions
lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs
lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value
lib/crypto: x86/blake2s: Improve readability
lib/crypto: x86/blake2s: Use local labels for data
lib/crypto: x86/blake2s: Drop check for nblocks == 0
lib/crypto: x86/blake2s: Fix 32-bit arg treated as 64-bit
lib/crypto: arm, arm64: Drop filenames from file comments
lib/crypto: arm/blake2s: Fix some comments
crypto: s390/sha3 - Remove superseded SHA-3 code
crypto: sha3 - Reimplement using library API
crypto: jitterentropy - Use default sha3 implementation
lib/crypto: s390/sha3: Add optimized one-shot SHA-3 digest functions
lib/crypto: sha3: Support arch overrides of one-shot digest functions
...
min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'.
Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long'
and so cannot discard significant bits.
In this case the 'unsigned long' value is small enough that the result
is ok.
Detected by an extra check added to min_t().
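The truncation hazard can be reproduced in plain C. A minimal userspace sketch, in which `uint64_t` stands in for 'unsigned long' on a 64-bit kernel and `uint32_t` for 'unsigned int'; `MIN_T`, `min_truncating()` and `min_promoted()` are simplified, hypothetical stand-ins and not the kernel's min_t()/min() definitions:

```c
#include <stdint.h>

/* Simplified stand-in for min_t(): cast both operands, then compare. */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Casting the 64-bit operand down to 32 bits discards its upper half. */
static uint32_t min_truncating(uint32_t a, uint64_t b)
{
	return MIN_T(uint32_t, a, b);
}

/* Comparing without a cast promotes the 32-bit operand to 64 bits. */
static uint64_t min_promoted(uint32_t a, uint64_t b)
{
	return a < b ? a : b;
}
```

With a = 100 and b = 0x100000001, the casting variant returns 1 (the low 32 bits of b), while the promoting variant returns 100. For small values of b, as in this commit, both agree.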
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Several parameters of the chacha20poly1305 functions require arrays of
an exact length. Use the new at_least keyword to instruct gcc and
clang to statically check that the caller is passing an object of at
least that length.
Here it is in action, with this faulty patch to wireguard's cookie.h:
     struct cookie_checker {
             u8 secret[NOISE_HASH_LEN];
    -        u8 cookie_encryption_key[NOISE_SYMMETRIC_KEY_LEN];
    +        u8 cookie_encryption_key[NOISE_SYMMETRIC_KEY_LEN - 1];
             u8 message_mac1_key[NOISE_SYMMETRIC_KEY_LEN];
If I try compiling this code, I get this helpful warning:
      CC      drivers/net/wireguard/cookie.o
    drivers/net/wireguard/cookie.c: In function ‘wg_cookie_message_create’:
    drivers/net/wireguard/cookie.c:193:9: warning: ‘xchacha20poly1305_encrypt’ reading 32 bytes from a region of size 31 [-Wstringop-overread]
      193 |         xchacha20poly1305_encrypt(dst->encrypted_cookie, cookie, COOKIE_LEN,
          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      194 |                                   macs->mac1, COOKIE_LEN, dst->nonce,
          |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      195 |                                   checker->cookie_encryption_key);
          |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    drivers/net/wireguard/cookie.c:193:9: note: referencing argument 7 of type ‘const u8 *’ {aka ‘const unsigned char *’}
    In file included from drivers/net/wireguard/messages.h:10,
                     from drivers/net/wireguard/cookie.h:9,
                     from drivers/net/wireguard/cookie.c:6:
    include/crypto/chacha20poly1305.h:28:6: note: in a call to function ‘xchacha20poly1305_encrypt’
       28 | void xchacha20poly1305_encrypt(u8 *dst, const u8 *src, const size_t src_len,
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: "Jason A. Donenfeld" <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/20251123054819.2371989-4-Jason@zx2c4.com
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Fully initialize *ctx, including the buf field which sha256_init()
doesn't initialize, to avoid a KMSAN warning when comparing *ctx to
orig_ctx. This KMSAN warning slipped in while KMSAN was not working
reliably due to a stackdepot bug, which has now been fixed.
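The pattern behind this fix can be sketched in userspace C. This is an illustrative model only: `toy_ctx`, `toy_init()` and `ctx_unchanged_after_noop()` are hypothetical names, not the actual test code or the real SHA-256 context layout:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical context, loosely shaped like a hash context. */
struct toy_ctx {
	uint32_t state[8];
	uint64_t bytecount;
	uint8_t buf[64];
};

/* Like many init functions, this deliberately leaves buf[] untouched. */
static void toy_init(struct toy_ctx *ctx)
{
	for (int i = 0; i < 8; i++)
		ctx->state[i] = (uint32_t)i;
	ctx->bytecount = 0;
}

static int ctx_unchanged_after_noop(void)
{
	struct toy_ctx ctx, orig;

	/* Fully initialize first, so the memcmp() below reads no uninit bytes. */
	memset(&ctx, 0, sizeof(ctx));
	toy_init(&ctx);
	memcpy(&orig, &ctx, sizeof(ctx));

	/* ... an operation expected to leave ctx untouched would run here ... */

	return memcmp(&ctx, &orig, sizeof(ctx)) == 0;
}
```

Without the memset(), the whole-struct comparison would read the never-written buf[] bytes, which is the kind of access an uninitialized-memory checker reports.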
Fixes: 6733968be7 ("lib/crypto: tests: Add tests and benchmark for sha256_finup_2x()")
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251121033431.34406-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Move the arm64 implementations of SHA-3 and POLYVAL to the newly
introduced scoped ksimd API, which replaces kernel_neon_begin() and
kernel_neon_end(). On arm64, this is needed because the latter API
will change in an incompatible manner.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Even though ARM's versions of kernel_neon_begin()/_end() are not being
changed, update the newly migrated ARM blake2b to the scoped ksimd API
so that all ARM and arm64 code in lib/crypto remains consistent in this
regard.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Pull scoped ksimd API for ARM and arm64 from Ard Biesheuvel:
"Introduce a more strict replacement API for
kernel_neon_begin()/kernel_neon_end() on both ARM and arm64, and
replace occurrences of the latter pair appearing in lib/crypto"
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Before modifying the prototypes of kernel_neon_begin() and
kernel_neon_end() to accommodate kernel mode FP/SIMD state buffers
allocated on the stack, move arm64 to the new 'ksimd' scoped guard API,
which encapsulates the calls to those functions.
For symmetry, do the same for 32-bit ARM too.
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Add a test suite for the POLYVAL library, including:
- All the standard tests and the benchmark from hash-test-template.h
- Comparison with a test vector from the RFC
- Test with key and message containing all one bits
- Additional tests related to the key struct
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251109234726.638437-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Add the following test cases to cover gaps in the SHAKE testing:
- test_shake_all_lens_up_to_4096()
- test_shake_multiple_squeezes()
- test_shake_with_guarded_bufs()
Remove test_shake256_tiling() and test_shake256_tiling2() since they are
superseded by test_shake_multiple_squeezes(). It provides better test
coverage by using randomized testing. E.g., it's able to generate a
zero-length squeeze followed by a nonzero-length squeeze, which the
first 7 versions of the SHA-3 patchset handled incorrectly.
Tested-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251026055032.1413733-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Add a SHA3 kunit test suite, providing the following:
(*) A simple test of each of SHA3-224, SHA3-256, SHA3-384, SHA3-512,
SHAKE128 and SHAKE256.
(*) NIST 0- and 1600-bit test vectors for SHAKE128 and SHAKE256.
(*) Output tiling (multiple squeezing) tests for SHAKE256.
(*) Standard hash template test for SHA3-256. To make this possible,
gen-hash-testvecs.py is modified to support sha3-256.
(*) Standard benchmark test for SHA3-256.
[EB: dropped some unnecessary changes to gen-hash-testvecs.py, moved
addition of Testing section in doc file into this commit, and
other small cleanups]
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Harald Freudenberger <freude@linux.ibm.com>
Link: https://lore.kernel.org/r/20251026055032.1413733-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Migrate the x86_64 implementation of POLYVAL into lib/crypto/, wiring it
up to the POLYVAL library interface. This makes the POLYVAL library be
properly optimized on x86_64.
This drops the x86_64 optimizations of polyval in the crypto_shash API.
That's fine, because polyval will be removed from crypto_shash entirely;
it is unneeded there. But even if it comes back, the crypto_shash
API could just be implemented on top of the library API, as usual.
Adjust the names and prototypes of the assembly functions to align more
closely with the rest of the library code.
Also replace a movaps instruction with movups to remove the assumption
that the key struct is 16-byte aligned. Users can still align the key
if they want (and at least in this case, movups is just as fast as
movaps), but it's inconvenient to require it.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251109234726.638437-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Migrate the arm64 implementation of POLYVAL into lib/crypto/, wiring it
up to the POLYVAL library interface. This makes the POLYVAL library be
properly optimized on arm64.
This drops the arm64 optimizations of polyval in the crypto_shash API.
That's fine, because polyval will be removed from crypto_shash entirely;
it is unneeded there. But even if it comes back, the crypto_shash
API could just be implemented on top of the library API, as usual.
Adjust the names and prototypes of the assembly functions to align more
closely with the rest of the library code.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251109234726.638437-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Add support for POLYVAL to lib/crypto/.
This will replace the polyval crypto_shash algorithm and its use in the
hctr2 template, simplifying the code and reducing overhead.
Specifically, this commit introduces the POLYVAL library API and a
generic implementation of it. Later commits will migrate the existing
architecture-optimized implementations of POLYVAL into lib/crypto/ and
add a KUnit test suite.
I've also rewritten the generic implementation completely, using a more
modern approach instead of the traditional table-based approach. It's
now constant-time, requires no precomputation or dynamic memory
allocations, decreases the per-key memory usage from 4096 bytes to 16
bytes, and is faster than the old polyval-generic even on bulk data
reusing the same key (at least on x86_64, where I measured 15% faster).
We should do this for GHASH too, but for now just do it for POLYVAL.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251109234726.638437-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
AVX-512 supports 3-input XORs via the vpternlogd (or vpternlogq)
instruction with immediate 0x96. This approach, vs. the alternative of
two vpxor instructions, is already used in the CRC, AES-GCM, and AES-XTS
code, since it reduces the instruction count and is faster on some CPUs.
Make blake2s_compress_avx512() take advantage of it too.
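Why immediate 0x96 means "3-input XOR" can be checked with a bit-level model of a ternary-logic operation. A minimal sketch in portable C; `ternlog32()` is a hypothetical helper that models the truth-table lookup on one 32-bit lane and is not an intrinsic or kernel function:

```c
#include <stdint.h>

/*
 * For each bit position, the three input bits form an index (a is the most
 * significant bit) into the 8-entry truth table held in imm8.
 */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
	uint32_t out = 0;

	for (int bit = 0; bit < 32; bit++) {
		unsigned int idx = (((a >> bit) & 1) << 2) |
				   (((b >> bit) & 1) << 1) |
				   ((c >> bit) & 1);

		out |= (uint32_t)((imm8 >> idx) & 1) << bit;
	}
	return out;
}
```

0x96 is 10010110 in binary, so the table is set exactly at indices 1, 2, 4 and 7, the inputs with an odd number of one bits. That is the truth table of a ^ b ^ c.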
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251102234209.62133-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Just before returning, blake2s_compress_ssse3() and
blake2s_compress_avx512() store updated values to the 'h', 't', and 'f'
fields of struct blake2s_ctx. But 'f' is always unchanged (which is
correct; only the C code changes it). So, there's no need to write to
'f'. Use 64-bit stores (movq and vmovq) instead of 128-bit stores
(movdqu and vmovdqu) so that only 't' is written.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251102234209.62133-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Various cleanups for readability. No change to the generated code:
- Add some comments
- Add #defines for arguments
- Rename some labels
- Use decimal constants instead of hex where it makes sense.
(The pshufd immediates intentionally remain as hex.)
- Add blank lines when there's a logical break
The round loop still could use some work, but this is at least a start.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251102234209.62133-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Since blake2s_compress() is always passed nblocks != 0, remove the
unnecessary check for nblocks == 0 from blake2s_compress_ssse3().
Note that this makes it consistent with blake2s_compress_avx512() in the
same file as well as the arm32 blake2s_compress().
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251102234209.62133-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
In the C code, the 'inc' argument to the assembly functions
blake2s_compress_ssse3() and blake2s_compress_avx512() is declared with
type u32, matching blake2s_compress(). The assembly code then reads it
from the 64-bit %rcx. However, the ABI doesn't guarantee zero-extension
to 64 bits, nor do gcc or clang guarantee it. Therefore, fix these
functions to read this argument from the 32-bit %ecx.
In theory, this bug could have caused the wrong 'inc' value to be used,
causing incorrect BLAKE2s hashes. In practice, probably not: I've fixed
essentially this same bug in many other assembly files too, but there's
never been a real report of it having caused a problem. On x86_64, all
writes to 32-bit registers are zero-extended to 64 bits. That results
in zero-extension in nearly all situations. I've only been able to
demonstrate a lack of zero-extension with a somewhat contrived example
involving truncation, e.g. when the C code has a u64 variable holding
0x1234567800000040 and passes it as a u32 expecting it to be truncated
to 0x40 (64). But that's not what the real code does, of course.
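The contrived case can be modelled in C by treating a 64-bit integer as the register contents. A rough sketch only; `inc_as_declared()` and `inc_as_read_by_old_asm()` are hypothetical names, and real register behaviour depends on the code the compiler generates:

```c
#include <stdint.h>

/* What the C prototype promises: 'inc' is the low 32 bits (like %ecx). */
static uint32_t inc_as_declared(uint64_t reg)
{
	return (uint32_t)reg;
}

/* What the old assembly effectively used: all 64 bits (like %rcx). */
static uint64_t inc_as_read_by_old_asm(uint64_t reg)
{
	return reg;
}
```

With the register holding 0x1234567800000040, the declared view yields 0x40, while the full-width read sees the stale upper bits as well.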
Fixes: ed0356eda1 ("crypto: blake2s - x86_64 SIMD implementation")
Cc: stable@vger.kernel.org
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251102234209.62133-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Some z/Architecture processors can compute a SHA-3 digest in a single
instruction. arch/s390/crypto/ already uses this capability to optimize
the SHA-3 crypto_shash algorithms.
Use this capability to implement the sha3_224(), sha3_256(), sha3_384(),
and sha3_512() library functions too.
SHA3-256 benchmark results provided by Harald Freudenberger
(https://lore.kernel.org/r/4188d18bfcc8a64941c5ebd8de10ede2@linux.ibm.com/)
on a z/Architecture machine with "facility 86" (MSA level 12):
    Length (bytes)    Before (MB/s)    After (MB/s)
    ==============    =============    ============
                16              212             225
                64              820             915
               256             1850            3350
              1024             5400            8300
              4096            11200           11300
Note: the original data from Harald was given in the form of a graph for
each length, showing the distribution of throughputs from 500 runs. I
guesstimated the peak of each one.
Harald also reported that the generic SHA-3 code was at most 259 MB/s
(https://lore.kernel.org/r/c39f6b6c110def0095e5da5becc12085@linux.ibm.com/).
So as expected, the earlier commit that optimized sha3_absorb_blocks()
and sha3_keccakf() is the more important one; it optimized the Keccak
permutation which is the most performance-critical part of SHA-3.
Still, this additional commit does notably improve performance further
on some lengths.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Harald Freudenberger <freude@linux.ibm.com>
Link: https://lore.kernel.org/r/20251026055032.1413733-13-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Implement sha3_absorb_blocks() and sha3_keccakf() using the hardware-
accelerated SHA-3 support in Message-Security-Assist Extension 6.
This accelerates the SHA3-224, SHA3-256, SHA3-384, SHA3-512, and
SHAKE256 library functions.
Note that arch/s390/crypto/ already has SHA-3 code that uses this
extension, but it is exposed only via crypto_shash. This commit brings
the same acceleration to the SHA-3 library. The arch/s390/crypto/
version will become redundant and be removed in later changes.
Tested-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251026055032.1413733-11-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Instead of exposing the arm64-optimized SHA-3 code via arm64-specific
crypto_shash algorithms, just implement the sha3_absorb_blocks()
and sha3_keccakf() library functions. This is much simpler, it makes
the SHA-3 library functions be arm64-optimized, and it fixes the
longstanding issue where the arm64-optimized SHA-3 code was disabled by
default. SHA-3 still remains available through crypto_shash, but
individual architectures no longer need to handle it.
Note: to see the diff from arch/arm64/crypto/sha3-ce-glue.c to
lib/crypto/arm64/sha3.h, view this commit with 'git show -M10'.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251026055032.1413733-10-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
In crypto/sha3_generic.c, the keccakf() function calls keccakf_round()
to do four of Keccak-f's five step mappings. However, it does not do
the Iota step mapping - presumably because that is dependent on round
number, whereas Theta, Rho, Pi and Chi are not.
Note that the keccakf_round() function needs to be explicitly
non-inlined on certain architectures as gcc's produced output will (or
used to) use over 1KiB of stack space if inlined.
Now, this code was copied more or less verbatim into lib/crypto/sha3.c,
so that has the same aesthetic issue. Fix this there by passing the
round number into sha3_keccakf_one_round_generic() and doing the Iota
step mapping there.
crypto/sha3_generic.c is left untouched as that will be converted to use
lib/crypto/sha3.c at some point.
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251026055032.1413733-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Add SHA-3 support to lib/crypto/. All six algorithms in the SHA-3
family are supported: four digests (SHA3-224, SHA3-256, SHA3-384, and
SHA3-512) and two extendable-output functions (SHAKE128 and SHAKE256).
The SHAKE algorithms will be required for ML-DSA.
[EB: simplified the API to use fewer types and functions, fixed bug that
sometimes caused incorrect SHAKE output, cleaned up the
documentation, dropped an ad-hoc test that was inconsistent with
the rest of lib/crypto/, and many other cleanups]
Signed-off-by: David Howells <dhowells@redhat.com>
Co-developed-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251026055032.1413733-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
On big endian arm kernels, the arm optimized Curve25519 code produces
incorrect outputs and fails the Curve25519 test. This has been true
ever since this code was added.
It seems that hardly anyone (or even no one?) actually uses big endian
arm kernels. But as long as they're ostensibly supported, we should
disable this code on them so that it's not accidentally used.
Note: for future-proofing, use !CPU_BIG_ENDIAN instead of
CPU_LITTLE_ENDIAN. Both of these are arch-specific options that could
get removed in the future if big endian support gets dropped.
Fixes: d8f1308a02 ("crypto: arm/curve25519 - wire up NEON implementation")
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251104054906.716914-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Migrate the arm-optimized BLAKE2b code from arch/arm/crypto/ to
lib/crypto/arm/. This makes the BLAKE2b library able to use it, and it
also simplifies the code because it's easier to integrate with the
library than crypto_shash.
This temporarily makes the arm-optimized BLAKE2b code unavailable via
crypto_shash. A later commit reimplements the blake2b-* crypto_shash
algorithms on top of the BLAKE2b library API, making it available again.
Note that as per the lib/crypto/ convention, the optimized code is now
enabled by default. So, this also fixes the longstanding issue where
the optimized BLAKE2b code was not enabled by default.
To see the diff from arch/arm/crypto/blake2b-neon-glue.c to
lib/crypto/arm/blake2b.h, view this commit with 'git show -M10'.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251018043106.375964-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Add a library API for BLAKE2b, closely modeled after the BLAKE2s API.
This will allow in-kernel users such as btrfs to use BLAKE2b without
going through the generic crypto layer. In addition, as usual the
BLAKE2b crypto_shash algorithms will be reimplemented on top of this.
Note: to create lib/crypto/blake2b.c I made a copy of
lib/crypto/blake2s.c and made the updates from BLAKE2s => BLAKE2b. This
way, the BLAKE2s and BLAKE2b code is kept consistent. Therefore, it
borrows the SPDX-License-Identifier and Copyright from
lib/crypto/blake2s.c rather than crypto/blake2b_generic.c.
The library API uses 'struct blake2b_ctx', consistent with other
lib/crypto/ APIs. The existing 'struct blake2b_state' will be removed
once the blake2b crypto_shash algorithms are updated to stop using it.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251018043106.375964-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
A couple more small cleanups to the BLAKE2s code before these things get
propagated into the BLAKE2b code:
- Drop 'const' from some non-pointer function parameters. It was a bit
excessive and not conventional.
- Rename 'block' argument of blake2s_compress*() to 'data'. This is for
consistency with the SHA-* code, and also to avoid the implication
that it points to a singular "block".
No functional changes.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251018043106.375964-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
For consistency with the SHA-1, SHA-2, SHA-3 (in development), and MD5
library APIs, rename blake2s_state to blake2s_ctx.
As a refresher, the ctx name:
- Is a bit shorter.
- Avoids confusion with the compression function state, which is also
often called the state (but is just part of the full context).
- Is consistent with OpenSSL.
Not a big deal, of course. But consistency is nice. With a BLAKE2b
library API about to be added, this is a convenient time to update this.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251018043106.375964-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Reorder the parameters of blake2s() from (out, in, key, outlen, inlen,
keylen) to (key, keylen, in, inlen, out, outlen).
This aligns BLAKE2s with the common conventions of pairing buffers and
their lengths, and having outputs follow inputs. This is widely used
elsewhere in lib/crypto/ and crypto/, and even elsewhere in the BLAKE2s
code itself such as blake2s_init_key() and blake2s_final(). So
blake2s() was a bit of an exception.
Notably, this results in the same order as hmac_*_usingrawkey().
Note that since the type signature changed, it's not possible for a
blake2s() call site to be silently missed.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251018043106.375964-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Add FIPS cryptographic algorithm self-tests for all SHA-1 and SHA-2
algorithms. Following the "Implementation Guidance for FIPS 140-3"
document, to achieve this it's sufficient to just test a single test
vector for each of HMAC-SHA1, HMAC-SHA256, and HMAC-SHA512.
Just run these tests in the initcalls, following the example of e.g.
crypto/kdf_sp800108.c. Note that this should meet the FIPS self-test
requirement even in the built-in case, given that the initcalls run
before userspace, storage, network, etc. are accessible.
This does not fix a regression, seeing as lib/ has had SHA-1 support
since 2005 and SHA-256 support since 2018. Neither ever had FIPS
self-tests. Moreover, fips=1 support has always been an unfinished
feature upstream. However, with lib/ now being used more widely, it's
seeing more scrutiny, and people seem to want these tests [1][2].
[1] https://lore.kernel.org/r/3226361.1758126043@warthog.procyon.org.uk/
[2] https://lore.kernel.org/r/f31dbb22-0add-481c-aee0-e337a7731f8e@oracle.com/
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251011001047.51886-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Pull interleaved SHA-256 hashing support from Eric Biggers:
"Optimize fsverity with 2-way interleaved hashing
Add support for 2-way interleaved SHA-256 hashing to lib/crypto/, and
make fsverity use it for faster file data verification. This improves
fsverity performance on many x86_64 and arm64 processors.
Later, I plan to make dm-verity use this too"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: Use 2-way interleaved SHA-256 hashing when supported
fsverity: Remove inode parameter from fsverity_hash_block()
lib/crypto: tests: Add tests and benchmark for sha256_finup_2x()
lib/crypto: x86/sha256: Add support for 2-way interleaved hashing
lib/crypto: arm64/sha256: Add support for 2-way interleaved hashing
lib/crypto: sha256: Add support for 2-way interleaved hashing
Pull crypto library updates from Eric Biggers:
- Add a RISC-V optimized implementation of Poly1305. This code was
written by Andy Polyakov and contributed by Zhihang Shao.
- Migrate the MD5 code into lib/crypto/, and add KUnit tests for MD5.
Yes, it's still the 90s, and several kernel subsystems are still
using MD5 for legacy use cases. As long as that remains the case,
it's helpful to clean it up in the same way as I've been doing for
other algorithms.
Later, I plan to convert most of these users of MD5 to use the new
MD5 library API instead of the generic crypto API.
- Simplify the organization of the ChaCha, Poly1305, BLAKE2s, and
Curve25519 code.
Consolidate these into one module per algorithm, and centralize the
configuration and build process. This is the same reorganization that
has already been successful for SHA-1 and SHA-2.
- Remove the unused crypto_kpp API for Curve25519.
- Migrate the BLAKE2s and Curve25519 self-tests to KUnit.
- Always enable the architecture-optimized BLAKE2s code.
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (38 commits)
crypto: md5 - Implement export_core() and import_core()
wireguard: kconfig: simplify crypto kconfig selections
lib/crypto: tests: Enable Curve25519 test when CRYPTO_SELFTESTS
lib/crypto: curve25519: Consolidate into single module
lib/crypto: curve25519: Move a couple functions out-of-line
lib/crypto: tests: Add Curve25519 benchmark
lib/crypto: tests: Migrate Curve25519 self-test to KUnit
crypto: curve25519 - Remove unused kpp support
crypto: testmgr - Remove curve25519 kpp tests
crypto: x86/curve25519 - Remove unused kpp support
crypto: powerpc/curve25519 - Remove unused kpp support
crypto: arm/curve25519 - Remove unused kpp support
crypto: hisilicon/hpre - Remove unused curve25519 kpp support
lib/crypto: tests: Add KUnit tests for BLAKE2s
lib/crypto: blake2s: Consolidate into single C translation unit
lib/crypto: blake2s: Move generic code into blake2s.c
lib/crypto: blake2s: Always enable arch-optimized BLAKE2s code
lib/crypto: blake2s: Remove obsolete self-test
lib/crypto: x86/blake2s: Reduce size of BLAKE2S_SIGMA2
lib/crypto: chacha: Consolidate into single module
...
Pull CRC updates from Eric Biggers:
"Update crc_kunit to test the CRC functions in softirq and hardirq
contexts, similar to what the lib/crypto/ KUnit tests do. Move the
helper function needed to do this into a common header.
This is useful mainly to test fallback code paths for when
FPU/SIMD/vector registers are unusable"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
Documentation/staging: Fix typo and incorrect citation in crc32.rst
lib/crc: Drop inline from all *_mod_init_arch() functions
lib/crc: Use underlying functions instead of crypto_simd_usable()
lib/crc: crc_kunit: Test CRC computation in interrupt contexts
kunit, lib/crypto: Move run_irq_test() to common header
Add an implementation of sha256_finup_2x_arch() for x86_64. It
interleaves the computation of two SHA-256 hashes using the x86 SHA-NI
instructions. dm-verity and fs-verity will take advantage of this for
greatly improved performance on capable CPUs.
This increases the throughput of SHA-256 hashing 4096-byte messages by
the following amounts on the following CPUs:
Intel Ice Lake (server): 4%
Intel Sapphire Rapids: 38%
Intel Emerald Rapids: 38%
AMD Zen 1 (Threadripper 1950X): 84%
AMD Zen 4 (EPYC 9B14): 98%
AMD Zen 5 (Ryzen 9 9950X): 64%
For now, this seems to benefit AMD more than Intel. This seems to be
because current AMD CPUs support concurrent execution of the SHA-NI
instructions, but unfortunately current Intel CPUs don't, except for the
sha256msg2 instruction. Hopefully future Intel CPUs will support SHA-NI
on more execution ports. Zen 1 supports 2 concurrent sha256rnds2, and
Zen 4 supports 4 concurrent sha256rnds2, which suggests that even better
performance may be achievable on Zen 4 by interleaving more than two
hashes. However, doing so poses a number of trade-offs, and furthermore
Zen 5 goes back to supporting "only" 2 concurrent sha256rnds2.
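To help read the percentages above, the following Python sketch converts
a throughput gain into a throughput multiplier and a relative time per
hash. The figures are copied from the list above; the helper name
relative_time_per_hash() is made up for this example. An ideal 2-way
interleave, where two hashes take as long as one, would correspond to a
100% gain.

```python
# Throughput gains (percent) copied from the benchmark list above.
GAINS = {
    "Intel Ice Lake (server)": 4,
    "Intel Sapphire Rapids": 38,
    "Intel Emerald Rapids": 38,
    "AMD Zen 1 (Threadripper 1950X)": 84,
    "AMD Zen 4 (EPYC 9B14)": 98,
    "AMD Zen 5 (Ryzen 9 9950X)": 64,
}


def relative_time_per_hash(gain_percent):
    """Time per hash relative to the non-interleaved baseline of 1.0."""
    return 1.0 / (1.0 + gain_percent / 100.0)


if __name__ == "__main__":
    for cpu, gain in GAINS.items():
        multiplier = 1.0 + gain / 100.0
        print(f"{cpu}: {multiplier:.2f}x throughput, "
              f"{relative_time_per_hash(gain):.2f}x time per hash")
```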
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250915160819.140019-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Add an implementation of sha256_finup_2x_arch() for arm64. It
interleaves the computation of two SHA-256 hashes using the ARMv8
SHA-256 instructions. dm-verity and fs-verity will take advantage of
this for greatly improved performance on capable CPUs.
This increases the throughput of SHA-256 hashing 4096-byte messages by
the following amounts on the following CPUs:
ARM Cortex-X1: 70%
ARM Cortex-X3: 68%
ARM Cortex-A76: 65%
ARM Cortex-A715: 43%
ARM Cortex-A510: 25%
ARM Cortex-A55: 8%
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250915160819.140019-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Many arm64 and x86_64 CPUs can compute two SHA-256 hashes in nearly the
same time as one, if the instructions are interleaved. This is because
SHA-256 is serialized block-by-block, and two interleaved hashes take
much better advantage of the CPU's instruction-level parallelism.
Meanwhile, a very common use case for SHA-256 hashing in the Linux
kernel is dm-verity and fs-verity. Both use a Merkle tree that has a
fixed block size, usually 4096 bytes with an empty or 32-byte salt
prepended. Usually, many blocks need to be hashed at a time. This is
an ideal scenario for 2-way interleaved hashing.
To enable this optimization, add a new function sha256_finup_2x() to the
SHA-256 library API. It computes the hash of two equal-length messages,
starting from a common initial context.
For now it always falls back to sequential processing. Later patches
will wire up arm64 and x86_64 optimized implementations.
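The semantics described above can be modeled in userspace as follows.
This is only a sketch of the sequential fallback behavior, written in
Python with hashlib. The Python signature of sha256_finup_2x() here and
the hash_blocks() helper are assumptions made for illustration; they are
not the kernel's C prototypes.

```python
import hashlib

BLOCK_SIZE = 4096  # typical Merkle tree block size


def sha256_finup_2x(ctx, data1, data2):
    """Finish two hashes of equal-length messages from one common context.

    `ctx` is a hashlib SHA-256 object that may already have absorbed a
    shared prefix such as a salt. It is left unmodified. This model hashes
    the two messages one after the other, like the generic fallback.
    """
    if len(data1) != len(data2):
        raise ValueError("messages must have equal length")
    h1 = ctx.copy()
    h2 = ctx.copy()
    h1.update(data1)
    h2.update(data2)
    return h1.digest(), h2.digest()


def hash_blocks(salt, blocks):
    """Hash salted blocks two at a time; an odd last block is hashed alone."""
    ctx = hashlib.sha256(salt)  # common initial context: the salt
    digests = []
    for i in range(0, len(blocks) - 1, 2):
        digests.extend(sha256_finup_2x(ctx, blocks[i], blocks[i + 1]))
    if len(blocks) % 2:
        last = ctx.copy()
        last.update(blocks[-1])
        digests.append(last.digest())
    return digests


if __name__ == "__main__":
    salt = bytes(32)
    blocks = [bytes([i]) * BLOCK_SIZE for i in range(5)]
    for digest in hash_blocks(salt, blocks):
        print(digest.hex())
```

A caller such as a Merkle tree verifier only needs to batch its blocks
in pairs. Whether the two hashes are actually interleaved is hidden
behind the function, so the results are the same either way.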
Note that the interleaving factor could in principle be higher than 2x.
However, that runs into many practical difficulties and CPU throughput
limitations. Thus, both the implementations I'm adding are 2x. In the
interest of using the simplest solution, the API matches that.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250915160819.140019-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>