Pull kunit updates from Shuah Khan:
- Make filter parameters configurable via Kconfig
- Add description of kunit.enable parameter to documentation
* tag 'linux_kselftest-kunit-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Make filter parameters configurable via Kconfig
Documentation: kunit: add description of kunit.enable parameter
Pull kselftest updates from Shuah Khan:
- Add basic test for trace_marker_raw file to tracing selftest
- Fix invalid array access in printf dma_map_benchmark selftest
- Add tprobe enable/disable testcase to tracing selftest
- Update fprobe selftest for ftrace based fprobe
* tag 'linux_kselftest-next-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests: tracing: Update fprobe selftest for ftrace based fprobe
selftests: tracing: Add tprobe enable/disable testcase
selftests/run_kselftest.sh: exit with error if tests fail
selftests/dma: fix invalid array access in printf
selftests/tracing: Add basic test for trace_marker_raw file
Pull Kbuild updates from Nicolas Schier:
- Enable -fms-extensions, allowing anonymous use of tagged struct or
union in struct/union (tag kbuild-ms-extensions-6.19). An exemplary
conversion patch is added here, too (btrfs).
[ Editor's note: the core of this actually came in early through a
shared branch and a few other trees - Linus ]
- Introduce architecture-specific CC_CAN_LINK and flags for userprogs
- Add new packaging target 'modules-cpio-pkg' for building a initramfs
cpio w/ kmods
- Handle included .c files in gen_compile_commands
- Minor kbuild changes:
- Use objtree for module signing key path, fixing oot kmod signing
- Improve documentation of KBUILD_BUILD_TIMESTAMP
- Reuse KBUILD_USERCFLAGS for UAPI, instead of defining twice
- Rename scripts/Makefile.extrawarn to Makefile.warn
- Drop obsolete types.h check from headers_check.pl
- Remove outdated config leak ignore entries
* tag 'kbuild-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kbuild: add target to build a cpio containing modules
initramfs: add gen_init_cpio to hostprogs unconditionally
kbuild: allow architectures to override CC_CAN_LINK
init: deduplicate cc-can-link.sh invocations
kbuild: don't enable CC_CAN_LINK if the dummy program generates warnings
scripts: headers_install.sh: Remove two outdated config leak ignore entries
scripts/clang-tools: Handle included .c files in gen_compile_commands
kbuild: uapi: Drop types.h check from headers_check.pl
kbuild: Rename Makefile.extrawarn to Makefile.warn
MAINTAINERS, .mailmap: Update mail address for Nicolas Schier
kbuild: uapi: reuse KBUILD_USERCFLAGS
kbuild: doc: improve KBUILD_BUILD_TIMESTAMP documentation
kbuild: Use objtree for module signing key path
btrfs: send: make use of -fms-extensions for defining struct fs_path
Pull Rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- Add support for 'syn'.
Syn is a parsing library for parsing a stream of Rust tokens into a
syntax tree of Rust source code.
Currently this library is geared toward use in Rust procedural
macros, but contains some APIs that may be useful more generally.
'syn' allows us to greatly simplify writing complex macros such as
'pin-init' (Benno has already prepared the 'syn'-based version). We
will use it in the 'macros' crate too.
'syn' is the most downloaded Rust crate (according to crates.io),
and it is also used by the Rust compiler itself. While the amount
of code is substantial, there should not be many updates needed for
these crates, and even if there are, they should not be too big,
e.g. +7k -3k lines across the 3 crates in the last year.
'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'.
I only modified their code to remove a third dependency
('unicode-ident') and to add the SPDX identifiers. The code can be
easily verified to exactly match upstream with the provided
scripts.
They are all licensed under "Apache-2.0 OR MIT", like the other
vendored 'alloc' crate we had for a while.
Please see the merge commit with the cover letter for more context.
- Allow 'unreachable_pub' and 'clippy::disallowed_names' for
doctests.
Examples (i.e. doctests) may want to do things like show public
items and use names such as 'foo'.
Nevertheless, we still try to keep examples as close to real code
as possible (this is part of why running Clippy on doctests is
important for us, e.g. for safety comments, which userspace Rust
does not support yet but we are stricter).
'kernel' crate:
- Replace our custom 'CStr' type with 'core::ffi::CStr'.
Using the standard library type reduces our custom code footprint,
and we retain needed custom functionality through an extension
trait and a new 'fmt!' macro which replaces the previous 'core'
import.
This started in 6.17 and continued in 6.18, and we finally land the
replacement now. This required quite some stamina from Tamir, who
split the changes in steps to prepare for the flag day change here.
- Replace 'kernel::c_str!' with C string literals.
C string literals were added in Rust 1.77, which produce '&CStr's
(the 'core' one), so now we can write:
c"hi"
instead of:
c_str!("hi")
- Add 'num' module for numerical features.
It includes the 'Integer' trait, implemented for all primitive
integer types.
It also includes the 'Bounded' integer wrapping type: an integer
value that requires only the 'N' least significant bits of the
wrapped type to be encoded:
// An unsigned 8-bit integer, of which only the 4 LSBs are used.
let v = Bounded::<u8, 4>::new::<15>();
assert_eq!(v.get(), 15);
'Bounded' is useful to e.g. enforce guarantees when working with
bitfields that have an arbitrary number of bits.
Values can also be constructed from simple non-constant expressions
or, for more complex ones, validated at runtime.
'Bounded' also comes with comparison and arithmetic operations
(with both their backing type and other 'Bounded's with a
compatible backing type), casts to change the backing type,
extending/shrinking and infallible/fallible conversions from/to
primitives as applicable.
- 'rbtree' module: add immutable cursor ('Cursor').
It enables to use just an immutable tree reference where
appropriate. The existing fully-featured mutable cursor is renamed
to 'CursorMut'.
kallsyms:
- Fix wrong "big" kernel symbol type read from procfs.
'pin-init' crate:
- A couple minor fixes (Benno asked me to pick these patches up for
him this cycle).
Documentation:
- Quick Start guide: add Debian 13 (Trixie).
Debian Stable is now able to build Linux, since Debian 13 (released
2025-08-09) packages Rust 1.85.0, which is recent enough.
We are planning to propose that the minimum supported Rust version
in Linux follows Debian Stable releases, with Debian 13 being the
first one we upgrade to, i.e. Rust 1.85.
MAINTAINERS:
- Add entry for the new 'num' module.
- Remove Alex as Rust maintainer: he hasn't had the time to
contribute for a few years now, so it is a no-op change in
practice.
And a few other cleanups and improvements"
* tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (53 commits)
rust: macros: support `proc-macro2`, `quote` and `syn`
rust: syn: enable support in kbuild
rust: syn: add `README.md`
rust: syn: remove `unicode-ident` dependency
rust: syn: add SPDX License Identifiers
rust: syn: import crate
rust: quote: enable support in kbuild
rust: quote: add `README.md`
rust: quote: add SPDX License Identifiers
rust: quote: import crate
rust: proc-macro2: enable support in kbuild
rust: proc-macro2: add `README.md`
rust: proc-macro2: remove `unicode_ident` dependency
rust: proc-macro2: add SPDX License Identifiers
rust: proc-macro2: import crate
rust: kbuild: support using libraries in `rustc_procmacro`
rust: kbuild: support skipping flags in `rustc_test_library`
rust: kbuild: add proc macro library support
rust: kbuild: simplify `--cfg` handling
rust: kbuild: introduce `core-flags` and `core-skip_flags`
...
Pull livepatching updates from Petr Mladek:
- Support both paths where tracefs is typically mounted in selftests
- Make old_sympos 0 and 1 equal. They both are valid when there is only
one symbol with the given name.
* tag 'livepatching-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
selftests: livepatch: use canonical ftrace path
livepatch: Match old_sympos 0 and 1 in klp_find_func()
Pull sched_ext updates from Tejun Heo:
- Improve recovery from misbehaving BPF schedulers.
When a scheduler puts many tasks with varying affinity restrictions
on a shared DSQ, CPUs scanning through tasks they cannot run can
overwhelm the system, causing lockups.
Bypass mode now uses per-CPU DSQs with a load balancer to avoid this,
and hooks into the hardlockup detector to attempt recovery.
Add scx_cpu0 example scheduler to demonstrate this scenario.
- Add lockless peek operation for DSQs to reduce lock contention for
schedulers that need to query queue state during load balancing.
- Allow scx_bpf_reenqueue_local() to be called from anywhere in
preparation for deprecating cpu_acquire/release() callbacks in favor
of generic BPF hooks.
- Prepare for hierarchical scheduler support: add
scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs,
make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in
structs for future aux__prog parameter.
- Implement cgroup_set_idle() callback to notify BPF schedulers when a
cgroup's idle state changes.
- Fix migration tasks being incorrectly downgraded from
stop_sched_class to rt_sched_class across sched_ext enable/disable.
Applied late as the fix is low risk and the bug subtle but needs
stable backporting.
- Various fixes and cleanups including cgroup exit ordering,
SCX_KICK_WAIT reliability, and backward compatibility improvements.
* tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits)
sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks
sched_ext: tools: Removing duplicate targets during non-cross compilation
sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object
sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs
sched_ext: Update comments replacing breather with aborting mechanism
sched_ext: Implement load balancer for bypass mode
sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked()
sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
sched_ext: Add scx_cpu0 example scheduler
sched_ext: Hook up hardlockup detector
sched_ext: Make handle_lockup() propagate scx_verror() result
sched_ext: Refactor lockup handlers into handle_lockup()
sched_ext: Make scx_exit() and scx_vexit() return bool
sched_ext: Exit dispatch and move operations immediately when aborting
sched_ext: Simplify breather mechanism with scx_aborting flag
sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode
sched_ext: Refactor do_enqueue_task() local and global DSQ paths
sched_ext: Use shorter slice in bypass mode
sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races
sched_ext: Minor cleanups to scx_task_iter
...
Pull cgroup updates from Tejun Heo:
- Defer task cgroup unlink until after the dying task's final context
switch so that controllers see the cgroup properly populated until
the task is truly gone
- cpuset cleanups and simplifications.
Enforce that domain isolated CPUs stay in root or isolated partitions
and fail if isolated+nohz_full would leave no housekeeping CPU. Fix
sched/deadline root domain handling during CPU hot-unplug and race
for tasks in attaching cpusets
- Misc fixes including memory reclaim protection documentation and
selftest KTAP conformance
* tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits)
cpuset: Treat cpusets in attaching as populated
sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug
cgroup/cpuset: Introduce cpuset_cpus_allowed_locked()
docs: cgroup: No special handling of unpopulated memcgs
docs: cgroup: Note about sibling relative reclaim protection
docs: cgroup: Explain reclaim protection target
selftests/cgroup: conform test to KTAP format output
cpuset: remove need_rebuild_sched_domains
cpuset: remove global remote_children list
cpuset: simplify node setting on error
cgroup: include missing header for struct irq_work
cgroup: Fix sleeping from invalid context warning on PREEMPT_RT
cgroup/cpuset: Globally track isolated_cpus update
cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition
cgroup/cpuset: Move up prstate_housekeeping_conflict() helper
cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping
cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
cgroup: Defer task cgroup unlink until after the task is done switching out
cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
...
Pull workqueue updates from Tejun Heo:
- Rescuer affinity management: Affinity is now updated only when
detached using wq_unbound_cpumask consistently. DISASSOCIATED workers
also follow unbound cpumask changes to avoid breaking CPU isolation
- Rescuer cleanups preparing for fetching work items one by one from
pool list: Work assignment factored out, optimized to skip pwqs no
longer needing rescue, and shutdown logic simplified
- Unused assert_rcu_or_wq_mutex_or_pool_mutex() removed
* tag 'wq-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Don't rely on wq->rescuer to stop rescuer
workqueue: Only assign rescuer work when really needed
workqueue: Factor out assign_rescuer_work()
workqueue: Init rescuer's affinities as wq_unbound_cpumask
workqueue: Let DISASSOCIATED workers follow unbound wq cpumask changes
workqueue: Update the rescuer's affinity only when it is detached
workqueue: Remove unused assert_rcu_or_wq_mutex_or_pool_mutex
Pull printk updates from Petr Mladek:
- Allow creaing nbcon console drivers with an unsafe write_atomic()
callback that can only be called by the final nbcon_atomic_flush_unsafe().
Otherwise, the driver would rely on the kthread.
It is going to be used as the-best-effort approach for an
experimental nbcon netconsole driver, see
https://lore.kernel.org/r/20251121-nbcon-v1-2-503d17b2b4af@debian.org
Note that a safe .write_atomic() callback is supposed to work in NMI
context. But some networking drivers are not safe even in IRQ
context:
https://lore.kernel.org/r/oc46gdpmmlly5o44obvmoatfqo5bhpgv7pabpvb6sjuqioymcg@gjsma3ghoz35
In an ideal world, all networking drivers would be fixed first and
the atomic flush would be blocked only in NMI context. But it brings
the question how reliable networking drivers are when the system is
in a bad state. They might block flushing more reliable serial
consoles which are more suitable for serious debugging anyway.
- Allow to use the last 4 bytes of the printk ring buffer.
- Prevent queuing IRQ work and block printk kthreads when consoles are
suspended. Otherwise, they create non-necessary churn or even block
the suspend.
- Release console_lock() between each record in the kthread used for
legacy consoles on RT. It might significantly speed up the boot.
- Release nbcon context between each record in the atomic flush. It
prevents stalls of the related printk kthread after it has lost the
ownership in the middle of a record
- Add support for NBCON consoles into KDB
- Add %ptsP modifier for printing struct timespec64 and use it where
possible
- Misc code clean up
* tag 'printk-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (48 commits)
printk: Use console_is_usable on console_unblank
arch: um: kmsg_dump: Use console_is_usable
drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
lib/vsprintf: Unify FORMAT_STATE_NUM handlers
printk: Avoid irq_work for printk_deferred() on suspend
printk: Avoid scheduling irq_work on suspend
printk: Allow printk_trigger_flush() to flush all types
tracing: Switch to use %ptSp
scsi: snic: Switch to use %ptSp
scsi: fnic: Switch to use %ptSp
s390/dasd: Switch to use %ptSp
ptp: ocp: Switch to use %ptSp
pps: Switch to use %ptSp
PCI: epf-test: Switch to use %ptSp
net: dsa: sja1105: Switch to use %ptSp
mmc: mmc_test: Switch to use %ptSp
media: av7110: Switch to use %ptSp
ipmi: Switch to use %ptSp
igb: Switch to use %ptSp
e1000e: Switch to use %ptSp
...
Pull lkmm documentation update from Paul McKenney:
- Sort the memory-barriers.txt file's wait_event* and wait_on_bit* list
alphabetically
* tag 'lkmm.2025.12.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
memory-barriers.txt: Sort wait_event* and wait_on_bit* list alphabetically
Pull RCU updates from Frederic Weisbecker:
"SRCU:
- Properly handle SRCU readers within IRQ disabled sections in tiny
SRCU
- Preparation to reimplement RCU Tasks Trace on top of SRCU fast:
- Introduce API to expedite a grace period and test it through
rcutorture
- Split srcu-fast in two flavours: SRCU-fast and SRCU-fast-updown.
Both are still targeted toward faster readers (without full
barriers on LOCK and UNLOCK) at the expense of heavier write
side (using full RCU grace period ordering instead of simply
full ordering) as compared to "traditional" non-fast SRCU. But
those srcu-fast flavours are going to be optimized in two
different ways:
- SRCU-fast will become the reimplementation basis for
RCU-TASK-TRACE for consolidation. Since RCU-TASK-TRACE must
be NMI safe, SRCU-fast must be as well.
- SRCU-fast-updown will be needed for uretprobes code in order
to get rid of the read-side memory barriers while still
allowing entering the reader at task level while exiting it
in a timer handler. It is considered semaphore-like in that
it can have different owners between LOCK and UNLOCK.
However it is not NMI-safe.
The actual optimizations are work in progress for the next
cycle. Only the new interfaces are added for now, along with
related torture and scalability test code.
- Create/document/debug/torture new proper initializers for RCU fast:
DEFINE_SRCU_FAST() and init_srcu_struct_fast()
This allows for using right away the proper ordering on the write
side (either full ordering or full RCU grace period ordering)
without waiting for the read side to tell which to use.
This also optimizes the read side altogether with moving flavour
debug checks under debug config and with removing a costly RmW
operation on their first call.
- Make some diagnostic functions tracing safe
Refscale:
- Add performance testing for common context synchronizations
(Preemption, IRQ, Softirq) and per-cpu increments. Those are
relevant comparisons against SRCU-fast read side APIs, especially
as they are planned to synchronize further tracing fast-path code
Miscellanous:
- In order to prepare the layout for nohz_full work deferral to user
exit, the context tracking state must shrink the counter of
transitions to/from RCU not watching. The only possible hazard is
to trigger wrap-around more easily, delaying a bit grace periods
when that happens. This should be a rare event though. Yet add
debugging and torture code to test that assumption
- Fix memory leak on locktorture module
- Annotate accesses in rculist_nulls.h to prevent from KCSAN
warnings. On recent discussions, we also concluded that all those
WRITE_ONCE() and READ_ONCE() on list APIs deserve appropriate
comments. Something to be expected for the next cycle
- Provide a script to apply several configs to several commits with
torture
- Allow torture to reuse a build directory in order to save needless
rebuild time
- Various cleanups"
* tag 'rcu.release.v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (29 commits)
refscale: Add SRCU-fast-updown readers
refscale: Exercise DEFINE_STATIC_SRCU_FAST() and init_srcu_struct_fast()
rcutorture: Make srcu{,d}_torture_init() announce the SRCU type
srcu: Create an SRCU-fast-updown API
refscale: Do not disable interrupts for tests involving local_bh_enable()
refscale: Add non-atomic per-CPU increment readers
refscale: Add this_cpu_inc() readers
refscale: Add preempt_disable() readers
refscale: Add local_bh_disable() readers
refscale: Add local_irq_disable() and local_irq_save() readers
torture: Permit negative kvm.sh --kconfig numberic arguments
srcu: Add SRCU_READ_FLAVOR_FAST_UPDOWN CPP macro
rcu: Mark diagnostic functions as notrace
rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
rcutorture: Remove redundant rcutorture_one_extend() from rcu_torture_one_read()
rcutorture: Permit kvm-again.sh to re-use the build directory
torture: Add kvm-series.sh to test commit/scenario combination
rcu: use WRITE_ONCE() for ->next and ->pprev of hlist_nulls
locktorture: Fix memory leak in param_set_cpumask()
doc: Update for SRCU-fast definitions and initialization
...
Pull slab updates from Vlastimil Babka:
- mempool_alloc_bulk() support for upcoming users in the block layer
that need to allocate multiple objects at once with the mempool's
guaranteed progress semantics, which is not achievable with an
allocation single objects in a loop. Along with refactoring and
various improvements (Christoph Hellwig)
- Preparations for the upcoming separation of struct slab from struct
page, mostly by removing the struct folio layer, as the purpose of
struct folio has shifted since it became used in slab code (Matthew
Wilcox)
- Modernisation of slab's boot param API usage, which removes some
unexpected parsing corner cases (Petr Tesarik)
- Refactoring of freelist_aba_t (now struct freelist_counters) and
associated functions for double cmpxchg, enabled by -fms-extensions
(Vlastimil Babka)
- Cleanups and improvements related to sheaves caching layer, that were
part of the full conversion to sheaves, which is planned for the next
release (Vlastimil Babka)
* tag 'slab-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (42 commits)
slab: Remove unnecessary call to compound_head() in alloc_from_pcs()
mempool: clarify behavior of mempool_alloc_preallocated()
mempool: drop the file name in the top of file comment
mempool: de-typedef
mempool: remove mempool_{init,create}_kvmalloc_pool
mempool: legitimize the io_schedule_timeout in mempool_alloc_from_pool
mempool: add mempool_{alloc,free}_bulk
mempool: factor out a mempool_alloc_from_pool helper
slab: Remove references to folios from virt_to_slab()
kasan: Remove references to folio in __kasan_mempool_poison_object()
memcg: Convert mem_cgroup_from_obj_folio() to mem_cgroup_from_obj_slab()
mempool: factor out a mempool_adjust_gfp helper
mempool: add error injection support
mempool: improve kerneldoc comments
mm: improve kerneldoc comments for __alloc_pages_bulk
fault-inject: make enum fault_flags available unconditionally
usercopy: Remove folio references from check_heap_object()
slab: Remove folio references from kfree_nolock()
slab: Remove folio references from kfree_rcu_sheaf()
slab: Remove folio references from build_detached_freelist()
...
Pull documentation updates from Jonathan Corbet:
"This has been another busy cycle for documentation, with a lot of
build-system thrashing. That work should slow down from here on out.
- The various scripts and tools for documentation were spread out in
several directories; now they are (almost) all coalesced under
tools/docs/. The holdout is the kernel-doc script, which cannot be
easily moved without some further thought.
- As the amount of Python code increases, we are accumulating modules
that are imported by multiple programs. These modules have been
pulled together under tools/lib/python/ -- at least, for
documentation-related programs. There is other Python code in the
tree that might eventually want to move toward this organization.
- The Perl kernel-doc.pl script has been removed. It is no longer
used by default, and nobody has missed it, least of all anybody who
actually had to look at it.
- The docs build was controlled by a complex mess of makefilese that
few dared to touch. Mauro has moved that logic into a new program
(tools/docs/sphinx-build-wrapper) that, with any luck at all, will
be far easier to understand and maintain.
- The get_feat.pl program, used to access information under
Documentation/features/, has been rewritten in Python, bringing an
end to the use of Perl in the docs subsystem.
- The top-level README file has been reorganized into a more
reader-friendly presentation.
- A lot of Chinese translation additions
- Typo fixes and documentation updates as usual"
* tag 'docs-6.19' of git://git.lwn.net/linux: (164 commits)
docs: makefile: move rustdoc check to the build wrapper
README: restructure with role-based documentation and guidelines
docs: kdoc: various fixes for grammar, spelling, punctuation
docs: kdoc_parser: use '@' for Excess enum value
docs: submitting-patches: Clarify that removal of Acks needs explanation too
docs: kdoc_parser: add data/function attributes to ignore
docs: MAINTAINERS: update Mauro's files/paths
docs/zh_CN: Add wd719x.rst translation
docs/zh_CN: Add libsas.rst translation
get_feat.pl: remove it, as it got replaced by get_feat.py
Documentation/sphinx/kernel_feat.py: use class directly
tools/docs/get_feat.py: convert get_feat.pl to Python
Documentation/admin-guide: fix typo and comment in cscope example
docs/zh_CN: Add data-integrity.rst translation
docs/zh_CN: Add blk-mq.rst translation
docs/zh_CN: Add block/index.rst translation
docs/zh_CN: Update the Chinese translation of kbuild.rst
docs: bring some order to our Python module hierarchy
docs: Move the python libraries to tools/lib/python
Documentation/kernel-parameters: Move the kernel build options
...
Pull crypto updates from Herbert Xu:
"API:
- Rewrite memcpy_sglist from scratch
- Add on-stack AEAD request allocation
- Fix partial block processing in ahash
Algorithms:
- Remove ansi_cprng
- Remove tcrypt tests for poly1305
- Fix EINPROGRESS processing in authenc
- Fix double-free in zstd
Drivers:
- Use drbg ctr helper when reseeding xilinx-trng
- Add support for PCI device 0x115A to ccp
- Add support of paes in caam
- Add support for aes-xts in dthev2
Others:
- Use likely in rhashtable lookup
- Fix lockdep false-positive in padata by removing a helper"
* tag 'v6.19-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (71 commits)
crypto: zstd - fix double-free in per-CPU stream cleanup
crypto: ahash - Zero positive err value in ahash_update_finish
crypto: ahash - Fix crypto_ahash_import with partial block data
crypto: lib/mpi - use min() instead of min_t()
crypto: ccp - use min() instead of min_t()
hwrng: core - use min3() instead of nested min_t()
crypto: aesni - ctr_crypt() use min() instead of min_t()
crypto: drbg - Delete unused ctx from struct sdesc
crypto: testmgr - Add missing DES weak and semi-weak key tests
Revert "crypto: scatterwalk - Move skcipher walk and use it for memcpy_sglist"
crypto: scatterwalk - Fix memcpy_sglist() to always succeed
crypto: iaa - Request to add Kanchana P Sridhar to Maintainers.
crypto: tcrypt - Remove unused poly1305 support
crypto: ansi_cprng - Remove unused ansi_cprng algorithm
crypto: asymmetric_keys - fix uninitialized pointers with free attribute
KEYS: Avoid -Wflex-array-member-not-at-end warning
crypto: ccree - Correctly handle return of sg_nents_for_len
crypto: starfive - Correctly handle return of sg_nents_for_len
crypto: iaa - Fix incorrect return value in save_iaa_wq()
crypto: zstd - Remove unnecessary size_t cast
...
Pull IPE udates from Fan Wu:
"The primary change is the addition of support for the AT_EXECVE_CHECK
flag. This allows interpreters to signal the kernel to perform IPE
security checks on script files before execution, extending IPE
enforcement to indirectly executed scripts.
Update documentation for it, and also fix a comment"
* tag 'ipe-pr-20251202' of git://git.kernel.org/pub/scm/linux/kernel/git/wufan/ipe:
ipe: Update documentation for script enforcement
ipe: Add AT_EXECVE_CHECK support for script enforcement
ipe: Drop a duplicated CONFIG_ prefix in the ifdeffery
Pull integrity updates from Mimi Zohar:
"Bug fixes:
- defer credentials checking from the bprm_check_security hook to the
bprm_creds_from_file security hook
- properly ignore IMA policy rules based on undefined SELinux labels
IMA policy rule extensions:
- extend IMA to limit including file hashes in the audit logs
(dont_audit action)
- define a new filesystem subtype policy option (fs_subtype)
Misc:
- extend IMA to support in-kernel module decompression by deferring
the IMA signature verification in kernel_read_file() to after the
kernel module is decompressed"
* tag 'integrity-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: Handle error code returned by ima_filter_rule_match()
ima: Access decompressed kernel module to verify appended signature
ima: add fs_subtype condition for distinguishing FUSE instances
ima: add dont_audit action to suppress audit actions
ima: Attach CREDS_CHECK IMA hook to bprm_creds_from_file LSM hook
Pull smack updates from Casey Schaufler:
- fix several cases where labels were treated inconsistently when
imported from user space
- clean up the assignment of extended attributes
- documentation improvements
* tag 'Smack-for-6.19' of https://github.com/cschaufler/smack-next:
Smack: function parameter 'gfp' not described
smack: fix kernel-doc warnings for smk_import_valid_label()
smack: fix bug: setting task label silently ignores input garbage
smack: fix bug: unprivileged task can create labels
smack: fix bug: invalid label of unix socket file
smack: always "instantiate" inode in smack_inode_init_security()
smack: deduplicate xattr setting in smack_inode_init_security()
smack: fix bug: SMACK64TRANSMUTE set on non-directory
smack: deduplicate "does access rule request transmutation"
Pull audit updates from Paul Moore:
- Consolidate the loops in __audit_inode_child() to improve performance
When logging a child inode in __audit_inode_child(), we first run
through the list of recorded inodes looking for the parent and then
we repeat the search looking for a matching child entry. This pull
request consolidates both searches into one pass through the recorded
inodes, resuling in approximately a 50% reduction in audit overhead.
See the commit description for the testing details.
- Combine kmalloc()/memset() into kzalloc() in audit_krule_to_data()
- Comment fixes
* tag 'audit-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: merge loops in __audit_inode_child()
audit: Use kzalloc() instead of kmalloc()/memset() in audit_krule_to_data()
audit: fix comment misindentation in audit.h
Pull selinux updates from Paul Moore:
- Improve the granularity of SELinux labeling for memfd files
Currently when creating a memfd file, SELinux treats it the same as
any other tmpfs, or hugetlbfs, file. While simple, the drawback is
that it is not possible to differentiate between memfd and tmpfs
files.
This adds a call to the security_inode_init_security_anon() LSM hook
and wires up SELinux to provide a set of memfd specific access
controls, including the ability to control the execution of memfds.
As usual, the commit message has more information.
- Improve the SELinux AVC lookup performance
Adopt MurmurHash3 for the SELinux AVC hash function instead of the
custom hash function currently used. MurmurHash3 is already used for
the SELinux access vector table so the impact to the code is minimal,
and performance tests have shown improvements in both hash
distribution and latency.
See the commit message for the performance measurments.
- Introduce a Kconfig option for the SELinux AVC bucket/slot size
While we have the ability to grow the number of AVC hash buckets
today, the size of the buckets (slot size) is fixed at 512. This pull
request makes that slot size configurable at build time through a new
Kconfig knob, CONFIG_SECURITY_SELINUX_AVC_HASH_BITS.
* tag 'selinux-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: improve bucket distribution uniformity of avc_hash()
selinux: Move avtab_hash() to a shared location for future reuse
selinux: Introduce a new config to make avc cache slot size adjustable
memfd,selinux: call security_inode_init_security_anon()
Pull LSM updates from Paul Moore:
- Rework the LSM initialization code
What started as a "quick" patch to enable a notification event once
all of the individual LSMs were initialized, snowballed a bit into a
30+ patch patchset when everything was done. Most of the patches, and
diffstat, is due to splitting out the initialization code into
security/lsm_init.c and cleaning up some of the mess that was there.
While not strictly necessary, it does cleanup the code signficantly,
and hopefully makes the upkeep a bit easier in the future.
Aside from the new LSM_STARTED_ALL notification, these changes also
ensure that individual LSM initcalls are only called when the LSM is
enabled at boot time. There should be a minor reduction in boot times
for those who build multiple LSMs into their kernels, but only enable
a subset at boot.
It is worth mentioning that nothing at present makes use of the
LSM_STARTED_ALL notification, but there is work in progress which is
dependent upon LSM_STARTED_ALL.
- Make better use of the seq_put*() helpers in device_cgroup
* tag 'lsm-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (36 commits)
lsm: use unrcu_pointer() for current->cred in security_init()
device_cgroup: Refactor devcgroup_seq_show to use seq_put* helpers
lsm: add a LSM_STARTED_ALL notification event
lsm: consolidate all of the LSM framework initcalls
selinux: move initcalls to the LSM framework
ima,evm: move initcalls to the LSM framework
lockdown: move initcalls to the LSM framework
apparmor: move initcalls to the LSM framework
safesetid: move initcalls to the LSM framework
tomoyo: move initcalls to the LSM framework
smack: move initcalls to the LSM framework
ipe: move initcalls to the LSM framework
loadpin: move initcalls to the LSM framework
lsm: introduce an initcall mechanism into the LSM framework
lsm: group lsm_order_parse() with the other lsm_order_*() functions
lsm: output available LSMs when debugging
lsm: cleanup the debug and console output in lsm_init.c
lsm: add/tweak function header comment blocks in lsm_init.c
lsm: fold lsm_init_ordered() into security_init()
lsm: cleanup initialize_lsm() and rename to lsm_init_single()
...
Pull trusted key updates from Jarkko Sakkinen:
- Remove duplicate 'tpm2_hash_map' in favor of 'tpm2_find_hash_alg()'
- Fix a memory leak on failure paths of 'tpm2_load_cmd'
* tag 'keys-trusted-next-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
KEYS: trusted: Fix a memory leak in tpm2_load_cmd
KEYS: trusted: Replace a redundant instance of tpm2_hash_map
Pull keys update from Jarkko Sakkinen:
"This contains only three fixes"
* tag 'keys-next-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
keys: Fix grammar and formatting in 'struct key_type' comments
keys: Replace deprecated strncpy in ecryptfs_fill_auth_tok
keys: Remove redundant less-than-zero checks
Pull nolibc updates from Thomas Weißschuh:
- Preparations to the use of nolibc in UML:
- Cleanup of sparse warnings
- Library mode without _start()
- More consistency when disabling errno
- Unconditional installation of all architecture support files
- Always 64-bit wide ino_t and off_t
- Various cleanups and bug fixes
* tag 'nolibc-20251130-for-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/nolibc/linux-nolibc: (25 commits)
selftests/nolibc: error out on linker warnings
selftests/nolibc: use lld to link loongarch binaries
tools/nolibc: remove more __nolibc_enosys() fallbacks
tools/nolibc: remove now superfluous overflow check in llseek
tools/nolibc: use 64-bit off_t
tools/nolibc: prefer the llseek syscall
tools/nolibc: handle 64-bit off_t for llseek
tools/nolibc: use 64-bit ino_t
tools/nolibc: avoid using plain integer as NULL pointer
tools/nolibc: add support for fchdir()
tools/nolibc: clean up outdated comments in generic arch.h
tools/nolibc: make the "headers" target install all supported archs
tools/nolibc: add the more portable inttypes.h
tools/nolibc: provide the portable sys/select.h
tools/nolibc: add missing memchr() to string.h
tools/nolibc: fix misleading help message regarding installation path
tools/nolibc: add uio.h with readv and writev
tools/nolibc: add option to disable runtime
tools/nolibc: use __fallthrough__ rather than fallthrough
tools/nolibc: implement %m if errno is not defined
...
Commit 1ee90870ce ("dt-bindings: thermal: tsens: Add QCS8300
compatible") uses a tab character which is illegal in YAML (at the
beginning of a line). The original patch was correct, so this got
corrupted when applied.
Fixes: 1ee90870ce ("dt-bindings: thermal: tsens: Add QCS8300 compatible")
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds explanation of script enforcement mechanism in admin
guide documentation. Describes how IPE supports integrity enforcement
for indirectly executed scripts through the AT_EXECVE_CHECK flag, and
how this differs from kernel enforcement for compiled executables.
Signed-off-by: Yanzhu Huang <yanzhuhuang@linux.microsoft.com>
Signed-off-by: Fan Wu <wufan@kernel.org>
This patch adds a new ipe_bprm_creds_for_exec() hook that integrates
with the AT_EXECVE_CHECK mechanism. To enable script enforcement,
interpreters need to incorporate the AT_EXECVE_CHECK flag when
calling execveat() on script files before execution.
When a userspace interpreter calls execveat() with the AT_EXECVE_CHECK
flag, this hook triggers IPE policy evaluation on the script file. The
hook only triggers IPE when bprm->is_check is true, ensuring it's
being called from an AT_EXECVE_CHECK context. It then builds an
evaluation context for an IPE_OP_EXEC operation and invokes IPE policy.
The kernel returns the policy decision to the interpreter, which can
then decide whether to proceed with script execution.
This extends IPE enforcement to indirectly executed scripts, permitting
trusted scripts to execute while denying untrusted ones.
Signed-off-by: Yanzhu Huang <yanzhuhuang@linux.microsoft.com>
Signed-off-by: Fan Wu <wufan@kernel.org>
Looks like it got added by mistake, perhaps editor auto-completion
artifact. Drop it.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Fan Wu <wufan@kernel.org>
When loading the ebpf scheduler, the tasks in the scx_tasks list will
be traversed and invoke __setscheduler_class() to get new sched_class.
however, this would also incorrectly set the per-cpu migration
task's->sched_class to rt_sched_class, even after unload, the per-cpu
migration task's->sched_class remains sched_rt_class.
The log for this issue is as follows:
./scx_rustland --stats 1
[ 199.245639][ T630] sched_ext: "rustland" does not implement cgroup cpu.weight
[ 199.269213][ T630] sched_ext: BPF scheduler "rustland" enabled
04:25:09 [INFO] RustLand scheduler attached
bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/
{ printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }'
Attaching 1 probe...
migration/0:24->rt_sched_class+0x0/0xe0
migration/1:27->rt_sched_class+0x0/0xe0
migration/2:33->rt_sched_class+0x0/0xe0
migration/3:39->rt_sched_class+0x0/0xe0
migration/4:45->rt_sched_class+0x0/0xe0
migration/5:52->rt_sched_class+0x0/0xe0
migration/6:58->rt_sched_class+0x0/0xe0
migration/7:64->rt_sched_class+0x0/0xe0
sched_ext: BPF scheduler "rustland" disabled (unregistered from user space)
EXIT: unregistered from user space
04:25:21 [INFO] Unregister RustLand scheduler
bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/
{ printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }'
Attaching 1 probe...
migration/0:24->rt_sched_class+0x0/0xe0
migration/1:27->rt_sched_class+0x0/0xe0
migration/2:33->rt_sched_class+0x0/0xe0
migration/3:39->rt_sched_class+0x0/0xe0
migration/4:45->rt_sched_class+0x0/0xe0
migration/5:52->rt_sched_class+0x0/0xe0
migration/6:58->rt_sched_class+0x0/0xe0
migration/7:64->rt_sched_class+0x0/0xe0
This commit therefore generate a new scx_setscheduler_class() and
add check for stop_sched_class to replace __setscheduler_class().
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The crypto/zstd module has a double-free bug that occurs when multiple
tfms are allocated and freed.
The issue happens because zstd_streams (per-CPU contexts) are freed in
zstd_exit() during every tfm destruction, rather than being managed at
the module level. When multiple tfms exist, each tfm exit attempts to
free the same shared per-CPU streams, resulting in a double-free.
This leads to a stack trace similar to:
BUG: Bad page state in process kworker/u16:1 pfn:106fd93
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x106fd93
flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: nonzero entire_mapcount
Modules linked in: ...
CPU: 3 UID: 0 PID: 2506 Comm: kworker/u16:1 Kdump: loaded Tainted: G B
Hardware name: ...
Workqueue: btrfs-delalloc btrfs_work_helper
Call Trace:
<TASK>
dump_stack_lvl+0x5d/0x80
bad_page+0x71/0xd0
free_unref_page_prepare+0x24e/0x490
free_unref_page+0x60/0x170
crypto_acomp_free_streams+0x5d/0xc0
crypto_acomp_exit_tfm+0x23/0x50
crypto_destroy_tfm+0x60/0xc0
...
Change the lifecycle management of zstd_streams to free the streams only
once during module cleanup.
Fixes: f5ad93ffb5 ("crypto: zstd - convert to acomp")
Cc: stable@vger.kernel.org
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Suman Kumar Chakraborty <suman.kumar.chakraborty@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- In order to prepare the layout for nohz_full work deferral to
user exit, the context tracking state must shrink the counter
of transitions to/from RCU not watching. The only possible hazard
is to trigger wrap-around more easily, delaying a bit grace periods
when that happens. This should be a rare event though. Yet add
debugging and torture code to test that assumption.
- Fix memory leak on locktorture module
- Annotate accesses in rculist_nulls.h to prevent from KCSAN warnings.
On recent discussions, we also concluded that all those WRITE_ONCE()
and READ_ONCE() on list APIs deserve appropriate comments. Something
to be expected for the next cycle.
- Provide a script to apply several configs to several commits with torture.
- Allow torture to reuse a build directory in order to save needless
rebuild time.
- Various cleanups.
'tpm2_load_cmd' allocates a tempoary blob indirectly via 'tpm2_key_decode'
but it is not freed in the failure paths. Address this by wrapping the blob
into with a cleanup helper.
Cc: stable@vger.kernel.org # v5.13+
Fixes: f221974525 ("security: keys: trusted: use ASN.1 TPM2 key format for the blobs")
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
'trusted_tpm2' duplicates 'tpm2_hash_map' originally part of the TPN
driver, which is suboptimal.
Implement and export `tpm2_find_hash_alg()` in the driver, and substitute
the redundant code in 'trusted_tpm2' with a call to the new function.
Reviewed-by: Jonathan McDowell <noodles@meta.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
The makefile logic to detect if rust is enabled is not working
the way it was expected: instead of using the current setup
for CONFIG_RUST, it uses a cached version from a previous build.
The root cause is that the current logic inside docs/Makefile
uses a cached version of CONFIG_RUST, from the last time a non
documentation target was executed. That's perfectly fine for
Sphinx build, as it doesn't need to read or depend on any
CONFIG_*.
So, instead of relying at the cache, move the logic to the
wrapper script and let it check the current content of .config,
to verify if CONFIG_RUST was selected.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <c06b1834ef02099735c13ee1109fa2a2b9e47795.1763722971.git.mchehab+huawei@kernel.org>
kdoc is looking for "@value" here, so use that kind of string in the
warning message. The "%value" can be confusing.
This changes:
Warning: drivers/net/wireless/mediatek/mt76/testmode.h:92 Excess enum value '%MT76_TM_ATTR_TX_PENDING' description in 'mt76_testmode_attr'
to this:
Warning: drivers/net/wireless/mediatek/mt76/testmode.h:92 Excess enum value '@MT76_TM_ATTR_TX_PENDING' description in 'mt76_testmode_attr'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251126061752.3497106-1-rdunlap@infradead.org>
The paragraph mentions only removal of Tested-by and Reviewed-by tags as
action needing mentioning in patch changelog, so some developers treat
it too literally. Acks, as a weaker form of review/approval, should
rarely be removed, but if that happens it should be explained as well.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251126081905.7684-2-krzysztof.kozlowski@oss.qualcomm.com>
Recognize and ignore __rcu (in struct members), __private (in struct
members), and __always_unused (in function parameters) to prevent
kernel-doc warnings:
Warning: include/linux/rethook.h:38 struct member 'void (__rcu *handler' not described in 'rethook'
Warning: include/linux/hrtimer_types.h:47 Invalid param: enum hrtimer_restart (*__private function)(struct hrtimer *)
Warning: security/ipe/hooks.c:81 function parameter '__always_unused' not described in 'ipe_mmap_file'
Warning: security/ipe/hooks.c:109 function parameter '__always_unused' not described in 'ipe_file_mprotect'
There are more of these (in compiler_types.h, compiler_attributes.h)
that can be added as needed.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251127063117.150384-1-rdunlap@infradead.org>
Chinese translation docs for 6.19
This is the Chinese translation subtree for 6.19. It includes
the following changes:
- Add block part translations
- Update kbuild.rst translations
- Add more scsi translations and fixes
Above patches are tested by 'make htmldocs'
Signed-off-by: Alex Shi <alexs@kernel.org>
Add performance testing for common context synchronizations
(Preemption, IRQ, Softirq) and per-cpu increments. Those are
relevant comparisons against SRCU-fast read side APIs, especially
as they are planned to synchronize further tracing fast-path code.
_ Properly handle SRCU readers within IRQ disabled sections in tiny SRCU
- Preparation to reimplement RCU Tasks Trace on top of SRCU fast:
- Introduce API to expedite a grace period and test it through
rcutorture.
- Split srcu-fast in two flavours: SRCU-fast and SRCU-fast-updown.
Both are still targeted toward faster readers (without full
barriers on LOCK and UNLOCK) at the expense of heavier write side
(using full RCU grace period ordering instead of simply full
ordering) as compared to "traditional" non-fast SRCU. But those
srcu-fast flavours are going to be optimized in two different
ways:
- SRCU-fast will become the reimplementation basis for
RCU-TASK-TRACE for consolidation. Since RCU-TASK-TRACE must
be NMI safe, SRCU-fast must be as well.
- SRCU-fast-updown will be needed for uretprobes code in order
to get rid of the read-side memory barriers while still
allowing entering the reader at task level while exiting it
in a timer handler. It is considered semaphore-like in that
it can have different owners between LOCK and UNLOCK.
However it is not NMI-safe.
The actual optimizations are work in progress for the next cycle.
Only the new interfaces are added for now, along with related
torture and scalability test code.
- Create/document/debug/torture new proper initializers for RCU fast:
DEFINE_SRCU_FAST() and init_srcu_struct_fast()
This allows for using right away the proper ordering on the write
side (either full ordering or full RCU grace period ordering) without
waiting for the read side to tell which to use. Also this optimizes
the read side altogether with moving flavour debug checks to debug
config and with removing a costly RmW operation on their first call.
- Make some diagnostic functions tracing safe.
s/it/if/ and s/revokation/revocation/, capitalize "clear", and add a
period after the sentence. Fix the comment formatting.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
strncpy() is deprecated for NUL-terminated destination buffers; use
strscpy_pad() instead to retain the NUL-padding behavior of strncpy().
The destination buffer is initialized using kzalloc() with a 'signature'
size of ECRYPTFS_PASSWORD_SIG_SIZE + 1. strncpy() then copies up to
ECRYPTFS_PASSWORD_SIG_SIZE bytes from 'key_desc', NUL-padding any
remaining bytes if needed, but expects the last byte to be zero.
strscpy_pad() also copies the source string to 'signature', and NUL-pads
the destination buffer if needed, but ensures it's always NUL-terminated
without relying on it being zero-initialized.
strscpy_pad() automatically determines the size of the fixed-length
destination buffer via sizeof() when the optional size argument is
omitted, making an explicit size unnecessary.
In encrypted_init(), the source string 'key_desc' is validated by
valid_ecryptfs_desc() before calling ecryptfs_fill_auth_tok(), and is
therefore NUL-terminated and satisfies the __must_be_cstr() requirement
of strscpy_pad().
Link: https://github.com/KSPP/linux/issues/90
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
The local variables 'size_t datalen' are unsigned and cannot be less
than zero. Remove the redundant conditions.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
The macro for_each_console_srcu iterates over all registered consoles. It's
implied that all registered consoles have CON_ENABLED flag set, making
the check for the flag unnecessary. Call console_is_usable function to
fully verify if the given console is usable before calling the ->unblank
callback.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
The original code tried to find a console that has CON_BOOT _or_
CON_ENABLED flag set. The flag CON_ENABLED is set to all registered
consoles, so in this case this check is always true, even for the
CON_BOOT consoles.
The initial intent of the kgdboc_earlycon_init was to get a console
early (CON_BOOT) or later on in the process (CON_ENABLED). The
code was using for_each_console macro, meaning that all console structs
were previously registered on the printk() machinery. At this point,
any console found on for_each_console is safe for kgdboc_earlycon_init
to use.
Dropping the check makes the code cleaner, and avoids further confusion
by future readers of the code.
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20251121-printk-cleanup-part2-v2-1-57b8b78647f4@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
This commit causes rcutorture's srcu_torture_init() and
srcud_torture_init() functions to announce on the console log
which variant of SRCU is being tortured, for example: "torture:
srcud_torture_init fast SRCU".
[ paulmck: Apply feedback from kernel test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit creates an SRCU-fast-updown API, including
DEFINE_SRCU_FAST_UPDOWN(), DEFINE_STATIC_SRCU_FAST_UPDOWN(),
__init_srcu_struct_fast_updown(), init_srcu_struct_fast_updown(),
srcu_read_lock_fast_updown(), srcu_read_unlock_fast_updown(),
__srcu_read_lock_fast_updown(), and __srcu_read_unlock_fast_updown().
These are initially identical to their SRCU-fast counterparts, but both
SRCU-fast and SRCU-fast-updown will be optimized in different directions
by later commits. SRCU-fast will lack any sort of srcu_down_read() and
srcu_up_read() APIs, which will enable extremely efficient NMI safety.
For its part, SRCU-fast-updown will not be NMI safe, which will enable
reasonably efficient implementations of srcu_down_read_fast() and
srcu_up_read_fast().
This API fork happens to meet two different future use cases.
* SRCU-fast will become the reimplementation basis for RCU-TASK-TRACE
for consolidation. Since RCU-TASK-TRACE must be NMI safe, SRCU-fast
must be as well.
* SRCU-fast-updown will be needed for uretprobes code in order to get
rid of the read-side memory barriers while still allowing entering the
reader at task level while exiting it in a timer handler.
This commit also adds rcutorture tests for the new APIs. This
(annoyingly) needs to be in the same commit for bisectability. With this
commit, the 0x8 value tests SRCU-fast-updown. However, most SRCU-fast
testing will be via the RCU Tasks Trace wrappers.
[ paulmck: Apply s/0x8/0x4/ missing change per Boqun Feng feedback. ]
[ paulmck: Apply Akira Yokosawa feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Translate .../scsi/wd719x.rst into Chinese.
Add wd719x into .../scsi/index.rst.
Update the translation through commit 40ee63091a
("scsi: docs: convert wd719x.txt to ReST")
Signed-off-by: Yujie Zhang <yjzhang@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Translate .../scsi/libsas.rst into Chinese.
Add libsas into .../scsi/index.rst.
Update the translation through commit 25882c82f8
("scsi: libsas: Delete lldd_clear_aca callback")
Signed-off-by: Yujie Zhang <yjzhang@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Merge series "slab: cmpxchg cleanups enabled by -fms-extensions"
From the cover letter [1]:
After learning about -fms-extensions being enabled for 6.19, I realized
there is some cleanup potential in slub code by extending the definition
and usage of freelist_aba_t, as it can now become an unnamed member of
struct slab. This series performs the cleanup, with no functional
changes intended. Additionally we turn freelist_aba_t to struct
freelist_counters as it doesn't meet any criteria for being a typedef,
per Documentation/process/coding-style.rst
Based on the tag kbuild-ms-extensions-6.19 from
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linuxV
Link: https://lore.kernel.org/all/20251107-slab-fms-cleanup-v1-0-650b1491ac9e@suse.cz/#t [1]
Merge series "Prepare slab for memdescs" by Matthew Wilcox.
From the cover letter [1]:
When we separate struct folio, struct page and struct slab from each
other, converting to folios then to slabs will be nonsense. It made
sense under the 'folio is just a head page' interpretation, but with
full separation, page_folio() will return NULL for a page which belongs
to a slab.
This patch series removes almost all mentions of folio from slab.
There are a few folio_test_slab() invocations left around the tree that
I haven't decided how to handle yet. We're not yet quite at the point
of separately allocating struct slab, but that's what I'll be working
on next.
Link: https://lore.kernel.org/all/20251113000932.1589073-1-willy@infradead.org/ [1]
This patch series introduces support for `syn` (and its dependencies):
Syn is a parsing library for parsing a stream of Rust tokens into a
syntax tree of Rust source code.
Currently this library is geared toward use in Rust procedural
macros, but contains some APIs that may be useful more generally.
It is the most downloaded Rust crate (according to crates.io), and it
is also used by the Rust compiler itself. Having such support allows to
greatly simplify writing complex macros such as `pin-init`. We will use
it in the `macros` crate too.
Benno has already prepared the `pin-init` version based on this, and on
top of that, we will be able to simplify the `macros` crate too. I think
Jesung is working on updating the `TryFrom` and `Into` upcoming derive
macros to use `syn` too.
The series starts with a few preparation commits (two fixes were already
merged in mainline that were discovered by this series), then each crate
is added. Finally, support for using the new crates from our `macros`
crate is introduced.
This has been a long time coming, e.g. even before Rust for Linux was
merged into the Linux kernel, Gary and Benno have wanted to use `syn`.
The first iterations of this, from 2022 and 2023 (with `serde` too,
another popular crate), are at:
https://github.com/Rust-for-Linux/linux/pull/910https://github.com/Rust-for-Linux/linux/pull/1007
After those, we considered picking these from the distributions where
possible. However, after discussing it, it is not really worth the
complexity: vendoring makes things less complex and is less fragile.
In particular, we avoid having to support and test several versions,
we avoid having to introduce Cargo just to properly fetch the right
versions from the registry, we can easily customize the crates if needed
(e.g. dropping the `unicode_idents` dependency like it is done in this
series) and we simplify the configuration of the build for users for
which the "default" paths/registries would not have worked.
Moreover, nowadays, the ~57k lines introduced are not that much compared
to years ago (it dwarfed the actual Rust kernel code). Moreover, back
then it wasn't clear the Rust experiment would be a success, so it would
have been a bit pointless/risky to add many lines for nothing. Our macro
needs were also smaller in the early days.
So, finally, in Kangrejos 2025 we discussed going with the original,
simpler approach. Thus here it is the result.
There should not be many updates needed for these, and even if there
are, they should not be too big, e.g. +7k -3k lines across the 3 crates
in the last year.
Note that `syn` does not have all the features enabled, since we do not
need them so far, but they can easily be enabled just adding them to the
list.
Link: https://patch.msgid.link/20251124151837.2184382-1-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Originally, when the Rust upstream `alloc` standard library crate was
vendored in commit 057b8d2571 ("rust: adapt `alloc` crate to the
kernel"), the SPDX License Identifiers were added to every file so that
the license on those was clear.
Thus do the same for the `proc-macro2` crate.
This makes `scripts/spdxcheck.py` pass.
Reviewed-by: Gary Guo <gary@garyguo.net>
Tested-by: Gary Guo <gary@garyguo.net>
Tested-by: Jesung Yang <y.j3ms.n@gmail.com>
Link: https://patch.msgid.link/20251124151837.2184382-8-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
We need to handle `cfg`s in both `rustc` and `rust-analyzer`, and in
future commits some of those contain double quotes, which complicates
things further.
Thus, instead of removing the `--cfg ` part in the rust-analyzer
generation script, have the `*-cfgs` variables contain just the actual
`cfg`, and use that to generate the actual flags in `*-flags`.
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Tested-by: Gary Guo <gary@garyguo.net>
Tested-by: Jesung Yang <y.j3ms.n@gmail.com>
Link: https://patch.msgid.link/20251124151837.2184382-3-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
In the next commits we are introducing `*-{cfgs,skip_flags,flags}`
variables for other crates.
Thus do so here for `core`, which simplifies a bit the `Makefile`
(including the next commit) and makes it more consistent.
This means we stop passing `-Wrustdoc::unescaped_backticks` to `rustc`
and `-Wunreachable_pub` to `rustdoc`, i.e. we skip more, which is fine
since it shouldn't have an effect.
In addition, use `:=` for `core-cfgs` to make it consistent with the
upcoming additions.
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Tested-by: Gary Guo <gary@garyguo.net>
Tested-by: Jesung Yang <y.j3ms.n@gmail.com>
Link: https://patch.msgid.link/20251124151837.2184382-2-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
I've long since stopped having the time to contribute code or reviews,
this acknowledges that.
Geoffrey Thomas and I created the "linux-kernel-module-rust" project at
PyCon in 2018, as an experiment to see if we could make it possible to
write kernel modules in Rust. The Rust for Linux effort has far exceeded
anything we could have expected at the time.
I want to thank all the Rust for Linux contributors, past and present,
who have helped make this a reality -- and in particularly Miguel, who
really transformed this project from an interesting demo to something
that could really land in mainline.
Signed-off-by: Alex Gaynor <alex.gaynor@gmail.com>
Link: https://patch.msgid.link/CAFRnB2U1uvg1vyZe1kDi7L3P4kTFowfOo6Hfo9WJED4qve4ZZw@mail.gmail.com
[ Reflowed. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Currently when the length of a symbol is longer than 0x7f characters,
its type shown in /proc/kallsyms can be incorrect.
I found this issue when reading the code, but it can be reproduced by
following steps:
1. Define a function which symbol length is 130 characters:
#define X13(x) x##x##x##x##x##x##x##x##x##x##x##x##x
static noinline void X13(x123456789)(void)
{
printk("hello world\n");
}
2. The type in vmlinux is 't':
$ nm vmlinux | grep x123456
ffffffff816290f0 t x123456789x123456789x123456789x12[...]
3. Then boot the kernel, the type shown in /proc/kallsyms becomes 'g'
instead of the expected 't':
# cat /proc/kallsyms | grep x123456
ffffffff816290f0 g x123456789x123456789x123456789x12[...]
The root cause is that, after commit 73bbb94466 ("kallsyms: support
"big" kernel symbols"), ULEB128 was used to encode symbol name length.
That is, for "big" kernel symbols of which name length is longer than
0x7f characters, the length info is encoded into 2 bytes.
kallsyms_get_symbol_type() expects to read the first char of the
symbol name which indicates the symbol type. However, due to the
"big" symbol case not being handled, the symbol type read from
/proc/kallsyms may be wrong, so handle it properly.
Cc: stable@vger.kernel.org
Fixes: 73bbb94466 ("kallsyms: support "big" kernel symbols")
Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com>
Acked-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20241011143853.3022643-1-zhengyejian@huaweicloud.com
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
With commit ("printk: Avoid scheduling irq_work on suspend") the
implementation of printk_get_console_flush_type() was modified to
avoid offloading when irq_work should be blocked during suspend.
Since printk uses the returned flush type to determine what
flushing methods are used, this was thought to be sufficient for
avoiding irq_work usage during the suspend phase.
However, vprintk_emit() implements a hack to support
printk_deferred(). In this hack, the returned flush type is
adjusted to make sure no legacy direct printing occurs when
printk_deferred() was used.
Because of this hack, the legacy offloading flushing method can
still be used, causing irq_work to be queued when it should not
be.
Adjust the vprintk_emit() hack to also consider
@console_irqwork_blocked so that legacy offloading will not be
chosen when irq_work should be blocked.
Link: https://lore.kernel.org/lkml/87fra90xv4.fsf@jogness.linutronix.de
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Fixes: 26873e3e7f ("printk: Avoid scheduling irq_work on suspend")
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
The partial block length returned by a block-only driver should
not be passed up to the caller since ahash itself deals with the
partial block data.
Set err to zero in ahash_update_finish if it was positive.
Reported-by: T Pratham <t-pratham@ti.com>
Tested-by: T Pratham <t-pratham@ti.com>
Fixes: 9d7a0ab1c7 ("crypto: ahash - Handle partial blocks in API")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Restore the partial block buffer in crypto_ahash_import by copying
it. Check whether the partial block buffer exceeds the maximum
size and return -EOVERFLOW if it does.
Zero the partial block buffer in crypto_ahash_import_core.
Reported-by: T Pratham <t-pratham@ti.com>
Tested-by: T Pratham <t-pratham@ti.com>
Fixes: 9d7a0ab1c7 ("crypto: ahash - Handle partial blocks in API")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'.
Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long'
and so cannot discard significant bits.
In this case the 'unsigned long' value is small enough that the result
is ok.
Detected by an extra check added to min_t().
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'.
Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long'
and so cannot discard significant bits.
In this case the 'unsigned long' value is small enough that the result
is ok.
Detected by an extra check added to min_t().
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'.
Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long'
and so cannot discard significant bits.
In this case the 'unsigned long' value is small enough that the result
is ok.
Detected by an extra check added to min_t().
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The ctx array in struct sdesc is never used. Delete it as it's
bogus since the previous member ends with a flexible array.
Reported-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Ever since commit da7f033ddc ("crypto: cryptomgr - Add test
infrastructure"), the DES test suite has tested only one of the four
weak keys and none of the twelve semi-weak keys.
DES has four weak keys and twelve semi-weak keys, and the kernel's DES
implementation correctly detects and rejects all of these keys when the
CRYPTO_TFM_REQ_FORBID_WEAK_KEYS flag is set. However, only a single weak
key was being tested. Add tests for all 16 weak and semi-weak keys.
While DES is deprecated, it is still used in some legacy protocols, and
weak/semi-weak key detection should be tested accordingly.
Tested on arm64 with cryptographic self-tests.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
`from_expr` relies on `build_assert` to infer that the passed expression
fits the type's boundaries at build time. That inference can only be
successful its code (and that of `fits_within`, which performs the
check) is inlined, as a dedicated function would need to work with a
variable and cannot verify that property.
While inlining happens as expected in most cases, it is not guaranteed.
In particular, kernel options that optimize for size like
`CONFIG_CC_OPTIMIZE_FOR_SIZE` can result in `from_expr` not being
inlined.
Add `#[inline(always)]` attributes to both `fits_within` and `from_expr`
to make the compiler inline these functions more aggressively, as it
does not make sense to use them non-inlined anyway.
[ For reference, the errors look like:
ld.lld: error: undefined symbol: rust_build_error
>>> referenced by build_assert.rs:83 (rust/kernel/build_assert.rs:83)
>>> rust/doctests_kernel_generated.o:(<kernel::num::bounded::Bounded<u8, 1>>::from_expr) in archive vmlinux.a
- Miguel ]
Fixes: 01e345e82e ("rust: num: add Bounded integer wrapping type")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511210055.RUsFNku1-lkp@intel.com/
Suggested-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://patch.msgid.link/20251122-bounded_ints_fix-v1-1-1e07589d4955@nvidia.com
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Add a version of the mempool allocator that works for batch allocations
of multiple objects. Calling mempool_alloc in a loop is not safe because
it could deadlock if multiple threads are performing such an allocation
at the same time.
As an extra benefit the interface is build so that the same array can be
used for alloc_pages_bulk / release_pages so that at least for page
backed mempools the fast path can use a nice batch optimization.
Note that mempool_alloc_bulk does not take a gfp_mask argument as it
must always be able to sleep and doesn't support any non-trivial
modifiers. NOFO or NOIO constrainst must be set through the scoped API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-8-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Add a helper for the mempool_alloc slowpath to better separate it from the
fast path, and also use it to implement mempool_alloc_preallocated which
shares the same logic.
[hughd@google.com: fix lack of retrying with __GFP_DIRECT_RECLAIM]
[vbabka@suse.cz: really use limited flags for first mempool attempt]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113084022.1255121-7-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
If the linker emits warnings these should abort the build.
Otherwise they will be swallowed by run-tests.sh and not shown.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
LLVM 21 switched to -mcmodel=medium for LoongArch64 compilations.
This code model uses R_LARCH_ECALL36 relocations which might not be
supported by GNU ld which to nolibc testsuite uses by default.
ld will not resolve the relocation and all function calls will end up
as busy loops.
Use lld instead.
We can not switch to lld for all LLVM builds, as it does not support all
necessary architectures.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
This reverts commit 0f8d42bf12, with the
memcpy_sglist() part dropped.
Now that memcpy_sglist() no longer uses the skcipher_walk code, the
skcipher_walk code can be moved back to where it belongs.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The original implementation of memcpy_sglist() was broken because it
didn't handle scatterlists that describe exactly the same memory, which
is a case that many callers rely on. The current implementation is
broken too because it calls the skcipher_walk functions which can fail.
It ignores any errors from those functions.
Fix it by replacing it with a new implementation written from scratch.
It always succeeds. It's also a bit faster, since it avoids the
overhead of skcipher_walk. skcipher_walk includes a lot of
functionality (such as alignmask handling) that's irrelevant here.
Reported-by: Colin Ian King <coking@nvidia.com>
Closes: https://lore.kernel.org/r/20251114122620.111623-1-coking@nvidia.com
Fixes: 131bdceca1 ("crypto: scatterwalk - Add memcpy_sglist")
Fixes: 0f8d42bf12 ("crypto: scatterwalk - Move skcipher walk and use it for memcpy_sglist")
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since the crypto_shash support for poly1305 was removed, the tcrypt
support for it is now unused as well. Support for benchmarking the
kernel's Poly1305 code is now provided by the poly1305 kunit test.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Remove ansi_cprng, since it's obsolete and unused, as confirmed at
https://lore.kernel.org/r/aQxpnckYMgAAOLpZ@gondor.apana.org.au/
This was originally added in 2008, apparently as a FIPS approved random
number generator. Whether this has ever belonged upstream is
questionable. Either way, ansi_cprng is no longer usable for this
purpose, since it's been superseded by the more modern algorithms in
crypto/drbg.c, and FIPS itself no longer allows it. (NIST SP 800-131A
Rev 1 (2015) says that RNGs based on ANSI X9.31 will be disallowed after
2015. NIST SP 800-131A Rev 2 (2019) confirms they are now disallowed.)
Therefore, there is no reason to keep it around.
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Haotian Zhang <vulab@iscas.ac.cn>
Cc: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Uninitialized pointers with `__free` attribute can cause undefined
behavior as the memory assigned randomly to the pointer is freed
automatically when the pointer goes out of scope.
crypto/asymmetric_keys doesn't have any bugs related to this as of now,
but, it is better to initialize and assign pointers with `__free`
attribute in one statement to ensure proper scope-based cleanup
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/aPiG_F5EBQUjZqsl@stanley.mountain/
Signed-off-by: Ally Heev <allyheev@gmail.com>
Reviewed-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Use the new TRAILING_OVERLAP() helper to fix the following warning:
crypto/asymmetric_keys/restrict.c:20:34: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
This helper creates a union between a flexible-array member (FAM) and a
set of MEMBERS that would otherwise follow it.
This overlays the trailing MEMBER unsigned char data[10]; onto the FAM
struct asymmetric_key_id::data[], while keeping the FAM and the start
of MEMBER aligned.
The static_assert() ensures this alignment remains, and it's
intentionally placed inmediately after the corresponding structures --no
blank line in between.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Fix error handling in cc_map_hash_request_update where sg_nents_for_len
return value was assigned to u32, converting negative errors to large
positive values before passing to sg_copy_to_buffer.
Check sg_nents_for_len return value and propagate errors before
assigning to areq_ctx->in_nents.
Fixes: b7ec853068 ("crypto: ccree - use std api when possible")
Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The return value of sg_nents_for_len was assigned to an unsigned long
in starfive_hash_digest, causing negative error codes to be converted
to large positive integers.
Add error checking for sg_nents_for_len and return immediately on
failure to prevent potential buffer overflows.
Fixes: 7883d1b28a ("crypto: starfive - Add hash and HMAC support")
Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The commit1 def98c84b6 ("workqueue: Fix spurious sanity check failures
in destroy_workqueue()") tries to fix spurious sanity check failures by
stopping send_mayday() via setting wq->rescuer to NULL.
But it fails to stop the pwq->mayday_node requeuing in the rescuer, and
the commit2 e66b39af00 ("workqueue: Fix pwq ref leak in
rescuer_thread()") fixes it by checking wq->rescuer which is the result
of commit1.
Both commits together really fix spurious sanity check failures caused
by the rescuer, but they both use a convoluted method by relying on
wq->rescuer state rather than the real count of work items.
Actually __WQ_DESTROYING and drain_workqueue() together already stop
send_mayday() by draining all the work items and ensuring no new work
item requeuing.
And the more proper fix to stop the pwq->mayday_node requeuing in the
rescuer is from commit3 4f3f4cf388 ("workqueue: avoid unneeded
requeuing the pwq in rescuer thread") and renders the checking of
wq->rescuer in commit2 unnecessary.
So __WQ_DESTROYING, drain_workqueue() and commit3 together fix spurious
sanity check failures introduced by the rescuer.
Just remove the convoluted code of using wq->rescuer.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If the pwq does not need rescue (normal workers have been created or
become available), the rescuer can immediately move on to other stalled
pwqs.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Move the code to assign work to rescuer and assign_rescuer_work().
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Mauro says:
That's the final series to complete the migration of documentation
build: it converts get_feat from Perl to Python.
V2 is technically identical to v1: the only difference is that it
now uses tools/lib/python/feat to store the library logic.
With that, no Sphinx in-kernel extensions use fork anymore to call
ancillary scripts: everything is now importing Python methods
directly from the libraries.
There's nothing special on this conversion: it is a direct translation,
almost bug-compatible with the original version (*).
(*) I did solve two or three caveats on patch 1.
Most of the complexity of the script relies at the logic to produce
ReST tables. I do have here on my internal scripts a (somewhat) generic
formatter for ReST tables in Python. I was tempted to convert the logic
to use it, but, as this could cause regressions, I opted to not do it
right now, mainly because the matrix table logic is complex. Also,
I'm tempted to modify a little bit the output there, but extra tests
are required to see if PDF output would work with complex tables (I
remember I had a problem with that in the past). So, I'm postponing
such extra cleanup.
As we want to call Python code directly at the Sphinx extension,
convert get_feat.pl to Python.
The code was made to be (almost) bug-compatible with the Perl
version, with two exceptions:
1. Currently, Perl script outputs a wrong table if arch is set
to a non-existing value;
2. the ReST table output when --feat is used without --arch
has an invalid format, as the number of characters for the
table delimiters are wrong.
Those two bugs were fixed while testing the conversion.
Additionally, another caveat was solved:
the output when --feat is used without arch and the feature
doesn't exist doesn't contain an empty table anymore.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <03c26cee1ec567804735a33047e625ef5ab7bfa8.1763492868.git.mchehab+huawei@kernel.org>
This patch updates the Linux documentation for cscope, fixing two issues:
1. Corrects the typo in the command line:
c"scope -d -p10 -> cscope -d -p10
2. Fixes the related documentation comment for clarity and correctness:
cscope by default cscope.out database.
->
cscope by default uses the cscope.out database.
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251119065727.3500015-1-jiakaiPeanut@gmail.com>
In ima_match_rules(), if ima_filter_rule_match() returns -ENOENT due to
the rule being NULL, the function incorrectly skips the 'if (!rc)' check
and sets 'result = true'. The LSM rule is considered a match, causing
extra files to be measured by IMA.
This issue can be reproduced in the following scenario:
After unloading the SELinux policy module via 'semodule -d', if an IMA
measurement is triggered before ima_lsm_rules is updated,
in ima_match_rules(), the first call to ima_filter_rule_match() returns
-ESTALE. This causes the code to enter the 'if (rc == -ESTALE &&
!rule_reinitialized)' block, perform ima_lsm_copy_rule() and retry. In
ima_lsm_copy_rule(), since the SELinux module has been removed, the rule
becomes NULL, and the second call to ima_filter_rule_match() returns
-ENOENT. This bypasses the 'if (!rc)' check and results in a false match.
Call trace:
selinux_audit_rule_match+0x310/0x3b8
security_audit_rule_match+0x60/0xa0
ima_match_rules+0x2e4/0x4a0
ima_match_policy+0x9c/0x1e8
ima_get_action+0x48/0x60
process_measurement+0xf8/0xa98
ima_bprm_check+0x98/0xd8
security_bprm_check+0x5c/0x78
search_binary_handler+0x6c/0x318
exec_binprm+0x58/0x1b8
bprm_execve+0xb8/0x130
do_execveat_common.isra.0+0x1a8/0x258
__arm64_sys_execve+0x48/0x68
invoke_syscall+0x50/0x128
el0_svc_common.constprop.0+0xc8/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x44/0x200
el0t_64_sync_handler+0x100/0x130
el0t_64_sync+0x3c8/0x3d0
Fix this by changing 'if (!rc)' to 'if (rc <= 0)' to ensure that error
codes like -ENOENT do not bypass the check and accidentally result in a
successful match.
Fixes: 4af4662fa4 ("integrity: IMA policy")
Signed-off-by: Zhao Yipeng <zhaoyipeng5@huawei.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Currently, the check for whether a partition is populated does not
account for tasks in the cpuset of attaching. This is a corner case
that can leave a task stuck in a partition with no effective CPUs.
The race condition occurs as follows:
cpu0 cpu1
//cpuset A with cpu N
migrate task p to A
cpuset_can_attach
// with effective cpus
// check ok
// cpuset_mutex is not held // clear cpuset.cpus.exclusive
// making effective cpus empty
update_exclusive_cpumask
// tasks_nocpu_error check ok
// empty effective cpus, partition valid
cpuset_attach
...
// task p stays in A, with non-effective cpus.
To fix this issue, this patch introduces cs_is_populated, which considers
tasks in the attaching cpuset. This new helper is used in validate_change
and partition_is_populated.
Fixes: e2d59900d9 ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Finish the translation of kbuild/kbuild.rst.
Update to commit 5cbfb4da7e06 ("kbuild: doc: improve
KBUILD_BUILD_TIMESTAMP documentation")
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
Reviewed-by: WangYuli <wanyl5933@chinaunicom.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
The affinity to set to the rescuers should be consistent in all paths
when a rescuer is in detached state. The affinity could be either
wq_unbound_cpumask or unbound_effective_cpumask(wq).
Related paths:
rescuer's worker_detach_from_pool()
update wq_unbound_cpumask
update wq's cpumask
init_rescuer()
Both affinities are Ok as long as they are consistent in all paths.
In the commit 449b31ad29 ("workqueue: Init rescuer's affinities as
the wq's effective cpumask") makes init_rescuer use
unbound_effective_cpumask(wq) which is consistent with then
apply_wqattrs_commit().
But using unbound_effective_cpumask(wq) requres much more code to
maintain the consistency, and it doesn't make much sense since the
affinity is only effective when the rescuer is not processing works.
wq_unbound_cpumask is more favorable.
So apply_wqattrs_commit() and the path of "updating wq's cpumask" had
been changed to not update the rescuer's affinity, and both the paths
of "updating wq_unbound_cpumask" and "rescuer's
worker_detach_from_pool()" had been changed to use wq_unbound_cpumask.
Now, make init_rescuer() use wq_unbound_cpumask for rescuer's affinity
and make all the paths consistent.
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When workqueue cpumask changes are committed, the DISASSOCIATED workers
affinity is not touched and this might be a problem down the line for
isolated setups when the DISASSOCIATED pools still have works to run
after the cpu is offline.
Make sure the workers' affinity is updated every time a workqueue cpumask
changes, so these workers can't break isolation.
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a rescuer is attached to a pool, its affinity should be only
managed by the pool.
But updating the detached rescuer's affinity is still meaningful so
that it will not disrupt isolated CPUs when it is to be waken up.
But the commit d64f2fa064 ("kernel/workqueue: Let rescuers follow
unbound wq cpumask changes") updates the affinity unconditionally, and
causes some issues
1) it also changes the affinity when the rescuer is already attached to
a pool, which violates the affinity management.
2) the said commit tries to update the affinity of the rescuers, but it
misses the rescuers of the PERCPU workqueues, and isolated CPUs can
be possibly disrupted by these rescuers when they are summoned.
3) The affinity to set to the rescuers should be consistent in all paths
when a rescuer is in detached state. The affinity could be either
wq_unbound_cpumask or unbound_effective_cpumask(wq). Related paths:
rescuer's worker_detach_from_pool()
update wq_unbound_cpumask
update wq's cpumask
init_rescuer()
Both affinities are Ok as long as they are consistent in all paths.
But using unbound_effective_cpumask(wq) requres much more code to
maintain the consistency, and it doesn't make much sense since the
affinity is only effective when the rescuer is not processing works.
wq_unbound_cpumask is more favorable.
Fix the 1) issue by testing rescuer->pool before updating with
wq_pool_attach_mutex held.
Fix the 2) issue by moving the rescuer's affinity updating code to
the place updating wq_unbound_cpumask and make it also update for
PERCPU workqueues.
Partially cleanup the 3) consistency issue by using wq_unbound_cpumask.
So that the path of "updating wq's cpumask" doesn't need to maintain it.
and both the paths of "updating wq_unbound_cpumask" and "rescuer's
worker_detach_from_pool()" use wq_unbound_cpumask.
Cleanup for init_rescuer()'s consistency for affinity can be done in
future.
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit e6366101ce ("tools/nolibc: remove __nolibc_enosys() fallback
from time64-related functions") removed many of these fallbacks but
forgot a few.
Finish the job.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
As off_t is now always 64-bit wide this overflow can not happen anymore,
remove the check.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Make sure to always use the 64-bit safe system call
in preparation for 64-bit off_t on 32 bit architectures.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
When cross-compilation is not used, BPFOBJ and HOST_BPFOBJ are identical
files, libbpf.a, and duplicate libbpf.a files should be removed.
Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
*** Bug description ***
When testing kexec-reboot on a 144 cpus machine with
isolcpus=managed_irq,domain,1-71,73-143 in kernel command line, I
encounter the following bug:
[ 97.114759] psci: CPU142 killed (polled 0 ms)
[ 97.333236] Failed to offline CPU143 - error=-16
[ 97.333246] ------------[ cut here ]------------
[ 97.342682] kernel BUG at kernel/cpu.c:1569!
[ 97.347049] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[...]
In essence, the issue originates from the CPU hot-removal process, not
limited to kexec. It can be reproduced by writing a SCHED_DEADLINE
program that waits indefinitely on a semaphore, spawning multiple
instances to ensure some run on CPU 72, and then offlining CPUs 1–143
one by one. When attempting this, CPU 143 failed to go offline.
bash -c 'taskset -cp 0 $$ && for i in {1..143}; do echo 0 > /sys/devices/system/cpu/cpu$i/online 2>/dev/null; done'
Tracking down this issue, I found that dl_bw_deactivate() returned
-EBUSY, which caused sched_cpu_deactivate() to fail on the last CPU.
But that is not the fact, and contributed by the following factors:
When a CPU is inactive, cpu_rq()->rd is set to def_root_domain. For an
blocked-state deadline task (in this case, "cppc_fie"), it was not
migrated to CPU0, and its task_rq() information is stale. So its rq->rd
points to def_root_domain instead of the one shared with CPU0. As a
result, its bandwidth is wrongly accounted into a wrong root domain
during domain rebuild.
*** Issue ***
The key point is that root_domain is only tracked through active rq->rd.
To avoid using a global data structure to track all root_domains in the
system, there should be a method to locate an active CPU within the
corresponding root_domain.
*** Solution ***
To locate the active cpu, the following rules for deadline
sub-system is useful
-1.any cpu belongs to a unique root domain at a given time
-2.DL bandwidth checker ensures that the root domain has active cpus.
Now, let's examine the blocked-state task P.
If P is attached to a cpuset that is a partition root, it is
straightforward to find an active CPU.
If P is attached to a cpuset that has changed from 'root' to 'member',
the active CPUs are grouped into the parent root domain. Naturally, the
CPUs' capacity and reserved DL bandwidth are taken into account in the
ancestor root domain. (In practice, it may be unsafe to attach P to an
arbitrary root domain, since that domain may lack sufficient DL
bandwidth for P.) Again, it is straightforward to find an active CPU in
the ancestor root domain.
This patch groups CPUs into isolated and housekeeping sets. For the
housekeeping group, it walks up the cpuset hierarchy to find active CPUs
in P's root domain and retrieves the valid rd from cpu_rq(cpu)->rd.
Signed-off-by: Pingfan Liu <piliu@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Chen Ridong <chenridong@huaweicloud.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Pierre Gondois <pierre.gondois@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Parsing KTAP is quite an inconvenience, but most of the time the thing
you really want to know is "did anything fail"?
Let's give the user the his information without them needing
to parse anything.
Because of the use of subshells and namespaces, this needs to be
communicated via a file. Just write arbitrary data into the file and
treat non-empty content as a signal that something failed.
In case any user depends on the current behaviour, such as running this
from a script with `set -e` and parsing the result for failures
afterwards, add a flag they can set to get the old behaviour, namely
--no-error-on-fail.
Link: https://lore.kernel.org/r/20251111-b4-ksft-error-on-fail-v3-1-0951a51135f6@google.com
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
The printf statement attempts to print the DMA direction string using
the syntax 'dir[directions]', which is an invalid array access. The
variable 'dir' is an integer, and 'directions' is a char pointer array.
This incorrect syntax should be 'directions[dir]', using 'dir' as the
index into the 'directions' array. Fix this by correcting the array
access from 'dir[directions]' to 'directions[dir]'.
Link: https://lore.kernel.org/r/20251104025234.2363-1-zhangchujun@cmss.chinamobile.com
Signed-off-by: Zhang Chujun <zhangchujun@cmss.chinamobile.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Rust 1.92.0 warns when building the documentation that [`PinnedDrop`] is
an invalid reference. This is correct and it's weird that it didn't warn
before, so fix the link.
[ The reason is that it is hidden -- I had asked about that in the
upstream PR that changed the behavior because I wasn't sure it was
intentional (and thus whether we needed to fix this and other cases):
https://github.com/rust-lang/rust/pull/147153#issuecomment-3395484636
It turns out it was not, and it has been fixed for 1.92.0's upcoming
release thanks to Guillaume and León. So we do not strictly need
this patch and the other changes anymore:
https://github.com/rust-lang/rust/pull/147809
However, checking hidden/private items or, even better, a runtime
toggle to be able to see those on the fly, is something that I think
would be quite nice so I have had it in our usual lists for a while.
Guillaume is open to the idea and perhaps experimenting with an
implementation on our side first -- he asked me to open issues
upstream:
https://github.com/rust-lang/rust/issues/149105https://github.com/rust-lang/rust/issues/149106
- Miguel ]
Signed-off-by: Benno Lossin <lossin@kernel.org>
Link: https://patch.msgid.link/20251016211740.653599-1-lossin@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
We need to directly allocate the cred's LSM state for the initial task
when we initialize the LSM framework. Unfortunately, this results in a
RCU related type mismatch, use the unrcu_pointer() macro to handle this
a bit more elegantly.
The explicit type casting still remains as we need to work around the
constification of current->cred in this particular case.
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Allowing irq_work to be scheduled while trying to suspend has shown
to cause problems as some architectures interpret the pending
interrupts as a reason to not suspend. This became a problem for
printk() with the introduction of NBCON consoles. With every
printk() call, NBCON console printing kthreads are woken by queueing
irq_work. This means that irq_work continues to be queued due to
printk() calls late in the suspend procedure.
Avoid this problem by preventing printk() from queueing irq_work
once console suspending has begun. This applies to triggering NBCON
and legacy deferred printing as well as klogd waiters.
Since triggering of NBCON threaded printing relies on irq_work, the
pr_flush() within console_suspend_all() is used to perform the final
flushing before suspending consoles and blocking irq_work queueing.
NBCON consoles that are not suspended (due to the usage of the
"no_console_suspend" boot argument) transition to atomic flushing.
Introduce a new global variable @console_irqwork_blocked to flag
when irq_work queueing is to be avoided. The flag is used by
printk_get_console_flush_type() to avoid allowing deferred printing
and switch NBCON consoles to atomic flushing. It is also used by
vprintk_emit() to avoid klogd waking.
Add WARN_ON_ONCE(console_irqwork_blocked) to the irq_work queuing
functions to catch any code that attempts to queue printk irq_work
during the suspending/resuming procedure.
Cc: stable@vger.kernel.org # 6.13.x because no drivers in 6.12.x
Fixes: 6b93bb41f6 ("printk: Add non-BKL (nbcon) console basic infrastructure")
Closes: https://lore.kernel.org/lkml/DB9PR04MB8429E7DDF2D93C2695DE401D92C4A@DB9PR04MB8429.eurprd04.prod.outlook.com
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Sherry Sun <sherry.sun@nxp.com>
Link: https://patch.msgid.link/20251113160351.113031-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
Currently, when in-kernel module decompression (CONFIG_MODULE_DECOMPRESS)
is enabled, IMA has no way to verify the appended module signature as it
can't decompress the module.
Define a new kernel_read_file_id enumerate READING_MODULE_COMPRESSED so
IMA can calculate the compressed kernel module data hash on
READING_MODULE_COMPRESSED and defer appraising/measuring it until on
READING_MODULE when the module has been decompressed.
Before enabling in-kernel module decompression, a kernel module in
initramfs can still be loaded with ima_policy=secure_boot. So adjust the
kernel module rule in secure_boot policy to allow either an IMA
signature OR an appended signature i.e. to use
"appraise func=MODULE_CHECK appraise_type=imasig|modsig".
Reported-by: Karel Srot <ksrot@redhat.com>
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Suggested-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Coiby Xu <coxu@redhat.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Add the `Bounded` integer wrapper type, which restricts the number of
bits allowed to represent of value.
This is useful to e.g. enforce guarantees when working with bitfields
that have an arbitrary number of bits.
Alongside this type, provide many `From` and `TryFrom` implementations
are to reduce friction when using with regular integer types. Proxy
implementations of common integer operations are also provided.
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20251108-bounded_ints-v4-2-c9342ac7ebd1@nvidia.com
[ Added intra-doc link. Fixed a few other nits. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
scripts/lib was always a bit of an awkward place for Python libraries; give
them a proper home under tools/lib/python. Put the modules from
tools/docs/lib there for good measure.
The second patch ties them into a single package namespace. It would be
more aesthetically pleasing to add a kernel layer, so we could say:
from kernel.kdoc import kdoc_parser
...and have the kernel-specific stuff clearly marked, but that means adding
an empty directory in the hierarchy, which isn't as pleasing.
There are some other "Python library" directories hidden in the kernel
tree; we may eventually want to encourage them to move as well.
Now that we have tools/lib/python for our Python modules, turn them into
proper packages with a single namespace so that everything can just use
tools/lib/python in sys.path. No functional change.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251110220430.726665-3-corbet@lwn.net>
"scripts/lib" was always a bit of an awkward place for Python modules. We
already have tools/lib; create a tools/lib/python, move the libraries
there, and update the users accordingly.
While at it, move the contents of tools/docs/lib. Rather than make another
directory, just put these documentation-oriented modules under "kdoc".
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251110220430.726665-2-corbet@lwn.net>
Move the kernel build options abbreviations to the .txt file so that
they are together instead of one having to go hunt them in the .rst
file.
Tweak the formatting so that the inclusion of kernel-parameters.txt
still keeps the whole thing somewhat presentable in the html output too.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251112114641.8230-1-bp@kernel.org>
The free_kick_syncs_rcu() rcu-callback only invoke kvfree() to
release per-cpu ksyncs object, this can use kvfree_rcu() replace
call_rcu() to release per-cpu ksyncs object in the free_kick_syncs().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
0day CI reported several broken references under
Documentation/translations/zh_CN/scsi/scsi-parameters.rst.
These files do not exist under the translations directory.
The correct references are the original English documents under
Documentation/scsi/.
This patch updates all broken paths accordingly.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511130315.WOiKJQTu-lkp@intel.com/
Signed-off-by: ke zijie <kezijie@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Sometimes we may need to iterate over, or find an element in a read
only (or read mostly) red-black tree, and in that case we don't need a
mutable reference to the tree, which we'll however have to take to be
able to use the current (mutable) cursor implementation.
This patch adds a simple immutable cursor implementation to RBTree,
which enables us to use an immutable tree reference. The existing
(fully featured) cursor implementation is renamed to CursorMut,
while retaining its functionality.
The only existing user of the [mutable] cursor for RBTrees (binder) is
updated to match the changes.
Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.se>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20251014123339.2492210-1-vitaly.wool@konsulko.se
[ Applied `rustfmt`. Added intra-doc link. Fixed unclosed example.
Fixed docs description. Fixed typo and other formatting nits.
- Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
The current kernel doesn't handle unpopulated cgroups any special
regarding reclaim protection. Furthermore, this wasn't a case even when
this was introduced in
bf8d5d52ff ("memcg: introduce memory.min")
Drop the incorrect documentation. (Implementation taking into account
the inner-node constraint may be added later.)
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The protection target is necessary to understand how effective reclaim
protection applies in the hierarchy.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the buddy lockup detector, smp_processor_id() returns the detecting CPU,
not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the
locked CPU number from watchdog_hardlockup_check() as a parameter instead.
Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and
scx_rcu_cpu_stall() documenting their return value semantics.
Suggested-by: Doug Anderson <dianders@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The generic test for CC_CAN_LINK assumes that all architectures use -m32
and -m64 to switch between 32-bit and 64-bit compilation. This is overly
simplistic. Architectures may use other flags (-mabi, -m31, etc.) or may
also require byte order handling (-mlittle-endian, -EL). Expressing all
of the different possibilities will be very complicated and brittle.
Instead allow architectures to supply their own logic which will be
easy to understand and evolve.
Both the boolean ARCH_HAS_CC_CAN_LINK and the string ARCH_USERFLAGS need
to be implemented as kconfig does not allow the reuse of string options.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20251114-kbuild-userprogs-bits-v3-3-4dee0d74d439@linutronix.de
Signed-off-by: Nicolas Schier <nsc@kernel.org>
It is possible that the kernel toolchain generates warnings when used
together with the system toolchain. This happens for example when the
older kernel toolchain does not handle new versions of sframe debug
information. While these warnings where ignored during the evaluation
of CC_CAN_LINK, together with CONFIG_WERROR the actual userprog build
will later fail.
Example warning:
.../x86_64-linux/13.2.0/../../../../x86_64-linux/bin/ld:
error in /lib/../lib64/crt1.o(.sframe); no .sframe will be created
collect2: error: ld returned 1 exit status
Make sure that the very simple example program does not generate
warnings already to avoid breaking the userprog compilations.
Fixes: ec4a3992bc ("kbuild: respect CONFIG_WERROR for linker and assembler")
Fixes: 3f0ff4cc6f ("kbuild: respect CONFIG_WERROR for userprogs")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20251114-kbuild-userprogs-bits-v3-1-4dee0d74d439@linutronix.de
Signed-off-by: Nicolas Schier <nsc@kernel.org>
Conform the layout, informational and status messages to KTAP. No
functional change is intended other than the layout of output messages.
Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
Suggested-by: Sebastian Chlad <sebastian.chlad@suse.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The save_iaa_wq() function unconditionally returns 0, even when an error
is encountered. This prevents the error code from being propagated to the
caller.
Fix this by returning the 'ret' variable, which holds the actual status
of the operations within the function.
Fixes: ea7a5cbb43 ("crypto: iaa - Add Intel IAA Compression Accelerator crypto driver core")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Use max() instead of max_t() since zstd_cstream_workspace_bound() and
zstd_dstream_workspace_bound() already return size_t and casting the
values is unnecessary.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add the __counted_by() compiler attribute to the flexible array member
'wksp' to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Use struct_size(), which provides additional compile-time checks for
structures with flexible array members (e.g., __must_be_array()), for
the allocation size for a new 'zstd_ctx' while we're at it.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue()
to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Acked-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Driver's probe function matches against driver's of_device_id table,
where each entry has non-NULL match data, so of_match_node() can be
simplified with of_device_get_match_data().
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Driver's probe function matches against driver's of_device_id table,
where each entry has non-NULL match data, so of_match_node() can be
simplified with of_device_get_match_data().
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
sp_device->dev_vdata points to only const data (see 'static const struct
sp_dev_vdata dev_vdata'), so can be made pointer to const for code
safety.
Update also sp_get_acpi_version() function which returns this pointer to
'pointer to const' for code readability, even though it is not needed.
On the other hand, do not touch similar function sp_get_of_version()
because it will be immediately removed in next patches.
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Driver's probe function matches against driver's of_device_id table, so
of_match_node() can be simplified with of_device_get_match_data().
This requires changing the enum used in the driver match data entries to
non-zero, to be able to recognize error case of
of_device_get_match_data().
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Driver's probe function matches against driver's of_device_id table,
where each entry has non-NULL match data, so of_match_node() can be
simplified with of_device_get_match_data().
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convention is to place MODULE_DEVICE_TABLE() immediately after
definition of the affected table, so one can easily spot missing such.
There is on the other hand no benefits of putting MODULE_DEVICE_TABLE()
far away.
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request alloc_workqueue()
to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The function already initialized the ivsize variable
at the point of declaration, let's use it instead of
calling crypto_skcipher_ivsize() extra couple times.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 57d67c6e82 ("crypto: rockchip - rework by using crypto_engine")
Signed-off-by: Karina Yankevich <k.yankevich@omp.ru>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Since the beginning, SPHINXDIRS was meant to be used by any
subdirectory inside Documentation/ that contains a file named
index.rst on it. The typical usecase for SPHINXDIRS is help
building subsystem-specific documentation, without needing to
wait for the entire building (with can take 3 minutes with
Sphinx 8.x and above, and a lot more with older versions).
Yet, the documentation for such feature was written back in
2016, where almost all index.rst files were at the first
level (Documentation/*/index.rst).
Update the documentation to reflect the way it works.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <683469813350214da122c258063dd71803ff700b.1763031632.git.mchehab+huawei@kernel.org>
Add a call to should_fail_ex that forces mempool to actually allocate
from the pool to stress the mempool implementation when enabled through
debugfs. By default should_fail{,_ex} prints a very verbose stack trace
that clutters the kernel log, slows down execution and triggers the
kernel bug detection in xfstests. Pass FAULT_NOWARN and print a
single-line message notating the caller instead so that full tests
can be run with fault injection.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Link: https://patch.msgid.link/20251113084022.1255121-5-hch@lst.de
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In the future, we will separate slab, folio and page from each other
and calling virt_to_folio() on an address allocated from slab will
return NULL. Delay the conversion from struct page to struct slab
until we know we're not dealing with a large kmalloc allocation.
There's a minor win for large kmalloc allocations as we avoid the
compound_head() hidden in virt_to_folio().
This deprecates calling ksize() on memory allocated by alloc_pages().
Today it becomes a warning and support will be removed entirely in
the future.
Introduce large_kmalloc_size() to abstract how we represent the size
of a large kmalloc allocation. For now, this is the same as
page_size(), but it will change with separately allocated memdescs.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251113000932.1589073-3-willy@infradead.org
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In order to separate slabs from folios, we need to convert from any page
in a slab to the slab directly without going through a page to folio
conversion first.
Up to this point, page_slab() has followed the example of other memdesc
converters (page_folio(), page_ptdesc() etc) and just cast the pointer
to the requested type, regardless of whether the pointer is actually a
pointer to the correct type or not.
That changes with this commit; we check that the page actually belongs
to a slab and return NULL if it does not. Other memdesc converters will
adopt this convention in future.
kfence was the only user of page_slab(), so adjust it to the new way
of working. It will need to be touched again when we separate slab
from page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: kasan-dev@googlegroups.com
Link: https://patch.msgid.link/20251113000932.1589073-2-willy@infradead.org
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Tested-by: Marco Elver <elver@google.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In barn_shrink(), use LIST_HEAD() to declare and initialize the
list_head in one step instead of using INIT_LIST_HEAD() separately.
No functional change.
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In functions such as [__]slab_update_freelist() and
__slab_update_freelist_fast/slow() we pass old and new freelist and
counters as 4 separate parameters. The underlying
__update_freelist_fast() then constructs struct freelist_counters
variables for passing the full freelist+counter combinations to cmpxchg
double.
In most cases we actually start with struct freelist_counters variables,
but then pass the individual fields, only to construct new struct
freelist_counters variables. While it's all inlined and thus should be
efficient, we can simplify this code.
Thus replace the 4 parameters for individual fields with two pointers to
struct freelist_counters wherever applicable. __update_freelist_fast()
can then pass them directly to try_cmpxchg_freelist().
The code is also more obvious as the pattern becomes unified such that
we set up "old" and "new" struct freelist_counters variables upfront as
we fully need them to be, and simply call [__]slab_update_freelist() on
them. Previously some of the "new" values would be hidden among the
many parameters and thus make it harder to figure out what the code
does.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Commit 5ebec443fb ("sched_ext: Exit dispatch and move operations
immediately when aborting") replaced the breather mechanism with the
scx_aborting flag.
Update comments removing references to the breather mechanism to avoid
confusion.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well
in most cases, there is a failure mode where a BPF scheduler can skew task
placement severely before triggering bypass in highly over-saturated systems.
If most tasks end up concentrated on a few CPUs, those CPUs can accumulate
queues that are too long to drain in a reasonable time, leading to RCU stalls
and hung tasks.
Implement a simple timer-based load balancer that redistributes tasks across
CPUs within each NUMA node. The balancer runs periodically (default 500ms,
tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded
CPUs to underloaded ones.
When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks
to avoid dropping and reacquiring the donor DSQ lock on each iteration, as
donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag
and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load
balancer timer function reads scx_bypass_depth locklessly to check whether
bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair
with the READ_ONCE() in the timer function.
This has been tested on a 192 CPU dual socket AMD EPYC machine with ~20k
runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost
all tasks end up on CPU0 creating severe imbalance. Without the load balancer,
disabling the scheduler can lead to RCU stalls and hung tasks, taking a very
long time to complete. With the load balancer, disable completes in about a
second.
The load balancing operation can be monitored using the sched_ext_bypass_lb
tracepoint and disabled by setting bypass_lb_intv_us to 0.
v2: Lock both rq and DSQ in bypass_lb_cpu() and use dispatch_dequeue_locked()
to prevent races with dispatch_dequeue() (Andrea Righi).
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed_by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
move_task_between_dsqs() contains open-coded abbreviated dequeue logic when
moving tasks between non-local DSQs. Factor this out into
dispatch_dequeue_locked() which can be used when both the task's rq and dsq
locks are already held. Add lockdep assertions to both dispatch_dequeue() and
the new helper to verify locking requirements.
This prepares for the load balancer which will need the same abbreviated
dequeue pattern.
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR
macro in preparation for additional users.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add scx_cpu0, a simple scheduler that queues all tasks to a single DSQ and
only dispatches them from CPU0 in FIFO order. This is useful for testing bypass
behavior when many tasks are concentrated on a single CPU. If the load balancer
doesn't work, bypass mode can trigger task hangs or RCU stalls as the queue is
long and there's only one CPU working on it.
v2: Check whether task is on CPU0 at enqueue using scx_bpf_task_cpu() instead
of nr_cpus_allowed (Andrea Righi).
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A poorly behaving BPF scheduler can trigger hard lockup. For example, on a
large system with many tasks pinned to different subsets of CPUs, if the BPF
scheduler puts all tasks in a single DSQ and lets all CPUs at it, the DSQ lock
can be contended to the point where hardlockup triggers. Unfortunately,
hardlockup can be the first signal out of such situations, thus requiring
hardlockup handling.
Hook scx_hardlockup() into the hardlockup detector to try kicking out the
current scheduler in an attempt to recover the system to a good state. The
handling strategy can delay watchdog taking its own action by one polling
period; however, given that the only remediation for hardlockup is crash, this
is likely an acceptable trade-off.
v2: Add missing dummy scx_hardlockup() definition for
!CONFIG_SCHED_CLASS_EXT (kernel test bot).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
handle_lockup() currently calls scx_verror() but ignores its return value,
always returning true when the scheduler is enabled. Make it capture and return
the result from scx_verror(). This prepares for hardlockup handling.
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_rcu_cpu_stall() and scx_softlockup() share the same pattern: check if the
scheduler is enabled under RCU read lock and trigger an error if so. Extract
the common pattern into handle_lockup() helper. Add scx_verror() macro and use
guard(rcu)().
This simplifies both handlers, reduces code duplication, and prepares for
hardlockup handling.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Make scx_exit() and scx_vexit() return bool indicating whether the calling
thread successfully claimed the exit. This will be used by the abort mechanism
added in a later patch.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
62dcbab8b0 ("sched_ext: Avoid live-locking bypass mode switching") introduced
the breather mechanism to inject delays during bypass mode switching. It
maintains operation semantics unchanged while reducing lock contention to avoid
live-locks on large NUMA systems.
However, the breather only activates when exiting the scheduler, so there's no
need to maintain operation semantics. Simplify by exiting dispatch and move
operations immediately when scx_aborting is set. In consume_dispatch_q(), break
out of the task iteration loop. In scx_dsq_move(), return early before
acquiring locks.
This also fixes cases the breather mechanism cannot handle. When a large system
has many runnable threads affinitized to different CPU subsets and the BPF
scheduler places them all into a single DSQ, many CPUs can scan the DSQ
concurrently for tasks they can run. This can cause DSQ and RQ locks to be held
for extended periods, leading to various failure modes. The breather cannot
solve this because once in the consume loop, there's no exit. The new mechanism
fixes this by exiting the loop immediately.
The bypass DSQ is exempted to ensure the bypass mechanism itself can make
progress.
v2: Use READ_ONCE() when reading scx_aborting (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Andrea Righi <arighi@nvidia.com>
Cc: Emil Tsalapatis <etsal@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The breather mechanism was introduced in 62dcbab8b0 ("sched_ext: Avoid
live-locking bypass mode switching") and e32c260195 ("sched_ext: Enable the
ops breather and eject BPF scheduler on softlockup") to prevent live-locks by
injecting delays when CPUs are trapped in dispatch paths.
Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup
(unsigned long) with separate increment/decrement and cleanup operations. The
breather is only activated when aborting, so tie it directly to the exit
mechanism. Replace both variables with scx_aborting flag set when exit is
claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to
consolidate exit_kind claiming and breather enablement. This eliminates
scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass().
The breather mechanism will be replaced by a different abort mechanism in a
future patch. This simplification prepares for that change.
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Bypass mode routes tasks through fallback dispatch queues. Originally a single
global DSQ, b7b3b2dbae ("sched_ext: Split the global DSQ per NUMA node")
changed this to per-node DSQs to resolve NUMA-related livelocks.
Dan Schatzberg found per-node DSQs can still livelock when many threads are
pinned to different small CPU subsets: each CPU must scan many incompatible
tasks to find runnable ones, causing severe contention with high CPU counts.
Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default
idle CPU selection and direct dispatch handle most cases well.
This introduces a failure mode when tasks concentrate on one CPU in
over-saturated systems. If the BPF scheduler severely skews placement before
triggering bypass, that CPU's queue may be too long to drain, causing RCU
stalls. A load balancer in a future patch will address this. The bypass DSQ is
separate from local DSQ to enable load balancing: local DSQs use rq locks,
preventing efficient scanning and transfer across CPUs, especially problematic
when systems are already contended.
v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi).
Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The local and global DSQ enqueue paths in do_enqueue_task() share the same
slice refill logic. Factor out the common code into a shared enqueue label.
This makes adding new enqueue cases easier. No functional changes.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There have been reported cases of bypass mode not making forward progress fast
enough. The 20ms default slice is unnecessarily long for bypass mode where the
primary goal is ensuring all tasks can make forward progress.
Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically
switch to it when entering bypass mode. Also make the bypass slice value
tunable through the slice_bypass_us module parameter (adjustable between 100us
and 100ms) to make it easier to test whether slice durations are a factor in
problem cases.
v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan).
v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea).
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Replace set_access(), set_majmin(), and type_to_char() with new helpers
seq_putaccess(), seq_puttype(), and seq_putversion() that write directly
to 'seq_file'.
Simplify devcgroup_seq_show() by hard-coding "a *:* rwm", and use the
new seq_put* helper functions to list the exceptions otherwise.
This allows us to remove the intermediate string buffers while
maintaining the same functionality, including wildcard handling.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Previously, update_cpumasks_hier() used need_rebuild_sched_domains to
decide whether to invoke rebuild_sched_domains_locked(). Now that
rebuild_sched_domains_locked() only sets force_rebuild, the flag is
redundant. Hence, remove it.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The remote_children list is used to track all remote partitions attached
to a cpuset. However, it serves no other purpose. Using a boolean flag to
indicate whether a cpuset is a remote partition is a more direct approach,
making remote_children unnecessary.
This patch replaces the list with a remote_partition flag in the cpuset
structure and removes remote_children entirely.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is no need to jump to the 'done' label upon failure, as no cleanup
is required. Return the error code directly instead.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
To compile cgroup.c with PREEMPT_RT=y include header which declares
struct irq_work.
Fixes: 9311e6c29b ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
Signed-off-by: Bert Karwatzki <spasswolf@web.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
C macro references are erroneously written using :c:macro:: (note the
double colon). This causes the references to be outputted as combination
of verbatim roles and italicized names instead.
Correct the syntax.
Fixes: dce5488896 ("Documentation: Add TI TPS6594 PFSM")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251104041812.31402-4-bagasdotme@gmail.com>
C macro references are erroneously written using :c:macro:: (note the
double colon). This causes the references to be outputted as combination
of verbatim roles and italicized names instead.
Correct the syntax.
Fixes: 5f67eef6df ("misc: mrvl-cn10k-dpi: add Octeon CN10K DPI administrative driver")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251104041812.31402-3-bagasdotme@gmail.com>
Payload kinds list text is indented at the first text column, rather
than aligned to the list number. As an effect, the third item becomes
sublist of second item's third sublist item (TASKTYPE_TYPE_STATS).
Reindent the list text.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251104130751.22755-1-bagasdotme@gmail.com>
Commit be9d0411f1 ("parport-lowlevel.txt: standardize document
format") reSTify parport interface documentation but forgets to properly
mark function listing code blocks up. As such, these are rendered as
long-running normal paragraph instead.
Fix them by adding missing separator between the code block marker and
the listing.
Fixes: be9d0411f1 ("parport-lowlevel.txt: standardize document format")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251105124947.45048-1-bagasdotme@gmail.com>
In several functions we declare local struct slab variables so we can
work with the freelist and counters fields (including the sub-counters
that are in the union) comfortably.
With struct freelist_counters containing the full counters definition,
we can now reduce the local variables to that type as we don't need the
other fields in struct slab.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In struct slab we currently have freelist and counters pair, where
counters itself is a union of unsigned long with a sub-struct of
several smaller fields. Then for the usage with double cmpxchg we have
freelist_aba_t that duplicates the definition of the freelist+counters
with implicitly the same layout as the full definition in struct slab.
Thanks to -fms-extension we can now move the full counters definition to
freelist_aba_t (while changing it to struct freelist_counters as a
typedef is unnecessary and discouraged) and replace the relevant part in
struct slab to an unnamed reference to it.
The immediate benefit is the removal of duplication and no longer
relying on the same layout implicitly. It also allows further cleanups
thanks to having the full definition of counters in struct
freelist_counters.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
In kmem_cache_cpu we currently have a union of the freelist+tid pair
with freelist_aba_t, relying implicitly on the type compatibility with the
freelist+counters pair used in freelist_aba_t.
To allow further changes to freelist_aba_t, we can instead define a
separate struct freelist_tid (instead of a typedef, per the coding
style) for kmem_cache_cpu, as that affects only a single helper
__update_cpu_freelist_fast().
We can add the resulting struct freelist_tid to kmem_cache_cpu as
unnamed field thanks to -fms-extensions, so that freelist and tid fields
can still be accessed directly.
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Shared branch between Kbuild and other trees for enabling '-fms-extensions' for 6.19
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The decision whether some more space is needed is tricky in the printk
ring buffer code:
1. The given lpos values might overflow. A subtraction must be used
instead of a simple "lower than" check.
2. Another CPU might reuse the space in the mean time. It can be
detected when the subtraction is bigger than DATA_SIZE(data_ring).
3. There is exactly enough space when the result of the subtraction
is zero. But more space is needed when the result is exactly
DATA_SIZE(data_ring).
Add a helper function to make sure that the check is done correctly
in all situations. Also it helps to make the code consistent and
better documented.
Suggested-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/87tsz7iea2.fsf@jogness.linutronix.de
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251107194720.1231457-3-pmladek@suse.com
[pmladek@suse.com: Updated wording as suggested by John]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Commit 8817270042 ("docs/memory-barriers.txt: Add wait_event_cmd()
and wait_event_exclusive_cmd()") added two APIs without taking care
of the list order. Sort the list for readability.
While there, make it clear that this is incomplete by saying
"for example".
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The gen_compile_commands.py script currently only creates entries for the
primary source files found in .cmd files, but some kernel source files
text-include others (i.e. kernel/sched/build_policy.c).
This prevents tools like clangd from working properly on text-included c
files, such as kernel/sched/ext.c because the generated compile_commands.json
does not have entries for them.
Extend process_line() to detect when a source file includes .c files, and
generate additional compile_commands.json entries for them. For included c
files, use the same compile flags as their parent and add their parents headers.
This enables lsp tools like clangd to work properly on files like
kernel/sched/ext.c
Signed-off-by: Pat Somaru <patso@likewhatevs.io>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Justin Stitt <justinstitt@google.com>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20251008004615.2690081-1-patso@likewhatevs.io
Signed-off-by: Nicolas Schier <nsc@kernel.org>
Update kbuild entry by my kernel.org address, and change my mailmap
entry appropriately.
Acked-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nicolas Schier <nsc@kernel.org>
This patch adds an example of how to set KBUILD_BUILD_TIMESTAMP to a
specific date. Also, note that the provided timestamp is used for
initramfs mtime fields, which are 32-bit and thus limited to dates
between the Unix epoch and 2106-02-07 06:28:15 UTC. Dates outside this
range will cause errors.
Suggested-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Gang Yan <yangang@kylinos.cn>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20251017021209.6586-1-gang.yan@linux.dev
Signed-off-by: Nicolas Schier <nsc@kernel.org>
When building out-of-tree modules with CONFIG_MODULE_SIG_FORCE=y,
module signing fails because the private key path uses $(srctree)
while the public key path uses $(objtree). Since signing keys are
generated in the build directory during kernel compilation, both
paths should use $(objtree) for consistency.
This causes SSL errors like:
SSL error:02001002:system library:fopen:No such file or directory
sign-file: /kernel-src/certs/signing_key.pem
The issue occurs because:
- sig-key uses: $(srctree)/certs/signing_key.pem (source tree)
- cmd_sign uses: $(objtree)/certs/signing_key.x509 (build tree)
But both keys are generated in $(objtree) during the build.
This complements commit 25ff08aa43 ("kbuild: Fix signing issue for
external modules") which fixed the scripts path and public key path,
but missed the private key path inconsistency.
Fixes out-of-tree module signing for configurations with separate
source and build directories (e.g., O=/kernel-out).
Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nicolas Schier <nsc@kernel.org>
Link: https://patch.msgid.link/20251015163452.3754286-1-mike.malyshev@gmail.com
Signed-off-by: Nicolas Schier <nsc@kernel.org>
The newly introduced -fms-extensions compiler flag allows defining
struct fs_path in such a way that inline_buf becomes a proper array
with a size known to the compiler.
This also makes the problem fixed by commit 8aec9dbf2d ("btrfs: send:
fix -Wflex-array-member-not-at-end warning in struct send_ctx") go away.
Whether cur_inode_path should be put back to its original place in
struct send_ctx I don't know, but at least the comment no longer
applies.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://patch.msgid.link/20251020142228.1819871-3-linux@rasmusvillemoes.dk
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nicolas Schier <nsc@kernel.org>
Shared branch between Kbuild and other trees for enabling '-fms-extensions' for 6.19
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Nicolas Schier <nsc@kernel.org>
Whenever there's audit context, __audit_inode_child() gets called
numerous times, which can lead to high latency in scenarios that
create too many sysfs/debugfs entries at once, for instance, upon
device_add_disk() invocation.
# uname -r
6.18.0-rc2+
# auditctl -a always,exit -F path=/tmp -k foo
# time insmod loop max_loop=1000
real 0m46.676s
user 0m0.000s
sys 0m46.405s
# perf record -a insmod loop max_loop=1000
# perf report --stdio |grep __audit_inode_child
32.73% insmod [kernel.kallsyms] [k] __audit_inode_child
__audit_inode_child() searches for both the parent and the child
in two different loops that iterate over the same list. This
process can be optimized by merging these into a single loop,
without changing the function behavior or affecting the code's
readability.
This patch merges the two loops that walk through the list
context->names_list into a single loop. This optimization resulted
in around 51% performance enhancement for the benchmark.
# uname -r
6.18.0-rc2-enhancedv3+
# auditctl -a always,exit -F path=/tmp -k foo
# time insmod loop max_loop=1000
real 0m22.899s
user 0m0.001s
sys 0m22.652s
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Replace kmalloc+memset by kzalloc for better readability and simplicity.
This addresses the warning below:
WARNING: kzalloc should be used for data, instead of kmalloc/memset
Signed-off-by: Gongwei Li <ligongwei@kylinos.cn>
[PM: subject and description tweaks]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Some kernel configurations prohibit invoking local_bh_enable() while
interrupts are disabled. However, refscale disables interrupts to reduce
OS noise during the tests, which results in splats. This commit therefore
adds an ->enable_irqs flag to the ref_scale_ops structure, and refrains
from disabling interrupts when that flag is set. This flag is set for
the "bh" and "incpercpubh" scale_type module-parameter values.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds refscale readers based on READ_ONCE() and WRITE_ONCE()
that are unprotected (can lose counts, "refscale.scale_type=incpercpu"),
preempt-disabled ("refscale.scale_type=incpercpupreempt"),
bh-disabled ("refscale.scale_type=incpercpubh"), and irq-disabled
("refscale.scale_type=incpercpuirqsave"). On my x86 laptop, these are
about 4.3ns, 3.8ns, and 7.3ns per pair, respectively.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds refscale readers based on this_cpu_inc() and
this_cpu_inc() ("refscale.scale_type=percpuinc"). On my x86 laptop,
these are about 4.5ns per pair.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds refscale readers based on preempt_disable() and
preempt_enable() ("refscale.scale_type=preempt"). On my x86 laptop, these
are about 2.8ns.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds refscale readers based on local_bh_disable() and
local_bh_enable() ("refscale.scale_type=bh"). On my x86 laptop, these
are about 4.9ns.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds refscale readers based on local_irq_disable() and
local_irq_enable() ("refscale.scale_type=irq") and on local_irq_save()
and local_irq_restore ("refscale.scale_type=irqsave"). On my x86 laptop,
these are about 2.8ns and 7.5ns per enable/disable pair, respectively.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
There may be console drivers that have not yet figured out a way
to implement safe atomic printing (->write_atomic() callback).
These drivers could choose to only implement threaded printing
(->write_thread() callback), but then it is guaranteed that _no_
output will be printed during panic. Not even attempted.
As a result, developers may be tempted to implement unsafe
->write_atomic() callbacks and/or implement some sort of custom
deferred printing trickery to try to make it work. This goes
against the principle intention of the nbcon API as well as
endangers other nbcon drivers that are doing things correctly
(safely).
As a compromise, allow nbcon drivers to implement unsafe
->write_atomic() callbacks by providing a new console flag
CON_NBCON_ATOMIC_UNSAFE. When specified, the ->write_atomic()
callback for that console will _only_ be called during the
final "hope and pray" flush attempt at the end of a panic:
nbcon_atomic_flush_unsafe().
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/lkml/b2qps3uywhmjaym4mht2wpxul4yqtuuayeoq4iv4k3zf5wdgh3@tocu6c7mj4lt
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/all/swdpckuwwlv3uiessmtnf2jwlx3jusw6u7fpk5iggqo4t2vdws@7rpjso4gr7qp/ [1]
Link: https://lore.kernel.org/all/20251103-fix_netpoll_aa-v4-1-4cfecdf6da7c@debian.org/ [2]
Link: https://patch.msgid.link/20251027161212.334219-2-john.ogness@linutronix.de
[pmladek@suse.com: Fix build with rework/nbcon-in-kdb branch.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
This commit loosens the kvm.sh script's regular expressions to permit
negative-valued Kconfig options, for example:
--kconfig CONFIG_CMDLINE_LOG_WRAP_IDEAL_LEN=-1
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds the SRCU_READ_FLAVOR_FAST_UPDOWN=0x8 macro
and adjusts rcutorture to make use of it. In this commit, both
SRCU_READ_FLAVOR_FAST=0x4 and the new SRCU_READ_FLAVOR_FAST_UPDOWN
test SRCU-fast. When the SRCU-fast-updown is added, the new
SRCU_READ_FLAVOR_FAST_UPDOWN macro will test it when passed to the
rcutorture.reader_flavor module parameter.
The old SRCU_READ_FLAVOR_FAST macro's value changed from 0x8 to 0x4.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The rcu_lockdep_current_cpu_online(), rcu_read_lock_sched_held(),
rcu_read_lock_held(), rcu_read_lock_bh_held(), rcu_read_lock_any_held()
are used by tracing-related code paths, so putting traces on them is
unlikely to make anyone happy. This commit therefore marks them all
"notrace".
Reported-by: Leon Hwang <leon.hwang@linux.dev>
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
We want to expand usage of sheaves to all non-boot caches, including
kmalloc caches. Since sheaves themselves are also allocated by
kmalloc(), we need to prevent excessive or infinite recursion -
depending on sheaf size, the sheaf can be allocated from smaller, same
or larger kmalloc size bucket, there's no particular constraint.
This is similar to allocating the objext arrays so let's just reuse the
existing mechanisms for those. __GFP_NO_OBJ_EXT in alloc_empty_sheaf()
will prevent a nested kmalloc() from allocating a sheaf itself - it will
either have sheaves already, or fallback to a non-sheaf-cached
allocation (so bootstrap of sheaves in a kmalloc cache that allocates
sheaves from its own size bucket is possible). Additionally, reuse
OBJCGS_CLEAR_MASK to clear unwanted gfp flags from the nested
allocation.
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-5-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
The function is tricky and many of its tests are hard to understand. Try
to improve that by using more descriptively named variables and added
comments.
- rename 'prior' to 'old_head' to match the head and tail parameters
- introduce a 'bool was_full' to make it more obvious what we are
testing instead of the !prior and prior tests
- add or improve comments in various places to explain what we're doing
Also replace kmem_cache_has_cpu_partial() tests with
IS_ENABLED(CONFIG_SLUB_CPU_PARTIAL) which are compile-time constants.
We can do that because the kmem_cache_debug(s) case is handled upfront
via free_to_partial_list().
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-1-b8218e1ac7ef@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
CONFIG_SLUB_TINY minimizes the SLUB's memory overhead in multiple ways,
mainly by avoiding percpu caching of slabs and objects. It also reduces
code size by replacing some code paths with simplified ones through
ifdefs, but the benefits of that are smaller and would complicate the
upcoming changes.
Thus remove these code paths and associated ifdefs and simplify the code
base.
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-4-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
When a pfmemalloc allocation actually dips into reserves, the slab is
marked accordingly and non-pfmemalloc allocations should not be allowed
to allocate from it. The sheaves percpu caching currently doesn't follow
this rule, so implement it before we expand sheaves usage to all caches.
Make sure objects from pfmemalloc slabs don't end up in percpu sheaves.
When freeing, skip sheaves when freeing an object from pfmemalloc slab.
When refilling sheaves, use __GFP_NOMEMALLOC to override any pfmemalloc
context - the allocation will fallback to regular slab allocations when
sheaves are depleted and can't be refilled because of the override.
For kfree_rcu(), detect pfmemalloc slabs after processing the rcu_sheaf
after the grace period in __rcu_free_sheaf_prepare() and simply flush
it if any object is from pfmemalloc slabs.
For prefilled sheaves, try to refill them first with __GFP_NOMEMALLOC
and if it fails, retry without __GFP_NOMEMALLOC but then mark the sheaf
pfmemalloc, which makes it flushed back to slabs when returned.
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-3-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
SLUB's internal bulk allocation __kmem_cache_alloc_bulk() can currently
allocate some objects from KFENCE, i.e. when refilling a sheaf. It works
but it's conceptually the wrong layer, as KFENCE allocations should only
happen when objects are actually handed out from slab to its users.
Currently for sheaf-enabled caches, slab_alloc_node() can return KFENCE
object via kfence_alloc(), but also via alloc_from_pcs() when a sheaf
was refilled with KFENCE objects. Continuing like this would also
complicate the upcoming sheaf refill changes.
Thus remove KFENCE allocation from __kmem_cache_alloc_bulk() and move it
to the places that return slab objects to users. slab_alloc_node() is
already covered (see above). Add kfence_alloc() to
kmem_cache_alloc_from_sheaf() to handle KFENCE allocations from
prefilled sheafs, with a comment that the caller should not expect the
sheaf size to decrease after every allocation because of this
possibility.
For kmem_cache_alloc_bulk() implement a different strategy to handle
KFENCE upfront and rely on internal batched operations afterwards.
Assume there will be at most once KFENCE allocation per bulk allocation
and then assign its index in the array of objects randomly.
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: https://patch.msgid.link/20251105-sheaves-cleanups-v1-2-b8218e1ac7ef@suse.cz
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
cgroup_task_dead() is called from finish_task_switch() which runs with
preemption disabled and doesn't allow scheduling even on PREEMPT_RT. The
function needs to acquire css_set_lock which is a regular spinlock that can
sleep on RT kernels, leading to "sleeping function called from invalid
context" warnings.
css_set_lock is too large in scope to convert to a raw_spinlock. However,
the unlinking operations don't need to run synchronously - they just need
to complete after the task is done running.
On PREEMPT_RT, defer the work through irq_work. While the work doesn't need
to happen immediately, it can't be delayed indefinitely either as the dead
task pins the cgroup and task_struct can be pinned indefinitely. Use the
lazy version of irq_work to allow batching and lower impact while ensuring
timely completion.
v2: Use IRQ_WORK_INIT_LAZY instead of immediate irq_work and add explanation
for why the work can't be delayed indefinitely (Sebastian Andrzej Siewior).
Fixes: d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done switching out")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Link: https://lore.kernel.org/r/20251104181114.489391-1-calvin@wbinvd.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read
user space") made an update that fixed both trace_marker and
trace_marker_raw. But the small difference made to trace_marker_raw had a
blatant bug in it that any basic testing would have uncovered.
Unfortunately, the self tests have tests for trace_marker but nothing for
trace_marker_raw which allowed the bug to get upstream.
Add basic selftests to test trace_marker_raw so that this doesn't happen
again.
Link: https://lore.kernel.org/r/20251014145149.3e3c1033@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
strcpy() is deprecated; use the safer strscpy() instead.
The destination buffer is only zero-initialized for the first iteration
and since strscpy() guarantees its NUL termination anyway, remove
zero-initializing 'eng_type'.
Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Use struct_size(), which provides additional compile-time checks for
structures with flexible array members (e.g., __must_be_array()), to
calculate the allocation size for a new 'deflate_stream'.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The previous version check made it difficult to support newer major
versions (e.g., v6.0) without adding extra checks/macros. Update the
logic to only reject v5.0 and allow future versions without additional
changes.
Signed-off-by: Gaurav Kashyap <gaurav.kashyap@oss.qualcomm.com>
Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This commit removes a harmless but potentially confusing invocation of
rcutorture_one_extend() within rcu_torture_one_read(). The immediately
preceding call to rcu_torture_one_read_start() already does this cleanup,
and the other call to rcu_torture_one_read_start() already relies on this.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds "inplace" and "inplace-force" values to the kvm-again.sh
"--link" argument, which causes the run's output to be placed into the
build directory. This could be used to save build time if the machine
went down partway into a run, but it can also be used to do a large
number of builds, and run the resulting kernels concurrently even if the
builds are based on different commits. A later commit will add this
latter capability to kvm-series.sh in order to produce large speedups
for branch-checking operations.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds a kvm-series.sh script that takes a list of scenarios and
a list of commits, and then runs each scenario on all of the commits.
A given scenario is run on all the commits before advancing to the
next scenario to minimize build times. The successes and failures are
summarized at the end of the run.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
In rculist_nulls.h we can still see ordinary assignments to ->pprev and
->next of hlist_nulls.
As noted in the two patches below:
commit efd04f8a8b ("rcu: Use WRITE_ONCE() for assignments to ->next for
rculist_nulls")
commit 860c8802ac ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
hlist_nulls")
We should use WRITE_ONCE().
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
With CONFIG_CPUMASK_OFFSTACK=y, the 'bind_writers' buffer is allocated via
alloc_cpumask_var() in param_set_cpumask(). But it is not freed, when
setting the module parameter multiple times by sysfs interface or removing
module.
Below kmemleak trace is seen for this issue:
unreferenced object 0xffff888100aabff8 (size 8):
comm "bash", pid 323, jiffies 4295059233
hex dump (first 8 bytes):
07 00 00 00 00 00 00 00 ........
backtrace (crc ac50919):
__kmalloc_node_noprof+0x2e5/0x420
alloc_cpumask_var_node+0x1f/0x30
param_set_cpumask+0x26/0xb0 [locktorture]
param_attr_store+0x93/0x100
module_attr_store+0x1b/0x30
kernfs_fop_write_iter+0x114/0x1b0
vfs_write+0x300/0x410
ksys_write+0x60/0xd0
do_syscall_64+0xa4/0x260
entry_SYSCALL_64_after_hwframe+0x77/0x7f
This issue can be reproduced by:
insmod locktorture.ko bind_writers=1
rmmod locktorture
or:
insmod locktorture.ko bind_writers=1
echo 2 > /sys/module/locktorture/parameters/bind_writers
Considering that setting the module parameter 'bind_writers' or
'bind_readers' by sysfs interface has no real effect, set the parameter
permissions to 0444. To fix the memory leak when removing module, free
'bind_writers' and 'bind_readers' memory in lock_torture_cleanup().
Fixes: 73e3412424 ("locktorture: Add readers_bind and writers_bind module parameters")
Suggested-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit makes CONFIG_PROVE_RCU=y kernels enforce the new rule
that srcu_struct structures that are passed to srcu_read_lock_fast()
and other SRCU-fast read-side markers be either initialized with
init_srcu_struct_fast() on the one hand or defined with DEFINE_SRCU_FAST()
or DEFINE_STATIC_SRCU_FAST() on the other.
This eliminates the read-side test that was formerly included in
srcu_read_lock_fast() and friends, speeding these primitives up by
about 25% (admittedly only about half of a nanosecond, but when tracing
on fastpaths...)
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit adds CONFIG_PROVE_RCU=y checking to enforce the new rule that
srcu_struct structures passed to srcu_read_lock_fast() and other SRCU-fast
read-side markers be either initialized with init_srcu_struct_fast()
on the one hand or defined using either DEFINE_SRCU_FAST() or
DEFINE_STATIC_SRCU_FAST(). This will enable removal of the non-debug
read-side checks from srcu_read_lock_fast() and friends, which on my
laptop provides a 25% speedup (which admittedly amounts to about half
a nanosecond, but when tracing fastpaths...)
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit causes the srcu_readers_unlock_idx() function to take the
srcu_struct structure's ->srcu_reader_flavor field into account. This
ensures that structures defined via DEFINE_SRCU_FAST( or initialized via
init_srcu_struct_fast() have their grace periods use synchronize_srcu()
or synchronize_srcu_expedited() instead of smp_mb(), even before the
first SRCU reader has been entered.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit creates DEFINE_SRCU_FAST() and DEFINE_STATIC_SRCU_FAST()
macros that are similar to DEFINE_SRCU() and DEFINE_STATIC_SRCU(),
but which create srcu_struct structures that are usable only by readers
initiated by srcu_read_lock_fast() and friends.
This commit does make DEFINE_SRCU_FAST() available to modules, in which
case the per-CPU srcu_data structures are not created at compile time, but
rather at module-load time. This means that the >srcu_reader_flavor field
of the srcu_data structure is not available. Therefore,
this commit instead creates an ->srcu_reader_flavor field in the
srcu_struct structure, adds arguments to the DEFINE_SRCU()-related
macros to initialize this new field, and extends the checks in the
__srcu_check_read_flavor() function to include this new field.
This commit also allows dynamically allocated srcu_struct structure
to be marked for SRCU-fast readers. It does so by defining a new
init_srcu_struct_fast() function that marks the specified srcu_struct
structure for use by srcu_read_lock_fast() and friends.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
This commit creates an srcu_expedite_current() function that expedites
the current (and possibly the next) SRCU grace period for the specified
srcu_struct structure. This functionality will be inherited by RCU
Tasks Trace courtesy of its mapping to SRCU fast.
If the current SRCU grace period is already waiting, that wait will
complete before the expediting takes effect. If there is no SRCU grace
period in flight, this function might well create one.
[ paulmck: Apply Zqiang feedback for PREEMPT_RT use. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The current Tiny SRCU implementation of srcu_read_unlock() awakens
the grace-period processing when exiting the outermost SRCU read-side
critical section. However, not all Linux-kernel configurations and
contexts permit swake_up_one() to be invoked while interrupts are
disabled, and this can result in indefinitely extended SRCU grace periods.
This commit therefore only invokes swake_up_one() when interrupts are
enabled, and introduces polling to the grace-period workqueue handler.
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Zqiang <qiang.zhang@linux.dev>
Closes: https://lore.kernel.org/oe-lkp/202508261642.b15eefbb-lkp@intel.com
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
The warned bitfields in struct scx_sched are updated racily from concurrent
CPUs causing RMW races, which is fine for these boolean warning flags. Add a
comment marking this area to prevent future fields that can't tolerate racy
updates from being added here.
Signed-off-by: Tejun Heo <tj@kernel.org>
Avoid case-sensitive sorting when listing Documentation targets in make help.
Previously, targets like PCI and RCU appeared ahead of others due to uppercase
names.
Normalize casing during _SPHINXDIRS generation to ensure consistent and
intuitive ordering.
Fixes: 965fc39f73 ("Documentation: sort _SPHINXDIRS for 'make help'")
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Bhanu Seshu Kumar Valluri <bhanuseshukumar@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251104061723.16629-1-bhanuseshukumar@gmail.com>
Commit 0122938a7a ("rtla: Always set all tracer options") changed the
behavior of RTLA to always set all osnoise and timerlat tracer options
to default values taken from the tracers whenever an RTLA measurement
is started. The change was done to make RTLA results consistent on
subsequent runs of the same command.
Include the default values for tracer options also in documentation
where appropriate.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251010083338.478961-10-tglozar@redhat.com>
The timerlat tracer documentation mentions that threads are created with
real-time priority, but does not mention which priority and scheduling
class is used.
Add the information so that users do not have to look it up in
trace_osnoise.c.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251010083338.478961-9-tglozar@redhat.com>
The RTLA option -C/--cgroup is used to set a cgroup for workload
threads. This is either a specific cgroup, if passed an argument, or
rtla's cgroup, if no argument is given.
Expand the documentation of the -C option to also include the
information about the cgroup settings when the option is not specified.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251010083338.478961-8-tglozar@redhat.com>
RTLA allows the priority of workload threads to be set using the -P
option. This is covered in docs, but the default state for RTLA's own
user workload (implemented in timerlat_u.c) is not mentioned.
Add mention of the default user workload priority as well as a reference
to osnoise and timerlat tracers for kernel workload priority.
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251010083338.478961-7-tglozar@redhat.com>
Several options in common_options.rst say "osnoise tracer" for both
osnoise and timerlat.
Use |tool| variable so that the correct tool name is used.
Fixes: b1be48307d ("rtla: Add rtla osnoise top documentation")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251010083338.478961-6-tglozar@redhat.com>
When kernel-doc parses the sections for the documentation some errors
may occur. In many cases the warning is simply stored to the current
"entry" object. However, in the most of such cases this object gets
discarded and there is no way for the output engine to even know about
that. To avoid that, check if the "entry" is going to be discarded and
if there warnings have been collected, issue them to the current logger
as is and then flush the "entry". This fixes the problem that original
Perl implementation doesn't have.
As of Linux kernel v6.18-rc4 the reproducer can be:
$ scripts/kernel-doc -v -none -Wall include/linux/util_macros.h
...
Info: include/linux/util_macros.h:138 Scanning doc for function to_user_ptr
...
while with the proposed change applied it gives one more line:
$ scripts/kernel-doc -v -none -Wall include/linux/util_macros.h
...
Info: include/linux/util_macros.h:138 Scanning doc for function to_user_ptr
Warning: include/linux/util_macros.h:144 expecting prototype for to_user_ptr(). Prototype was for u64_to_user_ptr() instead
...
And with the original Perl script:
$ scripts/kernel-doc.pl -v -none -Wall include/linux/util_macros.h
...
include/linux/util_macros.h:139: info: Scanning doc for function to_user_ptr
include/linux/util_macros.h:149: warning: expecting prototype for to_user_ptr(). Prototype was for u64_to_user_ptr() instead
...
Fixes: 9cbc2d3b13 ("scripts/kernel-doc.py: postpone warnings to the output plugin")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251104215502.1049817-1-andriy.shevchenko@linux.intel.com>
The current cpuset code passes a local isolcpus_updated flag around in a
number of functions to determine if external isolation related cpumasks
like wq_unbound_cpumask should be updated. It is a bit cumbersome and
makes the code more complex. Simplify the code by using a global boolean
flag "isolated_cpus_updating" to track this. This flag will be set in
isolated_cpus_update() and cleared in update_isolation_cpumasks().
No functional change is expected.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 4a74e41888 ("cgroup/cpuset: Check partition conflict with
housekeeping setup") is supposed to ensure that domain isolated CPUs
designated by the "isolcpus" boot command line option stay either in
root partition or in isolated partitions. However, the required check
wasn't implemented when a remote partition was created or when an
existing partition changed type from "root" to "isolated".
Even though this is a relatively minor issue, we still need to add the
required prstate_housekeeping_conflict() call in the right places to
ensure that the rule is strictly followed.
The following steps can be used to reproduce the problem before this
fix.
# fmt -1 /proc/cmdline | grep isolcpus
isolcpus=9
# cd /sys/fs/cgroup/
# echo +cpuset > cgroup.subtree_control
# mkdir test
# echo 9 > test/cpuset.cpus
# echo isolated > test/cpuset.cpus.partition
# cat test/cpuset.cpus.partition
isolated
# cat test/cpuset.cpus.effective
9
# echo root > test/cpuset.cpus.partition
# cat test/cpuset.cpus.effective
9
# cat test/cpuset.cpus.partition
root
With this fix, the last few steps will become:
# echo root > test/cpuset.cpus.partition
# cat test/cpuset.cpus.effective
0-8,10-95
# cat test/cpuset.cpus.partition
root invalid (partition config conflicts with housekeeping setup)
Reported-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Move up the prstate_housekeeping_conflict() helper so that it can be
used in remote partition code.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently the user can set up isolated cpus via cpuset and nohz_full in
such a way that leaves no housekeeping CPU (i.e. no CPU that is neither
domain isolated nor nohz full). This can be a problem for other
subsystems (e.g. the timer wheel imgration).
Prevent this configuration by blocking any assignation that would cause
the union of domain isolated cpus and nohz_full to covers all CPUs.
[longman: Remove isolated_cpus_should_update() and rewrite the checking
in update_prstate() and update_parent_effective_cpumask()]
Originally-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
update_unbound_workqueue_cpumask() updates unbound workqueues settings
when there's a change in isolated CPUs, but it can be used for other
subsystems requiring updated when isolated CPUs change.
Generalise the name to update_isolation_cpumasks() to prepare for other
functions unrelated to workqueues to be called in that spot.
[longman: Change the function name to update_isolation_cpumasks()]
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The files in Documentation/core-api/ are by virtue of their top-level
directory part of the Documentation section in MAINTAINERS. Each file in
Documentation/core-api/ should however also have a further section in
MAINTAINERS it belongs to, which fits to the technical area of the
documented API in that file.
The printk.rst provides some explanation to the printk API defined in
include/linux/printk.h, which itself is part of the PRINTK section.
Add this core-api document to PRINTK.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Link: https://patch.msgid.link/20251105102832.155823-1-lukas.bulwahn@redhat.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
- Use memset() in scx_task_iter_start() instead of zeroing fields individually.
- In scx_task_iter_next(), move __scx_task_iter_maybe_relock() after the batch
check which is simpler.
- Update comment to reflect that tasks are removed from scx_tasks when dead
(commit 7900aa699c ("sched_ext: Fix cgroup exit ordering by moving
sched_ext_free() to finish_task_switch()")).
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
The BUILD_BUG_ON() which checks that __SCX_DSQ_ITER_ALL_FLAGS doesn't
overlap with the private lnode bits was in scx_task_iter_start() which has
nothing to do with DSQ iteration. Move it to bpf_iter_scx_dsq_new() where it
belongs.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Due to commit abd61d1ff8 ("scripts: sphinx-pre-install: move it to
tools/docs"), checkpatch.pl --self-test=patterns reported a non-matching
file entry in DOCUMENTATION SCRIPTS. Clearly, there are now multiple
documentation scripts, all located in Documentation/sphinx/ and tools/docs/
and Mauro is the maintainer of those.
Update the DOCUMENTATION SCRIPTS section to cover these directories. While
at it, also make the DOCUMENTATION section cover the subdirectories of
tools/docs/.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251103075948.26026-1-lukas.bulwahn@redhat.com>
Our documentation-related tools are spread out over various directories;
several are buried in the scripts/ dumping ground. That makes them harder
to discover and harder to maintain.
Recent work has started accumulating our documentation-related tools in
/tools/docs. This series nearly completes that task, moving most of the
rest of our various utilities there, hopefully fixing up all of the
relevant references in the process.
The one exception is scripts/kernel-doc; that move turned up some other
problems, so I have dropped it until those are ironed out.
At the end, rather than move the old, Perl kernel-doc, I simply removed it.
Chinese translation docs for 6.18
This is the Chinese translation subtree for 6.18. It includes
the following changes:
- docs/zh_CN: Add rust Chinese translations
- docs/zh_CN: Add scsi Chinese translations
- docs/zh_CN: Add gfs2 Chinese translations
- Add some other Chinese translations and fixes
Above patches are tested by 'make htmldocs'
sched_ext_free() was called from __put_task_struct() when the last reference
to the task is dropped, which could be long after the task has finished
running. This causes cgroup-related problems:
- ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
during scheduler load, because the cgroup might be destroyed/unlinked
while the zombie or dead task is still lingering on the scx_tasks list.
- ops.cgroup_exit() could be called before ops.exit_task() is called on all
member tasks, leading to incorrect exit ordering.
Fix by moving it to finish_task_switch() to be called right after the final
context switch away from the dying task, matching when sched_class->task_dead()
is called. Rename it to sched_ext_dead() to match the new calling context.
By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
- Tasks visible on scx_tasks list have valid cgroups during scheduler load,
as cgroup_mutex prevents cgroup destruction while the task is still linked.
- All member tasks have ops.exit_task() called and are removed from scx_tasks
before the cgroup can be destroyed and trigger ops.cgroup_exit().
This fix is made possible by the cgroup_task_dead() split in the previous patch.
This also makes more sense resource-wise as there's no point in keeping
scheduler side resources around for dead tasks.
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull cgroup/for-6.19 to receive:
16dad7801a ("cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()")
260fbcb92b ("cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()")
d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done switching out")
These are needed for the sched_ext cgroup exit ordering fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task
from its cgroup. From the cgroup's perspective, the task is now gone. If this
makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks
that notify controllers the cgroup is going offline resource-wise.
However, the exiting task can still run, perform memory operations, and schedule
until the final context switch in finish_task_switch(). This creates a confusing
situation where controllers are told a cgroup is offline while resource
activities are still happening in it. While this hasn't broken existing
controllers, it has caused direct confusion for sched_ext schedulers.
Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls
the subsystem exit callbacks and continues to be called from do_exit(). The
css_set cleanup is moved to the new cgroup_task_dead() which is called from
finish_task_switch() after the final context switch, so that the cgroup only
appears empty after the task is truly done running.
This also reorders operations so that subsys->exit() is now called before
unlinking from the cgroup, which shouldn't break anything.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, cgroup_task_exit() adds thread group leaders with live member
threads to their css_set's dying_tasks list (so cgroup.procs iteration can
still see the leader), and cgroup_task_release() later removes them with
list_del_init(&task->cg_list).
An upcoming patch will defer the dying_tasks list addition, moving it from
cgroup_task_exit() (called from do_exit()) to a new function called from
finish_task_switch(). However, release_task() (which calls
cgroup_task_release()) can run either before or after finish_task_switch(),
creating a race where cgroup_task_release() might try to remove the task from
dying_tasks before or while it's being added.
Move the list_del_init() from cgroup_task_release() to cgroup_task_free() to
fix this race. cgroup_task_free() runs from __put_task_struct(), which is
always after both paths, making the cleanup safe.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The current names cgroup_exit(), cgroup_release(), and cgroup_free() are
confusing because they look like they're operating on cgroups themselves when
they're actually task lifecycle hooks. For example, cgroup_init() initializes
the cgroup subsystem while cgroup_exit() is a task exit notification to
cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and
cgroup_task_free() to make it clear that these operate on tasks.
Cc: Dan Schatzberg <dschatzberg@meta.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Along the code reorganizations, the file has been keeping the original
comments about argv and envp which are no longer relevant to this file
anymore. Let's just drop them.
Acked-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Willy Tarreau <w@1wt.eu>
The efforts we go through by installing a single arch are counter
productive when the base directory already supports them all, and
the arch-specific files are really small. Let's make the "headers"
target simply install headers for all supported archs and stop
trying to build a hybrid "arch.h" file on the fly, to instead keep
the generic one. Now the same nolibc headers installation will be
usable with any arch-specific uapi installation.
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
It's often recommended to only use inttypes.h instead of stdint.h for
portability reasons since the former is always present when the latter
is present, but not conversely, and the former includes the latter. Due
to this some simple programs fail to build when including inttypes.h.
Let's add one that simply includes nolibc.h to better support these
programs.
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Modern programs tend to include sys/select.h to get FD_SET() and
FD_CLR() definitions as well as struct fd_set, but in our case it
didn't exist.
The definitions were moved from types.h to sys/select.h, which is
now included from nolibc.h, and the sys_select() definition moved
there as well from sys.h.
Signed-off-by: Willy Tarreau <w@1wt.eu>
[Thomas: adapt to current -next branch]
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Surprisingly we forgot to add this common one. It was added with a
per-arch guard allowing to later implement it in arch-specific asm
code like was done for a few other ones.
The test verifies that we don't search past the indicated length.
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
The help message says the headers are going to be installed into
tools/include/nolibc but this is only the default if $OUTPUT is not set,
so better clarify this (the current value of $OUTPUT is already shown).
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Replace the manual string copying and parsing logic with a call to
simple_strtoull() to simplify and improve qat_uclo_parse_num().
Ensure that the parsed number does not exceed UINT_MAX, and add an
approximate upper-bound check (no more than 19 digits) to guard against
overflow.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Before calling acc_get_sgl(), the mem_block has already been
created. acc_get_sgl() will not return NULL or any other error.
so the return value check can be removed.
Signed-off-by: nieweiqiang <nieweiqiang@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The isolate_strategy_store function is not protected
by a lock. If sysfs operations and functions that depend
on the err_threshold variable,such as qm_hw_err_isolate(),
execute concurrently, the outcome will be unpredictable.
Therefore, concurrency protection should be added for
the err_threshold variable.
Signed-off-by: nieweiqiang <nieweiqiang@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The eqe and aeqe are device updated values that include the
valid bit and queue number. In the current process, there is no
memory barrier added, so it cannot be guaranteed that the valid
bit is read before other processes are executed. Since eqe and aeqe
are only 4 bytes and the device writes them to memory in a single
operation, saving the values of eqe and aeqe ensures that the valid
bit and queue number read by the CPU were written by the device
simultaneously.
Signed-off-by: nieweiqiang <nieweiqiang@huawei.com>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
On multiple occasions the qce device have shown up in devices_deferred,
without the explanation that this came from the failure to acquire the
DMA channels from the associated BAM.
Use dev_err_probe() to associate this context with the failure to faster
pinpoint the culprit when this happens in the future.
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
Reviewed-by: Abel Vesa <abel.vesa@linaro.org>
Reviewed-by: David Heidelberg <david@ixit.cz>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add the __counted_by() compiler attribute to the flexible array member
'data' to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add support for XTS mode of operation for AES algorithm in the AES
Engine of the DTHEv2 hardware cryptographic engine.
Signed-off-by: T Pratham <t-pratham@ti.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch introduces infrastructure for allocating req objects on the
stack for AEADs. The additions mirror the existing sync skcipher APIs.
This can be used in cases where simple sync AEAD operations are being
done. So allocating the request on stack avoides possible out-of-memory
errors.
The struct crypto_sync_aead is a wrapper around crypto_aead and should
be used in its place when sync only requests will be done on the stack.
Correspondingly, the request should be allocated with
SYNC_AEAD_REQUEST_ON_STACK().
Similar to sync_skcipher APIs, the new sync_aead APIs are wrappers
around the regular aead APIs to facilitate sync only operations. The
following crypto APIs are added:
- struct crypto_sync_aead
- crypto_alloc_sync_aead()
- crypto_free_sync_aead()
- crypto_aync_aead_tfm()
- crypto_sync_aead_setkey()
- crypto_sync_aead_setauthsize()
- crypto_sync_aead_authsize()
- crypto_sync_aead_maxauthsize()
- crypto_sync_aead_ivsize()
- crypto_sync_aead_blocksize()
- crypto_sync_aead_get_flags()
- crypto_sync_aead_set_flags()
- crypto_sync_aead_clear_flags()
- crypto_sync_aead_reqtfm()
- aead_request_set_sync_tfm()
- SYNC_AEAD_REQUEST_ON_STACK()
Signed-off-by: T Pratham <t-pratham@ti.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This reverts commit d3e7609c6e5ec92587ed1043a985749d22cc78d1.
The commit cause a warning:
Documentation/networking/skbuff.rst:34: WARNING: duplicate label crc, other instance in Documentation/translations/zh_CN/networking/skbuff.rst
And there's no simple way to keep the meaningful doc context and avoid the
warning, so, let's remove the doc.
Signed-off-by: Alex Shi <alexs@kernel.org>
The python version of the kernel-doc parser emits some strange warnings
with just a line number in certain cases:
$ ./scripts/kernel-doc -Wall -none 'include/linux/virtio_config.h'
Warning: 174
Warning: 184
Warning: 190
Warning: include/linux/virtio_config.h:226 No description found for return value of '__virtio_test_bit'
Warning: include/linux/virtio_config.h:259 No description found for return value of 'virtio_has_feature'
Warning: include/linux/virtio_config.h:283 No description found for return value of 'virtio_has_dma_quirk'
Warning: include/linux/virtio_config.h:392 No description found for return value of 'virtqueue_set_affinity'
I eventually tracked this down to the lone call of emit_msg() in the
KernelEntry class, which looks like:
self.emit_msg(self.new_start_line, f"duplicate section name '{name}'\n")
This looks like all the other emit_msg calls. Unfortunately, the definition
within the KernelEntry class takes only a message parameter and not a line
number. The intended message is passed as the warning!
Pass the filename to the KernelEntry class, and use this to build the log
message in the same way as the KernelDoc class does.
To avoid future errors, mark the warning parameter for both emit_msg
definitions as a keyword-only argument. This will prevent accidentally
passing a string as the warning parameter in the future.
Also fix the call in dump_section to avoid an unnecessary additional
newline.
Fixes: e3b42e94cf ("scripts/lib/kdoc/kdoc_parser.py: move kernel entry to a class")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251030-jk-fix-kernel-doc-duplicate-return-warning-v2-1-ec4b5c662881@intel.com>
Use core_param() and __core_param_cb() instead of __setup() or
__setup_param() to improve syntax checking and error messages.
Replace get_option() with kstrtouint(), because:
* the latter accepts a pointer to const char,
* these parameters should not accept ranges,
* error value can be passed directly to parser.
There is one more change apart from the parsing of numeric parameters:
slab_strict_numa parameter name must match exactly. Before this patch the
kernel would silently accept any option that starts with the name as an
undocumented alias.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://patch.msgid.link/6ae7e0ddc72b7619203c07dd5103a598e12f713b.1761324765.git.ptesarik@suse.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
printk() tries to flush messages with NBCON_PRIO_EMERGENCY on
nbcon consoles immediately. It might take seconds to flush all
pending lines on slow serial consoles. Note that there might be
hundreds of messages, for example:
[ 3.771531][ T1] pci 0000:3e:08.1: [8086:324
** replaying previous printk message **
[ 3.771531][ T1] pci 0000:3e:08.1: [8086:3246] type 00 class 0x088000 PCIe Root Complex Integrated Endpoint
[ ... more than 2000 lines, about 200kB messages ... ]
[ 3.837752][ T1] pci 0000:20:01.0: Adding to iommu group 18
[ 3.837851][ T
** replaying previous printk message **
[ 3.837851][ T1] pci 0000:20:03.0: Adding to iommu group 19
[ 3.837946][ T1] pci 0000:20:05.0: Adding to iommu group 20
[ ... more than 500 messages for iommu groups 21-590 ...]
[ 3.912932][ T1] pci 0000:f6:00.1: Adding to iommu group 591
[ 3.913070][ T1] pci 0000:f6:00.2: Adding to iommu group 592
[ 3.913243][ T1] DMAR: Intel(R) Virtualization Technology for Directed I/O
[ 3.913245][ T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 3.913245][ T1] software IO TLB: mapped [mem 0x000000004f000000-0x0000000053000000] (64MB)
[ 3.913324][ T1] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 655360 ms ovfl timer
[ 3.913325][ T1] RAPL PMU: hw unit of domain package 2^-14 Joules
[ 3.913326][ T1] RAPL PMU: hw unit of domain dram 2^-14 Joules
[ 3.913327][ T1] RAPL PMU: hw unit of domain psys 2^-0 Joules
[ 3.933486][ T1] ------------[ cut here ]------------
[ 3.933488][ T1] WARNING: CPU: 2 PID: 1 at arch/x86/events/intel/uncore.c:1156 uncore_pci_pmu_register+0x15e/0x180
[ 3.930291][ C0] watchdog: Watchdog detected hard LOCKUP on cpu 0
[ 3.930291][ C0] Kernel panic - not syncing: Hard LOCKUP
[...]
[ 3.930291][ C0] CPU: 0 UID: 0 PID: 18 Comm: pr/ttyS0 Not tainted...
[...]
[ 3.930291][ C0] RIP: 0010:nbcon_reacquire_nobuf+0x11/0x50
[ 3.930291][ C0] Call Trace:
[...]
[ 3.930291][ C0] <TASK>
[ 3.930291][ C0] serial8250_console_write+0x16d/0x5c0
[ 3.930291][ C0] nbcon_emit_next_record+0x22c/0x250
[ 3.930291][ C0] nbcon_emit_one+0x93/0xe0
[ 3.930291][ C0] nbcon_kthread_func+0x13c/0x1c0
The are visible two takeovers of the console ownership:
- The 1st one is triggered by the "WARNING: CPU: 2 PID: 1 at
arch/x86/..." line printed with NBCON_PRIO_EMERGENCY.
- The 2nd one is triggered by the "Kernel panic - not syncing:
Hard LOCKUP" line printed with NBCON_PRIO_PANIC.
There are more than 2500 lines, at about 240kB, emitted between
the takeover and the 1st "WARNING" line in the emergency context.
This amount of pending messages had to be flushed by
nbcon_atomic_flush_pending() when WARN() printed its first line.
The atomic flush was holding the nbcon console context for too long so
that it triggered hard lockup on the CPU running the printk kthread
"pr/ttyS0". The kthread needed to reacquire the console ownership
for restoring the original serial port state in serial8250_console_write().
Prevent the hardlockup by releasing the nbcon console ownership after
each emitted record.
Note that __nbcon_atomic_flush_pending_con() used to hold the console
ownership all the time because it blocked the printk kthread. Otherwise
the kthread tried to flush the messages in parallel which caused repeated
takeovers and more replayed messages.
It is not longer a problem because the repeated takeovers are blocked
by the counter of emergency contexts, see nbcon_cpu_emergency_cnt.
Link: https://lore.kernel.org/all/aNQO-zl3k1l4ENfy@pathway.suse.cz
Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250926124912.243464-4-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
The printk kthread might be running when there is a panic in progress.
But it is not able to acquire the console ownership any longer.
Prevent the desperate attempts to acquire the ownership and allow sleeping
in panic. It would make it behave the same as when there is any CPU
in an emergency context.
Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250926124912.243464-3-pmladek@suse.com
[pmladek@suse.com: Rebased on top of 6.18-rc1 (panic_in_progress() moved to panic.c)]
Signed-off-by: Petr Mladek <pmladek@suse.com>
In emergency contexts, printk() tries to flush messages directly even
on nbcon consoles. And it is allowed to takeover the console ownership
and interrupt the printk kthread in the middle of a message.
Only one takeover and one repeated message should be enough in most
situations. The first emergency message flushes the backlog and printk
kthreads get to sleep. Next emergency messages are flushed directly
and printk() does not wake up the kthreads.
However, the one takeover is not guaranteed. Any printk() in normal
context on another CPU could wake up the kthreads. Or a new emergency
message might be added before the kthreads get to sleep. Note that
the interrupted .write_thread() callbacks usually have to call
nbcon_reacquire_nobuf() and restore the original device setting
before checking for pending messages.
The risk of the repeated takeovers will be even bigger because
__nbcon_atomic_flush_pending_con is going to release the console
ownership after each emitted record. It will be needed to prevent
hardlockup reports on other CPUs which are busy waiting for
the context ownership, for example, by nbcon_reacquire_nobuf() or
__uart_port_nbcon_acquire().
The repeated takeovers break the output, for example:
[ 5042.650211][ T2220] Call Trace:
[ 5042.6511
** replaying previous printk message **
[ 5042.651192][ T2220] <TASK>
[ 5042.652160][ T2220] kunit_run_
** replaying previous printk message **
[ 5042.652160][ T2220] kunit_run_tests+0x72/0x90
[ 5042.653340][ T22
** replaying previous printk message **
[ 5042.653340][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5042.654628][ T2220] ? stack_trace_save+0x4d/0x70
[ 5042.6553
** replaying previous printk message **
[ 5042.655394][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
[ 5042.656713][ T2220] ? save_trace+0x5b/0x180
A more robust solution is to block the printk kthread entirely whenever
*any* CPU enters an emergency context. This ensures that critical messages
can be flushed without contention from the normal, non-atomic printing
path.
Link: https://lore.kernel.org/all/aNQO-zl3k1l4ENfy@pathway.suse.cz
Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250926124912.243464-2-pmladek@suse.com
[pmladek@suse.com: Added changes proposed by John Ogness]
Signed-off-by: Petr Mladek <pmladek@suse.com>
For PR_SPEC_STORE_BYPASS and PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE
means "disable the speculation bug" i.e. "enable the mitigation".
For PR_SPEC_L1D_FLUSH, PR_SPEC_DISABLE means "disable the mitigation".
This is not obvious, so document it.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251015-l1d-flush-doc-v1-1-f8cefea3f2f2@google.com>
Commit 111a79800a ("tools/sched_ext: Strip compatibility macros for
cgroup and dispatch APIs") removed the compat layer for v6.12-v6.13 kfunc
renaming, but v6.12 is the current LTS kernel and will remain supported
through 2026. Restore backward compatibility so schedulers built with v6.19+
headers can run on v6.12 kernels.
The restored compat differs from the original in two ways:
1. Uses ___new/___old suffixes instead of ___compat for clarity. The new
macros check for v6.13+ names (scx_bpf_dsq_move*), fall back to v6.12
names (scx_bpf_dispatch_from_dsq*, scx_bpf_consume), then return safe
no-ops for missing symbols.
2. Integrates with the args-struct-packing changes added in c0d630ba34
("sched_ext: Wrap kfunc args in struct to prepare for aux__prog").
scx_bpf_dsq_insert_vtime() now tries __scx_bpf_dsq_insert_vtime (args
struct), then scx_bpf_dsq_insert_vtime___compat (v6.13-v6.18), then
scx_bpf_dispatch_vtime___compat (pre-v6.13).
Forward compatibility is not restored - binaries built against v6.13 or
earlier headers won't run on v6.19+ kernels, as the old kfunc names are not
exported. This is acceptable since the priority is new binaries running on
older kernels.
Also add missing compat checks for ops.cgroup_set_bandwidth() (added v6.17)
and ops.cgroup_set_idle() (added v6.19). These need to be NULLed out in
userspace on older kernels.
Reported-by: Andrea Righi <arighi@nvidia.com>
Acked-and-tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
This is generally useful and struct iovec is also needed for other
purposes such as ptrace.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
In principle, it is possible to use nolibc for only some object files in
a program. In that case, the startup code in _start and _start_c is not
going to be used. Add the NOLIBC_NO_RUNTIME compile time option to
disable it entirely and also remove anything that depends on it.
Doing this avoids warnings from modpost for UML as the _start_c code
references the main function from the .init.text section while it is not
inside .init itself.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Use the version of the attribute with underscores to avoid issues if
fallthrough has been defined by another header file already.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
For improved compatibility, print %m as "unknown error" when nolibc is
compiled using NOLIBC_IGNORE_ERRNO.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Using errno is not possible when NOLIBC_IGNORE_ERRNO is set. Use
sys_lseek instead of lseek as that avoids using errno.
Fixes: 665fa8dea9 ("tools/nolibc: add support for directory access")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
There is no errno variable when NOLIBC_IGNORE_ERRNO is defined. As such,
simply print the message with "unknown error" rather than the integer
value of errno.
Fixes: acab7bcdb1 ("tools/nolibc/stdio: add perror() to report the errno value")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Since commit fb01ff635e ("tools/nolibc: keep brk(), sbrk(), mmap()
away from __sysret()") the implementation of mmap() does not use the
__sysret() macro anymore.
Remove the outdated comment.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
The ops.cpu_acquire/release() callbacks miss events under multiple conditions.
There are two distinct task dispatch gaps that can cause cpu_released flag
desynchronization:
1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
can enqueue a task, but during consume_remote_task() when the rq lock is
released, a higher priority task can be enqueued and ultimately picked while
cpu_released remains false. This gap is closeable via RETRY_TASK handling.
2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
DSQ. By the time the sched path runs on the target CPU, higher class tasks may
already be queued. In such cases, nothing on sched_ext side will be invoked,
and the only solution would be a hook invoked regardless of sched class, which
isn't desirable.
Rather than adding invasive core hooks, BPF schedulers can use generic BPF
mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent
with other mechanisms it already uses and doesn't add further friction.
The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU gets preempted by a higher priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.
Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.
Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.
Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a
new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the
BPF kfunc checks and calls reenq_local() to perform the actual work.
This is a prep patch to allow reenq_local() to be called from other contexts.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
schedule_deferred() currently requires the rq lock to be held so that it can
use scheduler hooks for efficiency when available. However, there are cases
where deferred actions need to be scheduled from contexts that don't hold the
rq lock.
Split into schedule_deferred() which can be called from any context and just
queues irq_work, and schedule_deferred_locked() which requires the rq lock and
can optimize by using scheduler hooks when available. Update the existing call
site to use the _locked variant.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull to receive f4fa7c25f6 ("sched_ext: Fix use of uninitialized variable
in scx_bpf_cpuperf_set()") which conflicts with changes for planned
sub-sched changes.
printk_legacy_map is used to hide lock nesting violations caused by
legacy drivers and is using the wrong override type. LD_WAIT_SLEEP is
for always sleeping lock types such as mutex_t. LD_WAIT_CONFIG is for
lock type which are sleeping while spinning on PREEMPT_RT such as
spinlock_t.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/20251026150726.GA23223@redhat.com
[pmladek@suse.com: Fixed indentation.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
We've been using the Python version and nobody has missed this one. All
credit goes to Mauro Carvalho Chehab for creating the replacement.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Move this tool out of scripts/ to join the other documentation tools; fix
up a couple of erroneous references in the process.
It's worth noting that this script will fail badly unless one has a
PYTHONPATH referencing scripts/lib/abi.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
The scripts for managing the features docs are found in three different
directories; unite them all under tools/docs and update references as
needed.
Reviewed-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
ddf7233fca ("sched/ext: Fix invalid task state transitions on class
switch") added tryget_task_struct() test during scx_enable()'s class
switching loop. The reason for the addition was to avoid enabling tasks which
skipped prep in the previous loop due to being dead.
While tryget_task_struct() does work for this purpose as tasks that fail
tryget always will fail it, it's a bit roundabout. A more direct way is
testing whether the task is in READY state. Switch to testing SCX_TASK_READY
directly.
Cc: Andrea Righi <arighi@nvidia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
A later commit will reduce the size of the RCU watching counter to free up
some bits for another purpose. Paul suggested adding a config option to
test the extreme case where the counter is reduced to its minimum usable
width for rcutorture to poke at, so do that.
Make it only configurable under RCU_EXPERT. While at it, add a comment to
explain the layout of context_tracking->state.
Link: http://lore.kernel.org/r/4c2cb573-168f-4806-b1d9-164e8276e66a@paulmck-laptop
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
I recently got occasional build failures at -Os or -Oz that would always
involve waitpid(), where the assembler would complain about this:
init.s: Error: .size expression for waitpid.constprop.0 does not evaluate to a constant
And without -fno-asynchronous-unwind-tables it could also spit such
errors:
init.s:836: Error: CFI instruction used without previous .cfi_startproc
init.s:838: Error: .cfi_endproc without corresponding .cfi_startproc
init.s: Error: open CFI at the end of file; missing .cfi_endproc directive
A trimmed down reproducer is as simple as this:
int main(int argc, char **argv)
{
int ret, status;
if (argc == 0)
ret = waitpid(-1, &status, 0);
else
ret = waitpid(-1, &status, 0);
return status;
}
It produces the following asm code on x86_64:
.text
.section .text.nolibc_memmove_memcpy
.weak memmove
.weak memcpy
memmove:
memcpy:
movq %rdx, %rcx
(...)
retq
.section .text.nolibc_memset
.weak memset
memset:
xchgl %eax, %esi
movq %rdx, %rcx
pushq %rdi
rep stosb
popq %rax
retq
.type waitpid.constprop.0.isra.0, @function
waitpid.constprop.0.isra.0:
subq $8, %rsp
(...)
jmp *.L5(,%rax,8)
.section .rodata
.align 8
.align 4
.L5:
.quad .L10
(...)
.quad .L4
.text
.L10:
(...)
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE273:
.size waitpid.constprop.0.isra.0, .-waitpid.constprop.0.isra.0
It's a bit dense, but here's the explanation: the compiler has emitted a
".text" statement because it knows it's working in the .text section.
Then, our hand-written asm code for the mem* functions forced the section
to .text.something without the compiler knowing about it, so it thinks
the code is still being emitted for .text. As such, without any .section
statement, the waitpid.constprop.0.isra.0 label is in fact placed in the
previously created section, here .text.nolibc_memset.
The waitpid() function involves a switch/case statement that can be
turned to a jump table, which is what the compiler does with the .rodata
section, and after that it restores .text, which is no longer the
previous .text.nolibc_memset section. Then the CFI statements cross a
section, so does the .size calculation, which explains the error.
While a first approach consisting in placing an explicit ".text" at the
end of these functions was verified to work, it's still unreliable as
it depends on what the compiler remembers having emitted previously. A
better approach is to replace the ".section" with ".pushsection", and
place a ".popsection" at the end, so that these code blocks are agnostic
to where they're placed relative to other blocks.
Fixes: 553845eebd ("tools/nolibc: x86-64: Use `rep movsb` for `memcpy()` and `memmove()`")
Fixes: 12108aa8c1 ("tools/nolibc: x86-64: Use `rep stosb` for `memset()`")
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Use __core_param_cb() to parse the "slab_debug" kernel parameter instead of
the obsolescent __setup(). For now, the parameter is not exposed in sysfs,
and no get ops is provided.
There is a slight change in behavior. Before this patch, the following
parameter would silently turn on full debugging for all slabs:
slub_debug_yada_yada_gotta_love_this=hail_satan!
This syntax is now rejected, and the parameter will be passed to user
space, making the kernel a holier place.
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Link: https://patch.msgid.link/9674b34861394088c7853edf8e9d2b439fd4b42f.1761324765.git.ptesarik@suse.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2dbbdeda77 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary
compatibility") renamed the new bool-returning variant to scx_bpf_dsq_insert___v2
in the kernel. However, libbpf currently only strips ___SUFFIX on the BPF side,
not on kernel symbols, so the compat wrapper couldn't match the kernel kfunc and
would always fall back to the old variant even when the new one was available.
Add an extra ___compat suffix as a workaround - libbpf strips one suffix on the
BPF side leaving ___v2, which then matches the kernel kfunc directly. In the
future when libbpf strips all suffixes on both sides, all suffixes can be
dropped.
Fixes: 2dbbdeda77 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility")
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
When removing a task from a FIFO DSQ, we must delete it from the list
before updating dsq->first_task, otherwise the following lookup will
just re-read the same task, leaving first_task pointing to removed
entry.
This issue only affects DSQs operating in FIFO mode, as priority DSQs
correctly update the rbtree before re-evaluating the new first task.
Remove the item from the list before refreshing the first task to
guarantee the correct behavior in FIFO DSQs.
Fixes: 44f5c8ec5b ("sched_ext: Add lockless peek operation for DSQs")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Function kdb_msg_write was calling con->write for any found console,
but it won't work on NBCON consoles. In this case we should acquire the
ownership of the console using NBCON_PRIO_EMERGENCY, since printing
kdb messages should only be interrupted by a panic.
At this point, the console is required to use the atomic callback. The
console is skipped if the write_atomic callback is not set or if the
context could not be acquired. The validation of NBCON is done by the
console_is_usable helper. The context is released right after
write_atomic finishes.
The oops_in_progress handling is only needed in the legacy consoles,
so it was moved around the con->write callback.
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Link: https://patch.msgid.link/20251016-nbcon-kgdboc-v6-5-866aac60a80e@suse.com
[pmladek@suse.com: Fixed compilation with !CONFIG_PRINTK.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
KDB can interrupt any console to execute the "mirrored printing" at any
time, so add an exception to nbcon_context_try_acquire_direct to allow
to get the context if the current CPU is the same as kdb_printf_cpu.
This change will be necessary for the next patch, which fixes
kdb_msg_write to work with NBCON consoles by calling ->write_atomic on
such consoles. But to print it first needs to acquire the ownership of
the console, so nbcon_context_try_acquire_direct is fixed here.
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://patch.msgid.link/20251016-nbcon-kgdboc-v6-3-866aac60a80e@suse.com
[pmladek@suse.com: Fix compilation with !CONFIG_KGDB_KDB.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
These helpers will be used when calling console->write_atomic on
KDB code in the next patch. It's basically the same implementation
as nbcon_device_try_acquire, but using NBCON_PRIO_EMERGENCY when
acquiring the context.
If the acquire succeeds, the message and message length are assigned to
nbcon_write_context so ->write_atomic can print the message.
After release try to flush the console since there may be a backlog of
messages in the ringbuffer. The kthread console printers do not get a
chance to run while kdb is active.
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Link: https://patch.msgid.link/20251016-nbcon-kgdboc-v6-2-866aac60a80e@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
On mobile device high-load situations, permission check can happen
more than 90,000/s (8 core system). With default 512 cache nodes
configuration, avc cache miss happens more often and occasionally
leads to long time (>2ms) irqs off on both big and little cores,
which decreases system real-time capability.
An actual call stack is as follows:
=> avc_compute_av
=> avc_perm_nonode
=> avc_has_perm_noaudit
=> selinux_capable
=> security_capable
=> capable
=> __sched_setscheduler
=> do_sched_setscheduler
=> __arm64_sys_sched_setscheduler
=> invoke_syscall
=> el0_svc_common
=> do_el0_svc
=> el0_svc
=> el0t_64_sync_handler
=> el0t_64_sync
Although we can expand avc nodes through /sys/fs/selinux/cache_threshold
to mitigate long time irqs off, hash conflicts make the bucket average
length longer because of the fixed size of cache slots, leading to
avc_search_node() latency increase.
So introduce a new config to make avc cache slot size also configurable,
and with fine tuning, we can mitigate long time irqs off with slightly
avc_search_node() performance regression.
Theoretically, the main overhead is memory consumption.
Signed-off-by: Hongru Zhang <zhanghongru@xiaomi.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The legacy printer kthread uses console_lock and
__console_flush_and_unlock to flush records to the console. This
approach results in the console_lock being held for the entire
duration of a flush. This can result in large waiting times for
those waiting for console_lock especially where there is a large
volume of records or where the console is slow (e.g. serial). This
contention is observed during boot, as the call to filp_open in
console_on_rootfs will delay progression to userspace until any
in-flight flush is completed.
Let's instead use console_flush_one_record and release/reacquire
the console_lock between records.
On a PocketBeagle 2, with the following boot args:
"console=ttyS2,9600 initcall_debug=1 loglevel=10"
Without this patch:
[ 5.613166] +console_on_rootfs/filp_open
[ 5.643473] mmc1: SDHCI controller on fa00000.mmc [fa00000.mmc] using ADMA 64-bit
[ 5.643823] probe of fa00000.mmc returned 0 after 258244 usecs
[ 5.710520] mmc1: new UHS-I speed SDR104 SDHC card at address 5048
[ 5.721976] mmcblk1: mmc1:5048 SD32G 29.7 GiB
[ 5.747258] mmcblk1: p1 p2
[ 5.753324] probe of mmc1:5048 returned 0 after 40002 usecs
[ 15.595240] ti_sci_pm_domains 44043000.system-controller:power-controller: sync_state() pending due to 30040000.pruss
[ 15.595282] ti_sci_pm_domains 44043000.system-controller:power-controller: sync_state() pending due to e010000.watchdog
[ 15.595297] ti_sci_pm_domains 44043000.system-controller:power-controller: sync_state() pending due to e000000.watchdog
[ 15.595437] ti_sci_pm_domains 44043000.system-controller:power-controller: sync_state() pending due to 30300000.crc
[ 146.275961] -console_on_rootfs/filp_open ...
and with:
[ 5.477122] +console_on_rootfs/filp_open
[ 5.595814] mmc1: SDHCI controller on fa00000.mmc [fa00000.mmc] using ADMA 64-bit
[ 5.596181] probe of fa00000.mmc returned 0 after 312757 usecs
[ 5.662813] mmc1: new UHS-I speed SDR104 SDHC card at address 5048
[ 5.674367] mmcblk1: mmc1:5048 SD32G 29.7 GiB
[ 5.699320] mmcblk1: p1 p2
[ 5.705494] probe of mmc1:5048 returned 0 after 39987 usecs
[ 6.418682] -console_on_rootfs/filp_open ...
...
[ 15.593509] ti_sci_pm_domains 44043000.system-controller:power-controller: sync_state() pending due to 30040000.pruss
[ 15.593551] ti_sci_pm_domains 44043000.system-controller:power-controller: sync_state() pending due to e010000.watchdog
[ 15.593566] ti_sci_pm_domains 44043000.system-controller:power-controller: sync_state() pending due to e000000.watchdog
[ 15.593704] ti_sci_pm_domains 44043000.system-controller:power-controller: sync_state() pending due to 30300000.crc
Where I've added a printk surrounding the call in console_on_rootfs
to filp_open.
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251020-printk_legacy_thread_console_lock-v3-3-00f1f0ac055a@thegoodpenguin.co.uk
[pmladek@suse.com: Fixed ordering of variable definition suggested by John.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
console_flush_one_record() and console_flush_all() duplicate several
checks. They both want to tell the caller that consoles are not
longer usable in this context because it has lost the lock or
the lock has to be reserved for the panic CPU.
Remove the duplication by changing the semantic of the function
console_flush_one_record() return value and parameters.
The function will return true when it is able to do the job. It means
that there is at least one usable console. And the flushing was
not interrupted by a takeover or panic_on_other_cpu().
Also replace the @any_usable parameter with @try_again. The @try_again
parameter will be set to true when the function could do the job
and at least one console made a progress.
Motivation:
The callers need to know when
+ they should continue flushing => @try_again
+ when the console is flushed => can_do_the_job(return) && !@try_again
+ when @next_seq is valid => same as flushed
+ when lost console_lock => @takeover
The proposed change makes it clear when the function can do
the job. It simplifies the answer for the other questions.
Also the return value from console_flush_one_record() can
be used as return value from console_flush_all().
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251020-printk_legacy_thread_console_lock-v3-2-00f1f0ac055a@thegoodpenguin.co.uk
[pmladek@suse.com: Fixed type of any_usable variable reported by John]
Signed-off-by: Petr Mladek <pmladek@suse.com>
console_flush_all prints all remaining records to all usable consoles
whilst its caller holds console_lock. This can result in large waiting
times for those waiting for console_lock especially where there is a
large volume of records or where the console is slow (e.g. serial).
Let's extract the parts of this function which print a single record
into a new function named console_flush_one_record. This can later
be used for functions that will release and reacquire console_lock
between records.
This commit should not change existing functionality.
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Murray <amurray@thegoodpenguin.co.uk>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20251020-printk_legacy_thread_console_lock-v3-1-00f1f0ac055a@thegoodpenguin.co.uk
Signed-off-by: Petr Mladek <pmladek@suse.com>
Instead of passing pkey_info into dump_options by value, using a
pointer instead.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
With gcc-14 I get
../drivers/crypto/allwinner/sun8i-ss/sun8i-ss-hash.c: In function ‘sun8i_ss_hash_run’:
../drivers/crypto/allwinner/sun8i-ss/sun8i-ss-hash.c:631:21: warning: ‘j’ may be used uninitialized [-Wmaybe-uninitialized]
631 | j = hash_pad(bf, 4096, j, byte_count, true, bs);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/crypto/allwinner/sun8i-ss/sun8i-ss-hash.c:493:13: note: ‘j’ was declared here
493 | int j, i, k, todo;
| ^
Fix this false positive by moving the initialisation of j earlier.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Use check_add_overflow() to guard against potential integer overflows
when adding the binary blob lengths and the size of an asymmetric_key_id
structure and return ERR_PTR(-EOVERFLOW) accordingly. This prevents a
possible buffer overflow when copying data from potentially malicious
X.509 certificate fields that can be arbitrarily large, such as ASN.1
INTEGER serial numbers, issuer names, etc.
Fixes: 7901c1a8ef ("KEYS: Implement binary asymmetric key ID handling")
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prior to this change, no security hooks were called at the creation of a
memfd file. It means that, for SELinux as an example, it will receive
the default type of the filesystem that backs the in-memory inode. In
most cases, that would be tmpfs, but if MFD_HUGETLB is passed, it will
be hugetlbfs. Both can be considered implementation details of memfd.
It also means that it is not possible to differentiate between a file
coming from memfd_create and a file coming from a standard tmpfs mount
point.
Additionally, no permission is validated at creation, which differs from
the similar memfd_secret syscall.
Call security_inode_init_security_anon during creation. This ensures
that the file is setup similarly to other anonymous inodes. On SELinux,
it means that the file will receive the security context of its task.
The ability to limit fexecve on memfd has been of interest to avoid
potential pitfalls where /proc/self/exe or similar would be executed
[1][2]. Reuse the "execute_no_trans" and "entrypoint" access vectors,
similarly to the file class. These access vectors may not make sense for
the existing "anon_inode" class. Therefore, define and assign a new
class "memfd_file" to support such access vectors.
Guard these changes behind a new policy capability named "memfd_class".
[1] https://crbug.com/1305267
[2] https://lore.kernel.org/lkml/20221215001205.51969-1-jeffxu@google.com/
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Tested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
[PM: subj tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
The LSM framework itself registers a small number of initcalls, this
patch converts these initcalls into the new initcall mechanism.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
SELinux currently has a number of initcalls so we've created a new
function, selinux_initcall(), which wraps all of these initcalls so
that we have a single initcall function that can be registered with the
LSM framework.
Signed-off-by: Paul Moore <paul@paul-moore.com>
This patch converts IMA and EVM to use the LSM frameworks's initcall
mechanism. It moved the integrity_fs_init() call to ima_fs_init() and
evm_init_secfs(), to work around the fact that there is no "integrity" LSM,
and introduced integrity_fs_fini() to remove the integrity directory, if
empty. Both integrity_fs_init() and integrity_fs_fini() support the
scenario of being called by both the IMA and EVM LSMs.
This patch does not touch any of the platform certificate code that
lives under the security/integrity/platform_certs directory as the
IMA/EVM developers would prefer to address that in a future patchset.
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Acked-by: Mimi Zohar <zohar@linux.ibm.com>
[PM: adjust description as discussed over email]
Signed-off-by: Paul Moore <paul@paul-moore.com>
As the LSM framework only supports one LSM initcall callback for each
initcall type, the init_smk_fs() and smack_nf_ip_init() functions were
wrapped with a new function, smack_initcall() that is registered with
the LSM framework.
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Currently the individual LSMs register their own initcalls, and while
this should be harmless, it can be wasteful in the case where a LSM
is disabled at boot as the initcall will still be executed. This
patch introduces support for managing the initcalls in the LSM
framework, and future patches will convert the existing LSMs over to
this new mechanism.
Only initcall types which are used by the current in-tree LSMs are
supported, additional initcall types can easily be added in the future
if needed.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Move away from an init specific init_debug() macro to a more general
lsm_pr()/lsm_pr_cont()/lsm_pr_dbg() set of macros that are available
both before and after init. In the process we do a number of minor
changes to improve the LSM initialization output and cleanup the code
somewhat.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Add function header comments for lsm_static_call_init() and
early_security_init(), tweak the existing comment block for
security_add_hooks().
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
With only security_init() calling lsm_init_ordered, it makes little
sense to keep lsm_init_ordered() as a standalone function. Fold
lsm_init_ordered() into security_init().
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Rename initialize_lsm() to be more consistent with the rest of the LSM
initialization changes and rework the function itself to better fit
with the "exit on fail" coding pattern.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Convert the lsm_blob_size fields to unsigned integers as there is no
current need for them to be negative, change "lsm_set_blob_size()" to
"lsm_blob_size_update()" to better reflect reality, and perform some
other minor cleanups to the associated code.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Rename ordered_lsm_parse() to lsm_order_parse() for the sake of
consistency with the other LSM initialization routines, and also
do some minor rework of the function. Aside from some minor style
decisions, the majority of the rework involved shuffling the order
of the LSM_FLAG_LEGACY and LSM_ORDER_FIRST code so that the
LSM_FLAG_LEGACY checks are handled first; it is important to note
that this doesn't affect the order in which the LSMs are registered.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Rename append_ordered_lsm() to lsm_order_append() to better match
convention and do some rework. The rework includes moving the
LSM_FLAG_EXCLUSIVE logic from lsm_prepare() to lsm_order_append()
in order to consolidate the individual LSM append/activation code,
and adding logic to skip appending explicitly disabled LSMs to the
active LSM list.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In addition to style changes, rename set_enabled() to lsm_enabled_set()
and is_enabled() to lsm_is_enabled() to better fit within the LSM
initialization code.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The LSM currently has a lot of code to maintain a list of the currently
active LSMs in a human readable string, with the only user being the
"/sys/kernel/security/lsm" code. Let's drop all of that code and
generate the string on first use and then cache it for subsequent use.
Signed-off-by: Paul Moore <paul@paul-moore.com>
Move the LSM active count and lsm_id list declarations out of a header
that is visible across the kernel and into a header that is limited to
the LSM framework. This not only helps keep the include/linux headers
smaller and cleaner, it helps prevent misuse of these variables.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Rename the builtin_lsm_order variable to lsm_order_builtin,
chosen_lsm_order to lsm_order_cmdline, chosen_major_lsm to
lsm_order_legacy, ordered_lsms[] to lsm_order[], and exclusive
to lsm_exclusive.
This patch also renames the associated kernel command line parsing
functions and adds some basic function comment blocks. The parsing
function choose_major_lsm() was renamed to lsm_choose_security(),
choose_lsm_order() to lsm_choose_lsm(), and enable_debug() to
lsm_debug_enable().
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Reduce the duplication between the lsm_id struct and the DEFINE_LSM()
definition by linking the lsm_id struct directly into the individual
LSM's DEFINE_LSM() instance.
Linking the lsm_id into the LSM definition also allows us to simplify
the security_add_hooks() function by removing the code which populates
the lsm_idlist[] array and moving it into the normal LSM startup code
where the LSM list is parsed and the individual LSMs are enabled,
making for a cleaner implementation with less overhead at boot.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The new name more closely fits the rest of the naming scheme in
security/lsm_init.c. This patch also adds a trivial comment block to
the top of the function.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
With only one caller of lsm_early_cred() and lsm_early_task(), insert
the functions' code directly into the caller and ger rid of the two
functions.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
There are three common for loop patterns in the LSM initialization code
to loop through the ordered LSM list and the registered "early" LSMs.
This patch implements these loop patterns as macros to help simplify the
code and reduce the chance for errors.
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: John Johansen <john.johhansen@canonical.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Continue to pull code out of security/security.c to help improve
readability by pulling all of the LSM framework initialization
code out into a new file.
No code changes.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In an effort to decompose security/security.c somewhat to make it less
twisted and unwieldy, pull out the LSM notifier code into a new file
as it is fairly well self-contained.
No code changes.
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
The find_user_dsq() function is called from contexts that are already
under RCU read lock protection. Switch from rhashtable_lookup_fast() to
rhashtable_lookup() to avoid redundant RCU locking.
Requires: bee8a520eb ("rhashtable: Use rcu_dereference_all and rcu_dereference_all_check")
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The pnt_seq field and related infrastructure were originally named for
"pick next task sequence", reflecting their original implementation in
scx_next_task_picked(). However, the sequence counter is now incremented in
both put_prev_task_scx() and pick_task_scx() and its purpose is to
synchronize kick operations via SCX_KICK_WAIT, not specifically to track
pick_next_task events.
Rename to better reflect the actual semantics:
- pnt_seq -> kick_sync
- scx_kick_pseqs -> scx_kick_syncs
- pseqs variables -> ksyncs
- Update comments to refer to "kick_sync sequence" instead of "pick_task
sequence"
This is a pure renaming with no functional changes.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete
a reschedule and can be used to implement operations like core scheduling.
This used to be implemented by scx_next_task_picked() incrementing pnt_seq,
which was always called when a CPU picks the next task to run, allowing
SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and
pick the next task.
However, commit b999e365c2 ("sched_ext: Replace scx_next_task_picked()
with switch_class()") replaced scx_next_task_picked() with the
switch_class() callback, which is only called when switching between sched
classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be
reliably incremented unless the previous task was SCX and the next task was
not.
This fix leverages commit 4c95380701 ("sched/ext: Fold balance_scx() into
pick_task_scx()") which refactored the pick path making put_prev_task_scx()
the natural place to track task switches for SCX_KICK_WAIT. The fix moves
pnt_seq increment to put_prev_task_scx() and also increments it in
pick_task_scx() to handle cases where the same task is re-selected, whether
by BPF scheduler decision or slice refill. The semantics: If the current
task on the target CPU is SCX, SCX_KICK_WAIT waits until the CPU enters the
scheduling path. This provides sufficient guarantee for use cases like core
scheduling while keeping the operation self-contained within SCX.
v2: - Also increment pnt_seq in pick_task_scx() to handle same-task
re-selection (Andrea Righi).
- Use smp_cond_load_acquire() for the busy-wait loop for better
architecture optimization (Peter Zijlstra).
Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When a sched_ext scheduler tries to kick a CPU, the CPU may be running a
higher class task. sched_ext has no control over such CPUs. A sched_ext
scheduler couldn't have expected to get access to the CPU after kicking it
anyway. Skip kicking when the target CPU is running a higher class.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Previously, data blocks that perfectly fit the data ring buffer would
get wrapped around to the beginning for no reason since the calculated
offset of the next data block would belong to the next wrap. Since this
offset is not actually part of the data block, but rather the offset of
where the next data block is going to start, there is no reason to
include it when deciding whether the current block fits the buffer.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Tested-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250905144152.9137-2-d-tatianin@yandex-team.ru
[pmladek@suse.com: Updated indentation.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
`kernel::ffi::CStr` was introduced in commit d126d23801 ("rust: str:
add `CStr` type") in November 2022 as an upstreaming of earlier work
that was done in May 2021[0]. That earlier work, having predated the
inclusion of `CStr` in `core`, largely duplicated the implementation of
`std::ffi::CStr`.
`std::ffi::CStr` was moved to `core::ffi::CStr` in Rust 1.64 in
September 2022. Hence replace `kernel::str::CStr` with `core::ffi::CStr`
to reduce our custom code footprint, and retain needed custom
functionality through an extension trait.
Add `CStr` to `ffi` and the kernel prelude.
Link: faa3cbcca0 [0]
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Reviewed-by: Benno Lossin <lossin@kernel.org>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Link: https://patch.msgid.link/20251018-cstr-core-v18-16-9378a54385f8@gmail.com
[ Removed assert that would now depend on the Rust version. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
cded46d971 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old
void-returning version to scx_bpf_dsq_insert___compat, with the expectation
that libbpf would match old binaries to the ___compat variant, maintaining
backward binary compatibility. However, while libbpf ignores ___suffix on
the BPF side when matching symbols, it doesn't do so for kernel-side symbols.
Old binaries compiled with the original scx_bpf_dsq_insert() could no longer
resolve the symbol.
Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old
void-returning interface and add ___v2 to the new bool-returning version.
This allows old binaries to continue working while new code can use the
___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the
___v2 suffix can be dropped when the compat interface is removed.
v2: Use ___v2 instead of ___new.
Fixes: cded46d971 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
Signed-off-by: Tejun Heo <tj@kernel.org>
gdb and kgdb debugging documentation were moved to
Documentation/process/debugging/ as a part of
Commit d5af79c05e ("Documentation: move
dev-tools debugging files to process/debugging/"), but translations/
were not updated. Fix them
Signed-off-by: Ally Heev <allyheev@gmail.com>
Fixes: d5af79c05e ("Documentation: move dev-tools debugging files to process/debugging/")
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251020-aheev-fix-docs-dev-tools-broken-links-v2-1-7db64bf0405a@gmail.com>
Running "make htmldocs" generates the following build error and
warning in trusted-encrypted.rst:
Documentation/security/keys/trusted-encrypted.rst:18: ERROR: Unexpected indentation.
Documentation/security/keys/trusted-encrypted.rst:19: WARNING: Block quote ends without a blank line; unexpected unindent.
Add a blank line before bullet list and fix the indentation of text to
fix the build error and resolve the warning.
Fixes: 38f6880759 ("docs: trusted-encrypted: trusted-keys as protected keys")
Signed-off-by: Gopi Krishna Menon <krishnagopi487@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The cpuset structure has a nr_subparts field which tracks the number
of child local partitions underneath a particular cpuset. Right now,
nr_subparts is only used in partition_is_populated() to avoid iteration
of child cpusets if the condition is right. So by always performing the
child iteration, we can avoid tracking the number of child partitions
and simplify the code a bit.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Sometimes, the result of the rhashtable_lookup() is expected to be found.
Therefore, we can use likely() for such cases.
Following new functions are introduced, which will use likely or unlikely
during the lookup:
rhashtable_lookup_likely
rhltable_lookup_likely
A micro-benchmark is made for these new functions: lookup a existed entry
repeatedly for 100000000 times, and rhashtable_lookup_likely() gets ~30%
speedup.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Commit afddce13ce ("crypto: api - Add reqsize to crypto_alg")
introduced cra_reqsize field in crypto_alg struct to replace type
specific reqsize fields. It looks like this was introduced specifically
for ahash and acomp from the commit description as subsequent commits
add necessary changes in these alg frameworks.
However, this is being recommended for use in all crypto algs
instead of setting reqsize using crypto_*_set_reqsize(). Using
cra_reqsize in aead algorithms, hence, causes memory corruptions and
crashes as the underlying functions in the algorithm framework have not
been updated to set the reqsize properly from cra_reqsize. [1]
Add proper set_reqsize calls in the aead init function to properly
initialize reqsize for these algorithms in the framework.
[1]: https://gist.github.com/Pratham-T/24247446f1faf4b7843e4014d5089f6b
Fixes: afddce13ce ("crypto: api - Add reqsize to crypto_alg")
Signed-off-by: T Pratham <t-pratham@ti.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
- CAAM supports two types of protected keys:
-- Plain key encrypted with ECB
-- Plain key encrypted with CCM
Due to robustness, default encryption used for protected key is CCM.
- Generate protected key blob and add it to trusted key payload.
This is done as part of sealing operation, which is triggered
when below two operations are requested:
-- new key generation
-- load key,
Signed-off-by: Pankaj Gupta <pankaj.gupta@nxp.com>
Signed-off-by: Meenakshi Aggarwal <meenakshi.aggarwal@nxp.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a section in trusted key document describing the protected-keys.
- Detailing need for protected keys.
- Detailing the usage for protected keys.
Signed-off-by: Pankaj Gupta <pankaj.gupta@nxp.com>
Signed-off-by: Meenakshi Aggarwal <meenakshi.aggarwal@nxp.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Refactor pick_task_scx() adding a new argument to forcibly pick a
SCHED_EXT task, ignoring any higher-priority sched class activity.
This refactoring prepares the code for future scenarios, e.g., allowing
the ext dl_server to force a SCHED_EXT task selection.
No functional changes.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Complete the translation of rust/testing.rst and add the testing TOC entry
to rust/index.rst.
Add the translation based on commit a3b2347343
("Documentation: rust: testing: add docs on the new KUnit `#[test]` tests").
Signed-off-by: Ben Guo <benx.guo@gmail.com>
Reviewed-by: Yanteng Si <siyanteng@cqsoftware.com.cm>
Signed-off-by: Alex Shi <alexs@kernel.org>
Sphinx reports htmldocs errors:
Documentation/tools/rtla/common_options.rst:58: ERROR: Undefined substitution referenced: "threshold".
Documentation/tools/rtla/common_options.rst:88: ERROR: Undefined substitution referenced: "tool".
Documentation/tools/rtla/common_options.rst:88: ERROR: Undefined substitution referenced: "thresharg".
Documentation/tools/rtla/common_options.rst:88: ERROR: Undefined substitution referenced: "tracer".
Documentation/tools/rtla/common_options.rst:92: ERROR: Undefined substitution referenced: "tracer".
Documentation/tools/rtla/common_options.rst:98: ERROR: Undefined substitution referenced: "actionsperf".
Documentation/tools/rtla/common_options.rst:113: ERROR: Undefined substitution referenced: "tool".
common_*.rst files are snippets that are intended to be included by rtla
docs (rtla*.rst). common_options.rst in particular contains
substitutions which depend on other common_* includes, so building it
independently as reST source results in above errors.
Rename all common_*.rst files to common_*.txt to prevent Sphinx from
building these snippets as standalone reST source and update all include
references accordingly.
Link: https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#substitutions
Suggested-by: Tomas Glozar <tglozar@redhat.com>
Suggested-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Gopi Krishna Menon <krishnagopi487@gmail.com>
Reviewed-by: Tomas Glozar <tglozar@redhat.com>
Fixes: 05b7e10687 ("tools/rtla: Add remaining support for osnoise actions")
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20251008184522.13201-1-krishnagopi487@gmail.com
[Bagas: massage commit message and apply trailers]
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251013092719.30780-2-bagasdotme@gmail.com>
Paragraphs of function explanation are currently not indented following
their appropriate numbered list item, which causes only the first
paragraph and function prototype code blocks to be indented in the
numbered list in htmldocs output.
Indent the explanation.
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251013095630.34235-3-bagasdotme@gmail.com>
Quoth Mauro:
This series should probably be called:
"Move the trick-or-treat build hacks accumulated over time
into a single place and document them."
as this reflects its main goal. As such:
- it places the jobserver logic on a library;
- it removes sphinx/parallel-wrapper.sh;
- the code now properly implements a jobserver-aware logic
to do the parallelism when called via GNU make, failing back to
"-j" when there's no jobserver;
- converts check-variable-fonts.sh to Python and uses it via
function call;
- drops an extra script to generate man pages, adding a makefile
target for it;
- ensures that return code is 0 when PDF successfully builds;
- about half of the script is comments and documentation.
I tried to do my best to document all tricks that are inside the
script. This way, the docs build steps is now documented.
It should be noticed that it is out of the scope of this series
to change the implementation. Surely the process can be improved,
but first let's consolidate and document everything on a single
place.
Such script was written in a way that it can be called either
directly or via a Makefile. Running outside Makefile is
interesting specially when debug is needed. The command line
interface replaces the need of having lots of env vars before
calling sphinx-build:
$ ./tools/docs/sphinx-build-wrapper --help
usage: sphinx-build-wrapper [-h]
[--sphinxdirs SPHINXDIRS [SPHINXDIRS ...]] [--conf CONF]
[--builddir BUILDDIR] [--theme THEME] [--css CSS] [--paper {,a4,letter}] [-v]
[-j JOBS] [-i] [-V [VENV]]
{cleandocs,linkcheckdocs,htmldocs,epubdocs,texinfodocs,infodocs,mandocs,latexdocs,pdfdocs,xmldocs}
Kernel documentation builder
positional arguments:
{cleandocs,linkcheckdocs,htmldocs,epubdocs,texinfodocs,infodocs,mandocs,latexdocs,pdfdocs,xmldocs}
Documentation target to build
options:
-h, --help show this help message and exit
--sphinxdirs SPHINXDIRS [SPHINXDIRS ...]
Specific directories to build
--conf CONF Sphinx configuration file
--builddir BUILDDIR Sphinx configuration file
--theme THEME Sphinx theme to use
--css CSS Custom CSS file for HTML/EPUB
--paper {,a4,letter} Paper size for LaTeX/PDF output
-v, --verbose place build in verbose mode
-j, --jobs JOBS Sets number of jobs to use with sphinx-build
-i, --interactive Change latex default to run in interactive mode
-V, --venv [VENV] If used, run Sphinx from a venv dir (default dir: sphinx_latest)
the only mandatory argument is the target, which is identical with
"make" targets.
The call inside Makefile doesn't use the last four arguments. They're
there to help identifying problems at the build:
-v makes the output verbose;
-j helps to test parallelism;
-i runs latexmk in interactive mode, allowing to debug PDF
build issues;
-V is useful when testing it with different venvs.
When used with GNU make (or some other make which implements jobserver),
a call like:
make -j <targets> htmldocs
will make the wrapper to automatically use POSIX jobserver to claim
the number of available job slots, calling sphinx-build with a
"-j" parameter reflecting it. ON such case, the default can be
overriden via SPHINXDIRS argument.
Visiable changes when compared with the old behavior:
When V=0, the only visible difference is that:
- pdfdocs target now returns 0 on success, 1 on failures.
This addresses an issue over the current process where we
it always return success even on failures;
- it will now print the name of PDF files that failed to build,
if any.
In verbose mode, sphinx-build-wrapper and sphinx-build command lines
are now displayed.
Mauro says:
In the past, media used Docbook to generate documentation, together
with some logic to ensure that cross-references would match the
actual defined uAPI.
The rationale is that we wanted to automatically check for uAPI
documentation gaps.
The same logic was migrated to Sphinx. Back then, broken links
were reported. However, recent versions of it and/or changes at
conf.py disabled such checks.
The result is that several symbols are now not cross-referenced,
and we don't get warnings anymore when something breaks.
This series consist on 2 parts:
Part 1: extra patches to parse_data_structs.py and kernel_include.py;
Part 2: media documentation fixes.
Improve the suggestions algorithm by using get_close_matches() if
no suggestions with the same name are found. As we're now building
a dict, when the name is identical, but on a different domain,
the search is O(1), making it a lot faster.
The get_close_matches is also fast, as there is just one loop,
instead of 3.
This can be useful to detect typos on references, with could
be the base of a futuere extension that will handle ref unmatches
for the entire build, allowing someone to find typos and fix them.
As difflib and get_close_matches are there since the early
Python 3.x days, we don't need to handle any extra dependencies
to use it.
We're keeping the default values for the search, e.g. n=3, cutoff=0.6.
With that, we now have things like:
$ make SPHINXDIRS="userspace-api/media" htmldocs
...
include/uapi/linux/videodev2.h:199: WARNING: Invalid xref: c:type:`v4l2_memory`. Possible alternatives:
c:type:`v4l2_meta_format` (from v4l/dev-meta)
c:type:`v4l2_rect` (from v4l/dev-overlay)
c:type:`v4l2_area` (from v4l/ext-ctrls-image-source) [ref.missing]
...
include/uapi/linux/videodev2.h:1985: WARNING: Invalid xref: c:type:`V4L.v4l2_queryctrl`. Possible alternatives:
std🏷️`v4l2-queryctrl` (from v4l/vidioc-queryctrl)
std🏷️`v4l2-query-ext-ctrl` (from v4l/vidioc-queryctrl)
At the first example, it was not a typo, but a symbol that doesn't
seem to be properly documented. The second example points to
v4l2-queryctrl, which is a close match for the symbol.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <7365feb74cbdd6b982c87baf5863360ab98cf727.1759329363.git.mchehab+huawei@kernel.org>
All we wanted were to have a way to link the comprehensive
documentation with the actual symbols parsed from the header
file, as this helps to identify any broken references.
Use the new :toc: flag for media controller and enable warnings.
Here, we need to adjust the exceptions file to setup the C
namespace accordingly.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <5c8a87be712397563fc8ca914c3d92fe675e4071.1759329363.git.mchehab+huawei@kernel.org>
Except for two exception rules and dmx.h, the other files
are already handling properly cross references.
Fix the two exception rules for frontend.h, as those are
false-positives:
include/uapi/linux/dvb/frontend.h:959: WARNING: can't link to: c:type:: FE_GET_PROPERTY
include/uapi/linux/dvb/frontend.h:933: WARNING: can't link to: c:func:: FE_SET_FRONTEND_TUNE_MODE
The dmx.h are actual issues that will require an extra
patch to fill gaps.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <9ae6556ebd47de4f066a149ab0bbe7ce27acf2c4.1759329363.git.mchehab+huawei@kernel.org>
C domain supports a ".. c:namespace::" tag that allows setting a
symbol namespace. This is used within the kernel to avoid warnings
about duplicated symbols. This is specially important for syscalls,
as each subsystem may have their own documentation for them.
This is specially true for ioctl.
When such tag is used, all C domain symbols have c++ style,
e.g. they'll become "{namespace}.{reference}".
Allow specifying C namespace at the exception files, avoiding
the need of override rules for every symbol.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <cc27ec60ceb3bdac4197fb7266d2df8155edacda.1759329363.git.mchehab+huawei@kernel.org>
Specially when using c::namespace, it is not hard to break
a reference by forgetting to add a domain. Also, different
cases and using "-"/"_" the wrong way are typical cases that
people often gets wrong.
We might use a more complex logic here to also check for typos,
but let's keep it plain, simple.
This is enough to get thos exeptions from media controller:
.../include/uapi/linux/media.h:26: WARNING: Invalid xref: c:type:`media_device_info`. Possible alternatives:
c:type:`MC.media_device_info` (from mediactl/media-ioc-device-info)
.../include/uapi/linux/media.h:149: WARNING: Invalid xref: c:type:`media_entity_desc`. Possible alternatives:
c:type:`MC.media_entity_desc` (from mediactl/media-ioc-enum-entities)
.../include/uapi/linux/media.h:228: WARNING: Invalid xref: c:type:`media_link_desc`. Possible alternatives:
c:type:`MC.media_link_desc` (from mediactl/media-ioc-enum-links)
.../include/uapi/linux/media.h:235: WARNING: Invalid xref: c:type:`media_links_enum`. Possible alternatives:
c:type:`MC.media_links_enum` (from mediactl/media-ioc-enum-links)
.../include/uapi/linux/media.h:212: WARNING: Invalid xref: c:type:`media_pad_desc`. Possible alternatives:
c:type:`MC.media_pad_desc` (from mediactl/media-ioc-enum-links)
.../include/uapi/linux/media.h:298: WARNING: Invalid xref: c:type:`media_v2_entity`. Possible alternatives:
c:type:`MC.media_v2_entity` (from mediactl/media-ioc-g-topology)
.../include/uapi/linux/media.h:312: WARNING: Invalid xref: c:type:`media_v2_interface`. Possible alternatives:
c:type:`MC.media_v2_interface` (from mediactl/media-ioc-g-topology)
.../include/uapi/linux/media.h:307: WARNING: Invalid xref: c:type:`media_v2_intf_devnode`. Possible alternatives:
c:type:`MC.media_v2_intf_devnode` (from mediactl/media-ioc-g-topology)
.../include/uapi/linux/media.h:341: WARNING: Invalid xref: c:type:`media_v2_link`. Possible alternatives:
c:type:`MC.media_v2_link` (from mediactl/media-ioc-g-topology)
.../include/uapi/linux/media.h:333: WARNING: Invalid xref: c:type:`media_v2_pad`. Possible alternatives:
c:type:`MC.media_v2_pad` (from mediactl/media-ioc-g-topology)
.../include/uapi/linux/media.h:349: WARNING: Invalid xref: c:type:`media_v2_topology`. Possible alternatives:
c:type:`MC.media_v2_topology` (from mediactl/media-ioc-g-topology)
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <4c75d277e950e619ea00ba2dea336853a4aac976.1759329363.git.mchehab+huawei@kernel.org>
When used in practice, one may want to have multiple header
files on a single rst file, like:
***********************
Digital TV uAPI symbols
***********************
.. contents:: Table of Contents
:depth: 2
:local:
Frontend
========
.. kernel-include:: include/uapi/linux/dvb/frontend.h
:generate-cross-refs:
:toc:
Demux
=====
.. kernel-include:: include/uapi/linux/dvb/dmx.h
:generate-cross-refs:
:toc:
...
So, don't add ..contents:: here.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <4bf353e5248133a3b0abd82519a38453402fe7c6.1759329363.git.mchehab+huawei@kernel.org>
drgb_kcapi_sym() always returns 0, so make it return void instead.
Consequently, make drbg_ctr_bcc() return void too.
Found by Linux Verification Center (linuxtesting.org) with the Svace static
analysis tool.
[Sergey: fixed the subject, refreshed the patch]
Signed-off-by: Karina Yankevich <k.yankevich@omp.ru>
Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
When authenc is invoked with MAY_BACKLOG, it needs to pass EINPROGRESS
notifications back up to the caller when the underlying algorithm
returns EBUSY synchronously.
However, if the EBUSY comes from the second part of an authenc call,
i.e., it is asynchronous, both the EBUSY and the subsequent EINPROGRESS
notification must not be passed to the caller.
Implement this by passing a mask to the function that starts the
second half of authenc and using it to determine whether EBUSY
and EINPROGRESS should be passed to the caller.
This was a deficiency in the original implementation of authenc
because it was not expected to be used with MAY_BACKLOG.
Reported-by: Ingo Franzki <ifranzki@linux.ibm.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 180ce7e810 ("crypto: authenc - Add EINPROGRESS check")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Several crypto user API contexts and requests allocated with
sock_kmalloc() were left uninitialized, relying on callers to
set fields explicitly. This resulted in the use of uninitialized
data in certain error paths or when new fields are added in the
future.
The ACVP patches also contain two user-space interface files:
algif_kpp.c and algif_akcipher.c. These too rely on proper
initialization of their context structures.
A particular issue has been observed with the newly added
'inflight' variable introduced in af_alg_ctx by commit:
67b164a871 ("crypto: af_alg - Disallow multiple in-flight AIO requests")
Because the context is not memset to zero after allocation,
the inflight variable has contained garbage values. As a result,
af_alg_alloc_areq() has incorrectly returned -EBUSY randomly when
the garbage value was interpreted as true:
https://github.com/gregkh/linux/blame/master/crypto/af_alg.c#L1209
The check directly tests ctx->inflight without explicitly
comparing against true/false. Since inflight is only ever set to
true or false later, an uninitialized value has triggered
-EBUSY failures. Zero-initializing memory allocated with
sock_kmalloc() ensures inflight and other fields start in a known
state, removing random issues caused by uninitialized data.
Fixes: fe869cdb89 ("crypto: algif_hash - User-space interface for hash operations")
Fixes: 5afdfd22e6 ("crypto: algif_rng - add random number generator support")
Fixes: 2d97591ef4 ("crypto: af_alg - consolidation of duplicate code")
Fixes: 67b164a871 ("crypto: af_alg - Disallow multiple in-flight AIO requests")
Cc: stable@vger.kernel.org
Signed-off-by: Shivani Agarwal <shivani.agarwal@broadcom.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The HW RNG core allows for manual selection of which RNG device to use,
but does not allow for no device to be enabled. It may be desirable to
do this on systems with only a single suitable hardware RNG, where we
need exclusive access to other functionality on this device. In
particular when performing TPM firmware upgrades this lets us ensure the
kernel does not try to access the device.
Before:
root@debian-qemu-efi:~# grep "" /sys/devices/virtual/misc/hw_random/rng_*
/sys/devices/virtual/misc/hw_random/rng_available:tpm-rng-0
/sys/devices/virtual/misc/hw_random/rng_current:tpm-rng-0
/sys/devices/virtual/misc/hw_random/rng_quality:1024
/sys/devices/virtual/misc/hw_random/rng_selected:0
After:
root@debian-qemu-efi:~# grep "" /sys/devices/virtual/misc/hw_random/rng_*
/sys/devices/virtual/misc/hw_random/rng_available:tpm-rng-0 none
/sys/devices/virtual/misc/hw_random/rng_current:tpm-rng-0
/sys/devices/virtual/misc/hw_random/rng_quality:1024
/sys/devices/virtual/misc/hw_random/rng_selected:0
root@debian-qemu-efi:~# echo none > /sys/devices/virtual/misc/hw_random/rng_current
root@debian-qemu-efi:~# grep "" /sys/devices/virtual/misc/hw_random/rng_*
/sys/devices/virtual/misc/hw_random/rng_available:tpm-rng-0 none
/sys/devices/virtual/misc/hw_random/rng_current:none
grep: /sys/devices/virtual/misc/hw_random/rng_quality: No such device
/sys/devices/virtual/misc/hw_random/rng_selected:1
(Observe using bpftrace no calls to TPM being made)
root@debian-qemu-efi:~# echo "" > /sys/devices/virtual/misc/hw_random/rng_current
root@debian-qemu-efi:~# grep "" /sys/devices/virtual/misc/hw_random/rng_*
/sys/devices/virtual/misc/hw_random/rng_available:tpm-rng-0 none
/sys/devices/virtual/misc/hw_random/rng_current:tpm-rng-0
/sys/devices/virtual/misc/hw_random/rng_quality:1024
/sys/devices/virtual/misc/hw_random/rng_selected:0
(Observe using bpftrace that calls to the TPM resume)
Signed-off-by: Jonathan McDowell <noodles@meta.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As kcalloc() may fail, check its return value to avoid a NULL pointer
dereference when passing the buffer to rng->read(). On allocation
failure, log the error and return since test_len() returns void.
Fixes: 2be0d806e2 ("crypto: caam - add a test for the RNG")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the Devicetree binding documentation for microchip,pic32mzda-rng
from plain text to YAML.
Signed-off-by: Kael D'Alcamo <dev@kael-k.io>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Replace simple_strtoul() with the recommended kstrtouint() for parsing
the 'hifn_pll_ref=' module parameter. Unlike simple_strtoul(), which
returns an unsigned long, kstrtouint() converts the string directly to
an unsigned integer and avoids implicit casting.
Check the return value of kstrtouint() and fall back to 66 MHz if
parsing fails. This adds error handling while preserving existing
behavior for valid values, and removes use of the deprecated
simple_strtoul() helper.
Add a space in the log message to correctly format "66 MHz" while we're
at it.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Replace simple_strtol() with the recommended kstrtoint() for parsing the
'fips=' boot parameter. Unlike simple_strtol(), which returns a long,
kstrtoint() converts the string directly to an integer and avoids
implicit casting.
Check the return value of kstrtoint() and reject invalid values. This
adds error handling while preserving existing behavior for valid values,
and removes use of the deprecated simple_strtol() helper.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Versal TRNG IP does not support Derivation Function (DF) of seed.
Add DF processing for CTR_DRBG mode.
Signed-off-by: Harsh Jain <h.jain@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Export drbg_ctr_df() derivative function to new module df_sp80090.
Signed-off-by: Harsh Jain <h.jain@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull in tip/sched/core to receive:
50653216e4 ("sched: Add support to pick functions to take rf")
4c95380701 ("sched/ext: Fold balance_scx() into pick_task_scx()")
which will enable clean integration of DL server support among other things.
This conflicts with the following from sched_ext/for-6.18-fixes:
a8ad873113 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")
which adds maybe_queue_balance_callback() to balance_scx() which is removed
by 50653216e4. Resolve by moving the invocation to pick_task_scx() in the
equivalent location.
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull sched_ext/for-6.18-fixes to sync trees to receive:
05e63305c8 ("sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads")
to avoid conflicts with planned cgroup sub-sched support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Linux systems often use FUSE for several different purposes, where the
contents of some FUSE instances can be of more interest for auditing
than others.
Allow distinguishing between them based on the filesystem subtype
(s_subtype) using the new condition "fs_subtype".
The subtype string is supplied by userspace FUSE daemons
when a FUSE connection is initialized, so policy authors who want to
filter based on subtype need to ensure that FUSE mount operations are
sufficiently audited or restricted.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
"measure", "appraise" and "hash" actions all have corresponding "dont_*"
actions, but "audit" currently lacks that. This means it is not
currently possible to have a policy that audits everything by default,
but excludes specific cases.
This seems to have been an oversight back when the "audit" action was
added.
Add a corresponding "dont_audit" action to enable such uses.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
This commit adds two tests. The first is the most basic unit test:
make sure an empty queue peeks as empty, and when we put one element
in the queue, make sure peek returns that element.
However, even this simple test is a little complicated by the different
behavior of scx_bpf_dsq_insert in different calling contexts:
- insert is for direct dispatch in enqueue
- insert is delayed when called from select_cpu
In this case we split the insert and the peek that verifies the
result between enqueue/dispatch.
Note: An alternative would be to call `scx_bpf_dsq_move_to_local` on an
empty queue, which in turn calls `flush_dispatch_buf`, in order to flush
the buffered insert. Unfortunately, this is not viable within the
enqueue path, as it attempts a voluntary context switch within an RCU
read-side critical section.
The second test is a stress test that performs many peeks on all DSQs
and records the observed tasks.
Signed-off-by: Ryan Newton <newton@meta.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The builtin DSQ queue data structures are meant to be used by a wide
range of different sched_ext schedulers with different demands on these
data structures. They might be per-cpu with low-contention, or
high-contention shared queues. Unfortunately, DSQs have a coarse-grained
lock around the whole data structure. Without going all the way to a
lock-free, more scalable implementation, a small step we can take to
reduce lock contention is to allow a lockless, small-fixed-cost peek at
the head of the queue.
This change allows certain custom SCX schedulers to cheaply peek at
queues, e.g. during load balancing, before locking them. But it
represents a few extra memory operations to update the pointer each
time the DSQ is modified, including a memory barrier on ARM so the write
appears correctly ordered.
This commit adds a first_task pointer field which is updated
atomically when the DSQ is modified, and allows any thread to peek at
the head of the queue without holding the lock.
Signed-off-by: Ryan Newton <newton@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since v4.1 kernel, a new interface for ftrace called "tracefs" was
introduced, which is usually mounted in /sys/kernel/tracing. Therefore,
tracing files can now be accessed via either the legacy path
/sys/kernel/debug/tracing or the newer path /sys/kernel/tracing.
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
When there is only one function of the same name, old_sympos of 0 and 1
are logically identical. Match them in klp_find_func().
This is to avoid a corner case with different toolchain behavior.
In this specific issue, two versions of kpatch-build were used to
build livepatch for the same kernel. One assigns old_sympos == 0 for
unique local functions, the other assigns old_sympos == 1 for unique
local functions. Both versions work fine by themselves. (PS: This
behavior change was introduced in a downstream version of kpatch-build.
This change does not exist in upstream kpatch-build.)
However, during livepatch upgrade (with the replace flag set) from a
patch built with one version of kpatch-build to the same fix built with
the other version of kpatch-build, livepatching fails with errors like:
[ 14.218706] sysfs: cannot create duplicate filename 'xxx/somefunc,1'
...
[ 14.219466] Call Trace:
[ 14.219468] <TASK>
[ 14.219469] dump_stack_lvl+0x47/0x60
[ 14.219474] sysfs_warn_dup.cold+0x17/0x27
[ 14.219476] sysfs_create_dir_ns+0x95/0xb0
[ 14.219479] kobject_add_internal+0x9e/0x260
[ 14.219483] kobject_add+0x68/0x80
[ 14.219485] ? kstrdup+0x3c/0xa0
[ 14.219486] klp_enable_patch+0x320/0x830
[ 14.219488] patch_init+0x443/0x1000 [ccc_0_6]
[ 14.219491] ? 0xffffffffa05eb000
[ 14.219492] do_one_initcall+0x2e/0x190
[ 14.219494] do_init_module+0x67/0x270
[ 14.219496] init_module_from_file+0x75/0xa0
[ 14.219499] idempotent_init_module+0x15a/0x240
[ 14.219501] __x64_sys_finit_module+0x61/0xc0
[ 14.219503] do_syscall_64+0x5b/0x160
[ 14.219505] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 14.219507] RIP: 0033:0x7f545a4bd96d
...
[ 14.219516] kobject: kobject_add_internal failed for somefunc,1 with
-EEXIST, don't try to register things with the same name ...
This happens because klp_find_func() thinks somefunc with old_sympos==0
is not the same as somefunc with old_sympos==1, and klp_add_object_nops
adds another xxx/func,1 to the list of functions to patch.
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
[pmladek@suse.com: Fixed some typos.]
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Implement the missing cgroup_set_idle() callback that was marked as a
TODO. This allows BPF schedulers to be notified when a cgroup's idle
state changes, enabling them to adjust their scheduling behavior
accordingly.
The implementation follows the same pattern as other cgroup callbacks
like cgroup_set_weight() and cgroup_set_bandwidth(). It checks if the
BPF scheduler has implemented the callback and invokes it with the
appropriate parameters.
Fixes a spelling error in the cgroup_set_bandwidth() documentation.
tj: s/scx_cgroup_rwsem/scx_cgroup_ops_rwsem/ to fix build breakage.
Signed-off-by: zhidao su <soolaugust@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The big picture section of 2.Process.rst currently hardcodes major
version number to 5 since fb0e0ffe7f ("Documentation: bring process
docs up to date"). As it can get outdated when it is actually
incremented (the recent is 6 and will be 7 in the near future),
arbitrarily bump it to 9, giving a headroom for a decade.
Note that the version number examples are kept to illustrate the
numbering scheme.
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
[jc: tweaked the initial 9.x mention slightly]
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20250922074219.26241-1-bagasdotme@gmail.com>
Sync the ja_JP translation with the following upstream commits:
commit 8401aa1f59 ("Documentation/SubmittingPatches: describe the Fixes: tag")
commit 19c3fe285c ("docs: Explicitly state that the 'Fixes:' tag shouldn't split lines")
commit 5b5bbb8cc5 ("docs: process: Add an example for creating a fixes tag")
commit 6356f18f09 ("Align git commit ID abbreviation guidelines and checks")
The mix of plain text and reST markup for ``git bisect`` is intentional to
align with the eventual reST conversion.
Signed-off-by: Akiyoshi Kurita <weibu@redadmin.org>
Reviewed-by: Akira Yokosawa <akiyks@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20250924192426.2743495-1-weibu@redadmin.org>
Since the Handover Protocol was deprecated, the recommended approach is
to provide an initrd using a UEFI boot service with the
LINUX_EFI_INITRD_MEDIA_GUID device path. Documentation for the new
approach has been no more than an admonition with a link to an existing
implementation.
Provide a short explanation of this functionality, to ease future
implementations without having to reverse engineer existing ones.
[Bagas: Don't use :ref: link to EFI stub documentation and refer to
OVMF/edk2 implementation]
Signed-off-by: Hugo Osvaldo Barrera <hugo@whynothugo.nl>
Link: https://lore.kernel.org/r/20250428131206.8656-2-hugo@whynothugo.nl
Co-developed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <20251013085718.27085-1-bagasdotme@gmail.com>
for sub-scheduler authority checks. Add compat wrappers which fall back to
direct p->scx field writes on older kernels.
Suggested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In preparation for hierarchical schedulers, change scx_bpf_dsq_insert() and
scx_bpf_dsq_insert_vtime() to return bool instead of void. With
sub-schedulers, there will be no reliable way to guarantee a task is still
owned by the sub-scheduler at insertion time (e.g., the task may have been
migrated to another scheduler). The bool return value will enable
sub-schedulers to detect and gracefully handle insertion failures.
For the root scheduler, insertion failures will continue to trigger scheduler
abort via scx_error(), so existing code doesn't need to check the return
value. Backward compatibility is maintained through compat wrappers.
Also update scx_bpf_dsq_move() documentation to clarify that it can return
false for sub-schedulers when @dsq_id points to a disallowed local DSQ.
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and() currently have 5
parameters. An upcoming change will add aux__prog parameter which will exceed
BPF's 5 argument limit.
Prepare by adding new kfuncs __scx_bpf_dsq_insert_vtime() and
__scx_bpf_select_cpu_and() that take args structs. The existing kfuncs are
kept as compatibility wrappers. BPF programs use inline wrappers that detect
kernel API version via bpf_core_type_exists() and use the new struct-based
kfuncs when available, falling back to compat kfuncs otherwise. This allows
BPF programs to work with both old and new kernels.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
With the planned hierarchical scheduler support, sub-schedulers will need to
be verified for authority before being allowed to modify task->scx.slice and
task->scx.dsq_vtime. Add scx_bpf_task_set_slice() and
scx_bpf_task_set_dsq_vtime() which will perform the necessary permission
checks.
Root schedulers can still directly write to these fields, so this doesn't
affect existing schedulers.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Enough time has passed since the introduction of scx_bpf_task_cgroup() and
the scx_bpf_dispatch* -> scx_bpf_dsq* kfunc renaming. Strip the compatibility
macros.
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
There is no need to complete the entire scx initialization if a
scheduler is failing to be attached due to a hotplug event.
Exit early to avoid unnecessary work and simplify the attach flow.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since commit 56305aa9b6 ("exec: Compute file based creds only once"), the
credentials to be applied to the process after execution are not calculated
anymore for each step of finding intermediate interpreters (including the
final binary), but only after the final binary to be executed without
interpreter has been found.
In particular, that means that the bprm_check_security LSM hook will not
see the updated cred->e[ug]id for the intermediate and for the final binary
to be executed, since the function doing this task has been moved from
prepare_binprm(), which calls the bprm_check_security hook, to
bprm_creds_from_file().
This breaks the IMA expectation for the CREDS_CHECK hook, introduced with
commit d906c10d8a ("IMA: Support using new creds in appraisal policy"),
which expects to evaluate "the credentials that will be committed when the
new process is started". This is clearly not the case for the CREDS_CHECK
IMA hook, which is attached to bprm_check_security.
This issue does not affect systems which load a policy with the BPRM_CHECK
hook with no other criteria, as is the case with the built-in "tcb" and/or
"appraise_tcb" IMA policies. The "tcb" built-in policy measures all
executions regardless of the new credentials, and the "appraise_tcb" policy
is written in terms of the file owner, rather than IMA hooks.
However, it does affect systems without a BPRM_CHECK policy rule or with a
BPRM_CHECK policy rule that does not include what CREDS_CHECK evaluates. As
an extreme example, taking a standalone rule like:
measure func=CREDS_CHECK euid=0
This will not measure for example sudo (because CREDS_CHECK still sees the
bprm->cred->euid set to the regular user UID), but only the subsequent
commands after the euid was applied to the children.
Make set[ug]id programs measured/appraised again by splitting
ima_bprm_check() in two separate hook implementations (CREDS_CHECK now
being implemented by ima_creds_check()), and by attaching CREDS_CHECK to
the bprm_creds_from_file LSM hook.
The limitation of this approach is that CREDS_CHECK will not be invoked
anymore for the intermediate interpreters, like it was before, but only for
the final binary. This limitation can be removed only by reverting commit
56305aa9b6 ("exec: Compute file based creds only once").
Link: https://github.com/linux-integrity/linux/issues/3
Fixes: 56305aa9b6 ("exec: Compute file based creds only once")
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Jann Horn <jannh@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
wstatus is allowed to be NULL. Avoid a segmentation fault in this case.
Fixes: 0c89abf5ab ("tools/nolibc: implement waitpid() in terms of waitid()")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Translate .../scsi/sd-parameters.rst into Chinese.
Add sd-parameters into .../scsi/index.rst.
Update the translation through commit d835971b2b
("scsi: docs: convert sd-parameters.txt to ReST")
Signed-off-by: doubled <doubled@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Translate .../scsi/link_power_management_policy.rst into Chinese.
Add link_power_management_policy into .../scsi/index.rst.
Update the translation through commit cbbc70a8cd
("scsi: docs: convert link_power_management_policy.txt to ReST")
Signed-off-by: doubled <doubled@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Translate .../scsi/scsi-parameters.rst into Chinese.
Add scsi-parameters into .../scsi/index.rst.
Update the translation through commit bdb39c9509
("Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi")
Signed-off-by: doubled <doubled@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Translate .../scsi/scsi_eh.rst into Chinese.
Add scsi_eh into .../scsi/index.rst.
Update the translation through commit a9dcee18a2
("scsi: documentation: scsi_eh: updates for EH changes")
Signed-off-by: doubled <doubled@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Translate .../scsi/scsi_mid_low_api.rst into Chinese.
Add scsi_mid_low_api into .../scsi/index.rst.
Update the translation through commit 73349697fd
("scsi: docs: Clean up some style in scsi_mid_low_api")
Signed-off-by: doubled <doubled@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Translate .../scsi/scsi.rst into Chinese.
Add scsi into .../scsi/index.rst.
Update the translation through commit c4e672ac8c
("scsi: docs: introduction: Multiple cleanups")
Signed-off-by: doubled <doubled@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Translate .../scsi/index.rst into Chinese and
update subsystem-apis.rst translation
Update the translation through commit 682b07d2ff
("scsi: docs: Organize the SCSI documentation")
Signed-off-by: doubled <doubled@leap-io-kernel.com>
Signed-off-by: Alex Shi <alexs@kernel.org>
Update the translated rust/index.rst with new contents,
and add a reference label in rust/general-information.rst so
that index.rst can link to it properly.
Fixes in rust/index.rst:
- Fixed broken quick-start.rst cross-reference
Update the translation through commit d0b343605f
("kernel-docs: Add new section for Rust learning materials")
Signed-off-by: Ben Guo <benx.guo@gmail.com>
Reviewed-by: Yanteng Si <si.yanteng@linux.dev>
Signed-off-by: Alex Shi <alexs@kernel.org>
As reported by Randy, running make htmldocs on a machine
without textlive now produce warnings:
$ make O=DOCS htmldocs
../Documentation/Makefile:70: warning: overriding recipe for target 'pdfdocs'
../Documentation/Makefile:61: warning: ignoring old recipe for target 'pdfdocs'
That's because the code has now two definitions for pdfdocs in
case $PDFLATEX command is not found. With the new script, such
special case is not needed anymore, as the script checks it.
Drop the special case. Even after dropping it, on a machine
without LaTeX, it will still produce an error as expected,
as running:
$ ./tools/docs/sphinx-build-wrapper pdfdocs
Error: pdflatex or latexmk required for PDF generation
does the check. After applying the patch we have:
$ make SPHINXDIRS=peci htmldocs
Using alabaster theme
Using Python kernel-doc
$ make SPHINXDIRS=peci pdfdocs
Error: pdflatex or latexmk required for PDF generation
make[2]: *** [Documentation/Makefile:64: pdfdocs] Error 1
make[1]: *** [/root/Makefile:1808: pdfdocs] Error 2
make: *** [Makefile:248: __sub-make] Error 2
Which is the expected behavior.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/linux-doc/e7c29532-71de-496b-a89f-743cef28736e@infradead.org/
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <cd16a7436a510116ef87cd4abbb1f3cfe358012f.1759328070.git.mchehab+huawei@kernel.org>
On recent versions of openSUSE Tumbleweed, sphinx-buildis is no longer
a Python script, but something else. Such change is due to how
it now handles alternatives:
https://en.opensuse.org/openSUSE:Migrating_to_libalternatives_with_alts
The most common approach that distros use for alternatives is via
symlinks:
lrwxrwxrwx 1 root root 22 out 31 2024 /usr/bin/java -> /etc/alternatives/java
lrwxrwxrwx 1 root root 37 mar 5 2025 /etc/alternatives/java -> /usr/lib/jvm/java-21-openjdk/bin/java
With such approach, one can sun the script with either:
<sphinx>
python3 <script>
However, openSUSE's implementation uses an ELF binary (/usr/bin/alts),
which breaks the latter format.
It is needed to allow users to specify the Python version to be
used while building docs, as some distros like Leap 15.x are
shipped with:
- older, unsupported python3/python3-sphinx packages;
- more modern python3xx/python3xx-sphinx packages that work
properly.
On such distros, building docs require running make with:
make PYTHON3=python3.11 htmldocs
Heh, even on more moderen distros where python3-sphinx
is supported, one may still want to use a newer package,
for instance, due to performance issues, as:
- with Python < 3.11, kernel-doc is 3 times slower;
- while building htmldocs with Python 3.13/Sphinx 8.x
takes about 3 minutes on a modern machine, using
Sphinx < 8.0 can take up to 16 minutes to build docs
(7.x are the worse ones and require lots of RAM).
So, even with not too old distros, one still may want to use
for instance PYTHON3=python3.11.
To acommodate using PYTHON3 without breaking on Tumbleweed,
add a workaround that will only use:
$(PYTHON3) sphinx-build
if PYTHON3 env var is not default.
While here, drop the special check for venv, as, with venv,
we can just call sphinx-build directly without any extra
checks.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/all/883df949-0281-4a39-8745-bcdcce3a5594@infradead.org/
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <8917f862e0b8484c68408c274129c9f37a7aefb4.1758539031.git.mchehab+huawei@kernel.org>
The code here was meant to handle 3 functions:
1. allow having a separate conf.py file, per subdir;
2. generate a list of latex documents.
3. set "subproject" tag if SPHINXDIRS points to a subdir.
We don't have (1) anymore, and (3) is now properly handled
entirely inside conf.py.
So, only (3) is still needed, and this is a single-line change
at conf.py.
So, drop it, moving the remaining code to conf.py.
While here, drop a duplicated $(RUSTDOC) command-line argument.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Message-ID: <ec998f9f268a401ca6aa36e3221d39c97efeccaa.1758361087.git.mchehab+huawei@kernel.org>
translate the "timestamping.rst" into Simplified Chinese.
Update the translation through commit d5c17e3654
("docs: networking: timestamping: improve stacked PHC sentence")
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Sun yuxi <sun.yuxi@zte.com.cn>
Reviewed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
fix the format of proofreader for
Documentation/translations/zh_CN/filesystems/gfs2-uevents.rst
Documentation/translations/zh_CN/filesystems/gfs2.rst
Documentation/translations/zh_CN/filesystems/ubifs-authentication.rst
Documentation/translations/zh_CN/filesystems/ubifs.rst
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
Before this patch, building htmldocs on opensuseLEAP works
fine:
# make htmldocs
Available Python versions:
/usr/bin/python3.11
Python 3.6.15 not supported. Changing to /usr/bin/python3.11
Python 3.6.15 not supported. Changing to /usr/bin/python3.11
Using alabaster theme
Using Python kernel-doc
...
As the logic detects that Python 3.6 is too old and recommends
intalling python311-Sphinx. If installed, documentation builds
work like a charm.
Yet, some develpers complained that running python3.11 instead
of python3 should not happen. So, let's break the build to make
them happier:
$ make htmldocs
Python 3.6.15 not supported. Bailing out
You could run, instead:
/usr/bin/python3.11 tools/docs/sphinx-build-wrapper htmldocs \
--sphinxdirs=. --conf=conf.py --builddir=Documentation/output --theme= --css= \
--paper=
Python 3.6.15 not supported. Bailing out
make[2]: *** [Documentation/Makefile:76: htmldocs] Error 1
make[1]: *** [Makefile:1806: htmldocs] Error 2
make: *** [Makefile:248: __sub-make] Error 2
It should be noticed that:
1. after this change, sphinx-pre-install needs to be called
by hand:
$ /usr/bin/python3.11 tools/docs/sphinx-pre-install
Detected OS: openSUSE Leap 15.6.
Sphinx version: 7.2.6
All optional dependencies are met.
Needed package dependencies are met.
2. sphinx-build-wrapper will auto-detect python3.11 and
suggest a way to build the docs using the parameters passed
via make variables. In this specific example:
/usr/bin/python3.11 tools/docs/sphinx-build-wrapper htmldocs --sphinxdirs=. --conf=conf.py --theme= --css= --paper=
3. As this needs to be executed outside docs Makefile, it won't run
the validation check scripts nor build Rust documentation if
enabled, as the extra scripts are part of the docs Makefile.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <0635c311295300e9fb48c0ea607e2408910036e3.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Simplify even further the docs Makefile by moving rust build logic
to the wrapper.
After this change, running make on an environment with rust enabled
works as expected.
With CONFIG_RUST:
$ make O=/tmp/foo LLVM=1 SPHINXDIRS=peci htmldocs
make[1]: Entrando no diretório '/tmp/foo'
Using alabaster theme
Using Python kernel-doc
GEN Makefile
DESCEND objtool
CC arch/x86/kernel/asm-offsets.s
INSTALL libsubcmd_headers
CALL /new_devel/docs/scripts/checksyscalls.sh
RUSTC L rust/core.o
BINDGEN rust/bindings/bindings_generated.rs
BINDGEN rust/bindings/bindings_helpers_generated.rs
...
Without it:
$ make SPHINXDIRS=peci htmldocs
Using alabaster theme
Using Python kernel-doc
Both work as it is it is supposed to do.
After the change, it is also possible to build directly with the script
by passing "--rustodoc".
if CONFIG_RUST, this works fine:
$ ./tools/docs/sphinx-build-wrapper --sphinxdirs peci --rustdoc -- htmldocs
Using alabaster theme
Using Python kernel-doc
SYNC include/config/auto.conf
...
RUSTC L rust/core.o
...
If not, it will produce a warning that RUST may be disabled:
$ ./tools/docs/sphinx-build-wrapper --sphinxdirs peci --rustdoc -- htmldocs
Using alabaster theme
Using Python kernel-doc
***
*** Configuration file ".config" not found!
***
*** Please run some configurator (e.g. "make oldconfig" or
*** "make menuconfig" or "make xconfig").
***
make[1]: *** [/new_devel/docs/Makefile:829: .config] Error 1
make: *** [Makefile:248: __sub-make] Error 2
Ignored errors when building rustdoc: Command '['make', 'LLVM=1', 'rustdoc']' returned non-zero exit status 2.. Is RUST enabled?
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <fa1235ccf859f6ebfeef7ffba0ebde2015a75042.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
When running kernel-doc over multiple documents, it emits
one error message per file with is not what we want:
$ python3.6 scripts/kernel-doc.py . --none
...
Warning: ./include/trace/events/swiotlb.h:0 Python 3.7 or later is required for correct results
Warning: ./include/trace/events/iommu.h:0 Python 3.7 or later is required for correct results
Warning: ./include/trace/events/sock.h:0 Python 3.7 or later is required for correct results
...
Change the logic to warn it only once at the library:
$ python3.6 scripts/kernel-doc.py . --none
Warning: Python 3.7 or later is required for correct results
Warning: ./include/cxl/features.h:0 Python 3.7 or later is required for correct results
When running from command line, it warns twice, but that sounds
ok.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <68e54cf8b1201d1f683aad9bc710a99421910356.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
While cross-references are complex, as related ones can be on
different files, we can at least correlate the ones that belong
to the same file, adding a SEE ALSO section for them.
The result is not bad. See for instance:
$ tools/docs/sphinx-build-wrapper --sphinxdirs driver-api/media -- mandocs
$ man Documentation/output/driver-api/man/edac_pci_add_device.9
edac_pci_add_device(9) Kernel Hacker's Manual edac_pci_add_device(9)
NAME
edac_pci_add_device - Insert the 'edac_dev' structure into the
edac_pci global list and create sysfs entries associated with
edac_pci structure.
SYNOPSIS
int edac_pci_add_device (struct edac_pci_ctl_info *pci , int
edac_idx );
ARGUMENTS
pci pointer to the edac_device structure to be added to
the list
edac_idx A unique numeric identifier to be assigned to the
RETURN
0 on Success, or an error code on failure
SEE ALSO
edac_pci_alloc_ctl_info(9), edac_pci_free_ctl_info(9),
edac_pci_alloc_index(9), edac_pci_del_device(9), edac_pci_cre‐
ate_generic_ctl(9), edac_pci_release_generic_ctl(9),
edac_pci_create_sysfs(9), edac_pci_remove_sysfs(9)
August 2025 edac_pci_add_device edac_pci_add_device(9)
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <fba25efb41eadad17a54e6275a6191173d702f00.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
When SPHINXDIRS is used, basename may be identical for different
files. If this happens, the summary and error detection won't be
accurate.
Fix it by using relative names from builddir.
While here, don't duplicate names. Report, instead:
- SUCCESS
output PDF file was built
- FAILED
latexmk/xelatex didn't build any PDF output
- FAILED: no .tex files were generated
Sphinx didn't build any tex file for SPHINXDIRS directories
- FAILED ({python exception})
When a concurrent.futures is catched. Usually indicates an
internal error at the build logic.
With that, building multiple dirs with the same name is reported
properly:
$ make V=1 SPHINXDIRS="admin-guide/media driver-api/media userspace-api/media" pdfdocs
Summary
=======
admin-guide/media/pdf/media.pdf : SUCCESS
driver-api/media/pdf/media.pdf : SUCCESS
userspace-api/media/pdf/media.pdf: SUCCESS
And if at least one of them fails, return code will be 1.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <d4a4f16f6c0c423ad38531a490888be3bf01e574.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
On a properly set system, LANG and LC_ALL is always defined.
However, some distros like Debian, Gentoo and their variants
start with those undefioned.
When Sphinx tries to set a locale with:
locale.setlocale(locale.LC_ALL, '')
It raises an exception, making Sphinx fail. This is more likely
to happen with test containers.
Add a logic to detect and workaround such issue by setting
locale to C.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <1d0afad8fe3d83182be3a08eb00dd71322e23e69.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Use POSIX jobserver when available or -j<number> to run PDF
builds in parallel, restoring pdf build performance. Yet,
running it when debugging troubles is a bad idea, so, when
calling directly via command line, except if "-j" is splicitly
requested, it will serialize the build.
With such change, a PDF doc builds now takes around 5 minutes
on a Ryzen 9 machine with 32 cpu threads:
# Explicitly paralelize both Sphinx and LaTeX pdf builds
$ make cleandocs; time scripts/sphinx-build-wrapper pdfdocs -j 33
real 5m17.901s
user 15m1.499s
sys 2m31.482s
# Use POSIX jobserver to paralelize both sphinx-build and LaTeX
$ make cleandocs; time make pdfdocs
real 5m22.369s
user 15m9.076s
sys 2m31.419s
# Serializes PDF build, while keeping Sphinx parallelized.
# it is equivalent of passing -jauto via command line
$ make cleandocs; time scripts/sphinx-build-wrapper pdfdocs
real 11m20.901s
user 13m2.910s
sys 1m44.553s
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <42eef319f9af6f9feb12bcd74ca6392c8119929d.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
By default, we use LaTeX batch mode to build docs. This way, when
an error happens, the build fails. This is good for normal builds,
but when debugging problems with pdf generation, the best is to
use interactive mode.
We already support it via LATEXOPTS, but having a command line
argument makes it easier:
Interactive mode:
./scripts/sphinx-build-wrapper pdfdocs --sphinxdirs peci -v -i
...
Running 'xelatex --no-pdf -no-pdf -recorder ".../Documentation/output/peci/latex/peci.tex"'
...
Default batch mode:
./scripts/sphinx-build-wrapper pdfdocs --sphinxdirs peci -v
...
Running 'xelatex --no-pdf -no-pdf -interaction=batchmode -no-shell-escape -recorder ".../Documentation/output/peci/latex/peci.tex"'
...
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <9e5b9a8becc981b47ca3bf3ddce034f273400738.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
There are too much magic inside docs Makefile to properly run
sphinx-build. Create an ancillary script that contains all
kernel-related sphinx-build call logic currently at Makefile.
Such script is designed to work both as an standalone command
and as part of a Makefile. As such, it properly handles POSIX
jobserver used by GNU make.
On a side note, there was a line number increase due to the
conversion (ignoring comments) is:
Documentation/Makefile | 131 +++----------
tools/docs/sphinx-build-wrapper | 293 +++++++++++++++++++++++++++++++
2 files changed, 323 insertions(+), 101 deletions(-)
Comments and descriptions adds:
tools/docs/sphinx-build-wrapper | 261 +++++++++++++++++++++++++++++++-
So, about half of the script are comments/descriptions.
This is because some things are more verbosed on Python and because
it requires reading env vars from Makefile. Besides it, this script
has some extra features that don't exist at the Makefile:
- It can be called directly from command line;
- It properly return PDF build errors.
When running the script alone, it will only take handle sphinx-build
targets. On other words, it won't runn make rustdoc after building
htmlfiles, nor it will run the extra check scripts.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <80ae57b01fcfb1d338d93b8f8e26e57b69b5f16b.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Convert the code inside jobserver-exec to a class and
properly document it.
Using a class allows reusing the jobserver logic on other
scripts.
While the main code remains unchanged, being compatible with
Python 2.6 and 3.0+, its coding style now follows a more
modern standard, having tabs replaced by a 4-spaces
indent, passing autopep8, black and pylint.
The code allows using a pythonic way to enter/exit a python
code, e.g. it now supports:
with JobserverExec() as jobserver:
jobserver.run(sys.argv[1:])
With the new code, the __exit__() function should ensure
that the jobserver slot will be closed at the end, even if
something bad happens somewhere.
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Message-ID: <4749921b75d4e0bd85a25d4d94aa2c940fad084e.1758196090.git.mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
translate the "generic-hdlc.rst" into Simplified Chinese.
Update the translation through commit 16128ad8f9
("docs: networking: convert generic-hdlc.txt to ReST")
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Sun yuxi <sun.yuxi@zte.com.cn>
Reviewed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
translate the "mptcp-sysctl.rst" into Simplified Chinese.
Update the translation through commit fa3ee9dd80
("mptcp: sysctl: add available_path_managers")
Signed-off-by: Sun yuxi <sun.yuxi@zte.com.cn>
Reviewed-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
translate the "inotify.rst" into Simplified Chinese.
Update to commit de389cf08d47("docs: filesystems: convert inotify.txt to
ReST")
Signed-off-by: Wang Longjie <wang.longjie1@zte.com.cn>
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
translate the "dnotify.rst" into Simplified Chinese.
Update to commit b31763cff488("docs: filesystems: convert dnotify.txt to
ReST")
Signed-off-by: Wang Longjie <wang.longjie1@zte.com.cn>
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
translate the "gfs2-glocks.rst" into Simplified Chinese.
Update to commit 713f8834389f("gfs2: Get rid of emote_ok
checks")
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: yang tao <yang.tao172@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
translate the "gfs2-uevents.rst" into Simplified Chinese.
Update to commit 5b7ac27a6e2c("docs: filesystems: convert
gfs2-uevents.txt to ReST")
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: yang tao <yang.tao172@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
translate the "gfs2.rst" into Simplified Chinese.
Update to commit d9593868cd58("Documentation: Update
filesystems/gfs2.rst")
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: yang tao <yang.tao172@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
translate the "ubifs-authentication.rst" into Simplified Chinese.
Update to commit d56b699d76d1("Documentation: Fix typos")
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: yang tao <yang.tao172@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
translate the "ubifs.rst" into Simplified Chinese.
Update to commit 5f5cae9b0e81("Documentation: ubifs: Fix
compression idiom")
Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn>
Signed-off-by: yang tao <yang.tao172@zte.com.cn>
Signed-off-by: Alex Shi <alexs@kernel.org>
This command:
# echo foo/bar >/proc/$$/attr/smack/current
gives the task a label 'foo' w/o indication
that label does not match input.
Setting the label with lsm_set_self_attr() syscall
behaves identically.
This occures because:
1) smk_parse_smack() is used to convert input to a label
2) smk_parse_smack() takes only that part from the
beginning of the input that looks like a label.
3) `/' is prohibited in labels, so only "foo" is taken.
(2) is by design, because smk_parse_smack() is used
for parsing strings which are more than just a label.
Silent failure is not a good thing, and there are two
indicators that this was not done intentionally:
(size >= SMK_LONGLABEL) ~> invalid
clause at the beginning of the do_setattr() and the
"Returns the length of the smack label" claim
in the do_setattr() description.
So I fixed this by adding one tiny check:
the taken label length == input length.
Since input length is now strictly controlled,
I changed the two ways of setting label
smack_setselfattr(): lsm_set_self_attr() syscall
smack_setprocattr(): > /proc/.../current
to accommodate the divergence in
what they understand by "input length":
smack_setselfattr counts mandatory \0 into input length,
smack_setprocattr does not.
smack_setprocattr allows various trailers after label
Related changes:
* fixed description for smk_parse_smack
* allow unprivileged tasks validate label syntax.
* extract smk_parse_label_len() from smk_parse_smack()
so parsing may be done w/o string allocation.
* extract smk_import_valid_label() from smk_import_entry()
to avoid repeated parsing.
* smk_parse_smack(): scan null-terminated strings
for no more than SMK_LONGLABEL(256) characters
* smack_setselfattr(): require struct lsm_ctx . flags == 0
to reserve them for future.
Fixes: e114e47377 ("Smack: Simplified Mandatory Access Control Kernel")
Signed-off-by: Konstantin Andreev <andreev@swemel.ru>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
If an unprivileged task is allowed to relabel itself
(/smack/relabel-self is not empty),
it can freely create new labels by writing their
names into own /proc/PID/attr/smack/current
This occurs because do_setattr() imports
the provided label in advance,
before checking "relabel-self" list.
This change ensures that the "relabel-self" list
is checked before importing the label.
Fixes: 38416e5393 ("Smack: limited capability for changing process label")
Signed-off-by: Konstantin Andreev <andreev@swemel.ru>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
According to [1], the label of a UNIX domain socket (UDS)
file (i.e., the filesystem object representing the socket)
is not supposed to participate in Smack security.
To achieve this, [1] labels UDS files with "*"
in smack_d_instantiate().
Before [2], smack_d_instantiate() was responsible
for initializing Smack security for all inodes,
except ones under /proc
[2] imposed the sole responsibility for initializing
inode security for newly created filesystem objects
on smack_inode_init_security().
However, smack_inode_init_security() lacks some logic
present in smack_d_instantiate().
In particular, it does not label UDS files with "*".
This patch adds the missing labeling of UDS files
with "*" to smack_inode_init_security().
Labeling UDS files with "*" in smack_d_instantiate()
still works for stale UDS files that already exist on
disk. Stale UDS files are useless, but I keep labeling
them for consistency and maybe to make easier for user
to delete them.
Compared to [1], this version introduces the following
improvements:
* UDS file label is held inside inode only
and not saved to xattrs.
* relabeling UDS files (setxattr, removexattr, etc.)
is blocked.
[1] 2010-11-24 Casey Schaufler
commit b4e0d5f079 ("Smack: UDS revision")
[2] 2023-11-16 roberto.sassu
Fixes: e63d86b8b7 ("smack: Initialize the in-memory inode in smack_inode_init_security()")
Link: https://lore.kernel.org/linux-security-module/20231116090125.187209-5-roberto.sassu@huaweicloud.com/
Signed-off-by: Konstantin Andreev <andreev@swemel.ru>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
If memory allocation for the SMACK64TRANSMUTE
xattr value fails in smack_inode_init_security(),
the SMK_INODE_INSTANT flag is not set in
(struct inode_smack *issp)->smk_flags,
leaving the inode as not "instantiated".
It does not matter if fs frees the inode
after failed smack_inode_init_security() call,
but there is no guarantee for this.
To be safe, mark the inode as "instantiated",
even if allocation of xattr values fails.
Signed-off-by: Konstantin Andreev <andreev@swemel.ru>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
When a new file system object is created
and the conditions for label transmutation are met,
the SMACK64TRANSMUTE extended attribute is set
on the object regardless of its type:
file, pipe, socket, symlink, or directory.
However,
SMACK64TRANSMUTE may only be set on directories.
This bug is a combined effect of the commits [1] and [2]
which both transfer functionality
from smack_d_instantiate() to smack_inode_init_security(),
but only in part.
Commit [1] set blank SMACK64TRANSMUTE on improper object types.
Commit [2] set "TRUE" SMACK64TRANSMUTE on improper object types.
[1] 2023-06-10,
Fixes: baed456a6a ("smack: Set the SMACK64TRANSMUTE xattr in smack_inode_init_security()")
Link: https://lore.kernel.org/linux-security-module/20230610075738.3273764-3-roberto.sassu@huaweicloud.com/
[2] 2023-11-16,
Fixes: e63d86b8b7 ("smack: Initialize the in-memory inode in smack_inode_init_security()")
Link: https://lore.kernel.org/linux-security-module/20231116090125.187209-5-roberto.sassu@huaweicloud.com/
Signed-off-by: Konstantin Andreev <andreev@swemel.ru>
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
HAVE_SPHINX:=$(shell if which $(SPHINXBUILD) >/dev/null 2>&1;thenecho 1;elseecho 0;fi)
@@ -46,141 +47,46 @@ ifeq ($(HAVE_SPHINX),0)
.DEFAULT:
$(warning The '$(SPHINXBUILD)'command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.)
@echo
@$(srctree)/scripts/sphinx-pre-install
@$(srctree)/tools/docs/sphinx-pre-install
@echo " SKIP Sphinx $@ target."
else# HAVE_SPHINX
# User-friendly check for pdflatex and latexmk
HAVE_PDFLATEX:=$(shell if which $(PDFLATEX) >/dev/null 2>&1;thenecho 1;elseecho 0;fi)
HAVE_LATEXMK:=$(shell if which latexmk >/dev/null 2>&1;thenecho 1;elseecho 0;fi)
Set scheduling parameters to the osnoise tracer threads, the format to set the priority are:
Set scheduling parameters to the |tool| tracer threads, the format to set the priority are:
- *o:prio* - use SCHED_OTHER with *prio*;
- *r:prio* - use SCHED_RR with *prio*;
- *f:prio* - use SCHED_FIFO with *prio*;
- *d:runtime[us|ms|s]:period[us|ms|s]* - use SCHED_DEADLINE with *runtime* and *period* in nanoseconds.
If not set, tracer threads keep their default priority. For rtla user threads, it is set to SCHED_FIFO with priority 95. For kernel threads, see *osnoise* and *timerlat* tracer documentation for the running kernel version.
**-C**, **--cgroup**\[*=cgroup*]
Set a *cgroup* to the tracer's threads. If the **-C** option is passed without arguments, the tracer's thread will inherit **rtla**'s *cgroup*. Otherwise, the threads will be placed on the *cgroup* passed to the option.
If not set, the behavior differs between workload types. User workloads created by rtla will inherit rtla's cgroup. Kernel workloads are assigned the root cgroup.
**--warm-up** *s*
After starting the workload, let it run for *s* seconds before starting collecting the data, allowing the system to warm-up. Statistical data generated during warm-up is discarded.
@@ -53,6 +61,8 @@
**--trace-buffer-size** *kB*
Set the per-cpu trace buffer size in kB for the tracing output.
If not set, the default tracefs buffer size is used.
**--on-threshold** *action*
Defines an action to be executed when tracing is stopped on a latency threshold
@@ -67,7 +77,7 @@
- *trace[,file=<filename>]*
Saves trace output, optionally taking a filename. Alternative to -t/--trace.
Note that nlike -t/--trace, specifying this multiple times will result in
Note that unlike -t/--trace, specifying this multiple times will result in
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.