For now, migrate_enable and migrate_disable are global, which makes them
become hotspots in some case. Take BPF for example, the function calling
to migrate_enable and migrate_disable in BPF trampoline can introduce
significant overhead, and following is the 'perf top' of FENTRY's
benchmark (./tools/testing/selftests/bpf/bench trig-fentry):
54.63% bpf_prog_2dcccf652aac1793_bench_trigger_fentry [k]
bpf_prog_2dcccf652aac1793_bench_trigger_fentry
10.43% [kernel] [k] migrate_enable
10.07% bpf_trampoline_6442517037 [k] bpf_trampoline_6442517037
8.06% [kernel] [k] __bpf_prog_exit_recur
4.11% libc.so.6 [.] syscall
2.15% [kernel] [k] entry_SYSCALL_64
1.48% [kernel] [k] memchr_inv
1.32% [kernel] [k] fput
1.16% [kernel] [k] _copy_to_user
0.73% [kernel] [k] bpf_prog_test_run_raw_tp
So in this commit, we make migrate_enable/migrate_disable inline to obtain
better performance. The struct rq is defined internally in
kernel/sched/sched.h, and the field "nr_pinned" is accessed in
migrate_enable/migrate_disable, which makes it hard to make them inline.
Alexei Starovoitov suggests to generate the offset of "nr_pinned" in [1],
so we can define the migrate_enable/migrate_disable in
include/linux/sched.h and access "this_rq()->nr_pinned" with
"(void *)this_rq() + RQ_nr_pinned".
The offset of "nr_pinned" is generated in include/generated/rq-offsets.h
by kernel/sched/rq-offsets.c.
Generally speaking, we move the definition of migrate_enable and
migrate_disable to include/linux/sched.h from kernel/sched/core.c. The
calling to __set_cpus_allowed_ptr() is leaved in ___migrate_enable().
The "struct rq" is not available in include/linux/sched.h, so we can't
access the "runqueues" with this_cpu_ptr(), as the compilation will fail
in this_cpu_ptr() -> raw_cpu_ptr() -> __verify_pcpu_ptr():
typeof((ptr) + 0)
So we introduce the this_rq_raw() and access the runqueues with
arch_raw_cpu_ptr/PERCPU_PTR directly.
The variable "runqueues" is not visible in the kernel modules, and export
it is not a good idea. As Peter Zijlstra advised in [2], we define and
export migrate_enable/migrate_disable in kernel/sched/core.c too, and use
them for the modules.
Before this patch, the performance of BPF FENTRY is:
fentry : 113.030 ± 0.149M/s
fentry : 112.501 ± 0.187M/s
fentry : 112.828 ± 0.267M/s
fentry : 115.287 ± 0.241M/s
After this patch, the performance of BPF FENTRY increases to:
fentry : 143.644 ± 0.670M/s
fentry : 149.764 ± 0.362M/s
fentry : 149.642 ± 0.156M/s
fentry : 145.263 ± 0.221M/s
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/CAADnVQ+5sEDKHdsJY5ZsfGDO_1SEhhQWHrt2SMBG5SYyQ+jt7w@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/20250819123214.GH4067720@noisy.programming.kicks-ass.net/ [2]
In the next commit, we will move the definition of migrate_enable() and
migrate_disable() to linux/sched.h. However,
migrate_enable/migrate_disable will be used in commit
1b93c03fb3 ("rcu: add rcu_read_lock_dont_migrate()") in bpf-next tree.
In order to fix potential compiling error, replace linux/preempt.h with
linux/sched.h in include/linux/rcupdate.h.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
The include/generated/asm-offsets.h is generated in Kbuild during
compiling from arch/SRCARCH/kernel/asm-offsets.c. When we want to
generate another similar offset header file, circular dependency can
happen.
For example, we want to generate a offset file include/generated/test.h,
which is included in include/sched/sched.h. If we generate asm-offsets.h
first, it will fail, as include/sched/sched.h is included in asm-offsets.c
and include/generated/test.h doesn't exist; If we generate test.h first,
it can't success neither, as include/generated/asm-offsets.h is included
by it.
In x86_64, the macro COMPILE_OFFSETS is used to avoid such circular
dependency. We can generate asm-offsets.h first, and if the
COMPILE_OFFSETS is defined, we don't include the "generated/test.h".
And we define the macro COMPILE_OFFSETS for all the asm-offsets.c for this
purpose.
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
When doing load balance and the target cfs_rq is in throttled hierarchy,
whether to allow balancing there is a question.
The good side to allow balancing is: if the target CPU is idle or less
loaded and the being balanced task is holding some kernel resources,
then it seems a good idea to balance the task there and let the task get
the CPU earlier and release kernel resources sooner. The bad part is, if
the task is not holding any kernel resources, then the balance seems not
that useful.
While theoretically it's debatable, a performance test[0] which involves
200 cgroups and each cgroup runs hackbench(20 sender, 20 receiver) in
pipe mode showed a performance degradation on AMD Genoa when allowing
load balance to throttled cfs_rq. Analysis[1] showed hackbench doesn't
like task migration across LLC boundary. For this reason, add a check in
can_migrate_task() to forbid balancing to a cfs_rq that is in throttled
hierarchy. This reduced task migration a lot and performance restored.
[0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/
[1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
With the introduction of task based throttle model, task in a throttled
hierarchy is allowed to continue to run till it gets throttled on its
ret2user path.
For this reason, remove those throttled_hierarchy() checks in the
following functions so that those tasks can get their turn as normal
tasks: dequeue_entities(), check_preempt_wakeup_fair() and
yield_to_task_fair().
The benefit of doing it this way is: if those tasks gets the chance to
run earlier and if they hold any kernel resources, they can release
those resources earlier. The downside is, if they don't hold any kernel
resouces, all they can do is to throttle themselves on their way back to
user space so the favor to let them run seems not that useful and for
check_preempt_wakeup_fair(), that favor may be bad for curr.
K Prateek Nayak pointed out prio_changed_fair() can send a throttled
task to check_preempt_wakeup_fair(), further tests showed the affinity
change path from move_queued_task() can also send a throttled task to
check_preempt_wakeup_fair(), that's why the check of task_is_throttled()
in that function.
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
With task based throttle model, tasks in a throttled hierarchy are
allowed to continue to run if they are running in kernel mode. For this
reason, PELT clock is not stopped for these cfs_rqs in throttled
hierarchy when they still have tasks running or queued.
Since PELT clock is not stopped, whether to allow update_cfs_group()
doing its job for cfs_rqs which are in throttled hierarchy but still
have tasks running/queued is a question.
The good side is, continue to run update_cfs_group() can get these
cfs_rq entities with an up2date weight and that up2date weight can be
useful to derive an accurate load for the CPU as well as ensure fairness
if multiple tasks of different cgroups are running on the same CPU.
OTOH, as Benjamin Segall pointed: when unthrottle comes around the most
likely correct distribution is the distribution we had at the time of
throttle.
In reality, either way may not matter that much if tasks in throttled
hierarchy don't run in kernel mode for too long. But in case that
happens, let these cfs_rq entities have an up2date weight seems a good
thing to do.
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Before task based throttle model, propagating load will stop at a
throttled cfs_rq and that propagate will happen on unthrottle time by
update_load_avg().
Now that there is no update_load_avg() on unthrottle for throttled
cfs_rq and all load tracking is done by task related operations, let the
propagate happen immediately.
While at it, add a comment to explain why cfs_rqs that are not affected
by throttle have to be added to leaf cfs_rq list in
propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff
("sched/fair: Fix unfairness caused by missing load decay").
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
With task based throttle model, the previous way to check cfs_rq's
nr_queued to decide if throttled time should be accounted doesn't work
as expected, e.g. when a cfs_rq which has a single task is throttled,
that task could later block in kernel mode instead of being dequeued on
limbo list and accounting this as throttled time is not accurate.
Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.
Note that there will be a time gap between when a cfs_rq is throttled
and when a task in its hierarchy is actually throttled. This accounting
mechanism only starts accounting in the latter case.
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism
Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-5-ziqianlu@bytedance.com
In current throttle model, when a cfs_rq is throttled, its entity will
be dequeued from cpu's rq, making tasks attached to it not able to run,
thus achiveing the throttle target.
This has a drawback though: assume a task is a reader of percpu_rwsem
and is waiting. When it gets woken, it can not run till its task group's
next period comes, which can be a relatively long time. Waiting writer
will have to wait longer due to this and it also makes further reader
build up and eventually trigger task hung.
To improve this situation, change the throttle model to task based, i.e.
when a cfs_rq is throttled, record its throttled status but do not remove
it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
they get picked, add a task work to them so that when they return
to user, they can be dequeued there. In this way, tasks throttled will
not hold any kernel resources. And on unthrottle, enqueue back those
tasks so they can continue to run.
Throttled cfs_rq's PELT clock is handled differently now: previously the
cfs_rq's PELT clock is stopped once it entered throttled state but since
now tasks(in kernel mode) can continue to run, change the behaviour to
stop PELT clock when the throttled cfs_rq has no tasks left.
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20250829081120.806-4-ziqianlu@bytedance.com
Since all these functions are address-taken in SDTL_INIT() and called
indirectly, it doesn't really make sense for them to be inline.
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Leon [1] and Vinicius [2] noted a topology_span_sane() warning during
their testing starting from v6.16-rc1. Debug that followed pointed to
the tl->mask() for the NODE domain being incorrectly resolved to that of
the highest NUMA domain.
tl->mask() for NODE is set to the sd_numa_mask() which depends on the
global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
set to the "tl->numa_level" during tl traversal in build_sched_domains()
calling sd_init() but was not reset before topology_span_sane().
Since "tl->numa_level" still reflected the old value from
build_sched_domains(), topology_span_sane() for the NODE domain trips
when the span of the last NUMA domain overlaps.
Instead of replicating the "sched_domains_curr_level" hack, get rid of
it entirely and instead, pass the entire "sched_domain_topology_level"
object to tl->cpumask() function to prevent such mishap in the future.
sd_numa_mask() now directly references "tl->numa_level" instead of
relying on the global "sched_domains_curr_level" hack to index into
sched_domains_numa_masks[].
The original warning was reproducible on the following NUMA topology
reported by Leon:
$ sudo numactl -H
available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: 2927 MB
node 0 free: 1603 MB
node 1 cpus: 2 3
node 1 size: 3023 MB
node 1 free: 3008 MB
node 2 cpus: 4 5
node 2 size: 3023 MB
node 2 free: 3007 MB
node 3 cpus: 6 7
node 3 size: 3023 MB
node 3 free: 3002 MB
node 4 cpus: 8 9
node 4 size: 3022 MB
node 4 free: 2718 MB
node distances:
node 0 1 2 3 4
0: 10 39 38 37 36
1: 39 10 38 37 36
2: 38 38 10 37 36
3: 37 37 37 10 36
4: 36 36 36 36 10
The above topology can be mimicked using the following QEMU cmd that was
used to reproduce the warning and test the fix:
sudo qemu-system-x86_64 -enable-kvm -cpu host \
-m 20G -smp cpus=10,sockets=10 -machine q35 \
-object memory-backend-ram,size=4G,id=m0 \
-object memory-backend-ram,size=4G,id=m1 \
-object memory-backend-ram,size=4G,id=m2 \
-object memory-backend-ram,size=4G,id=m3 \
-object memory-backend-ram,size=4G,id=m4 \
-numa node,cpus=0-1,memdev=m0,nodeid=0 \
-numa node,cpus=2-3,memdev=m1,nodeid=1 \
-numa node,cpus=4-5,memdev=m2,nodeid=2 \
-numa node,cpus=6-7,memdev=m3,nodeid=3 \
-numa node,cpus=8-9,memdev=m4,nodeid=4 \
-numa dist,src=0,dst=1,val=39 \
-numa dist,src=0,dst=2,val=38 \
-numa dist,src=0,dst=3,val=37 \
-numa dist,src=0,dst=4,val=36 \
-numa dist,src=1,dst=0,val=39 \
-numa dist,src=1,dst=2,val=38 \
-numa dist,src=1,dst=3,val=37 \
-numa dist,src=1,dst=4,val=36 \
-numa dist,src=2,dst=0,val=38 \
-numa dist,src=2,dst=1,val=38 \
-numa dist,src=2,dst=3,val=37 \
-numa dist,src=2,dst=4,val=36 \
-numa dist,src=3,dst=0,val=37 \
-numa dist,src=3,dst=1,val=37 \
-numa dist,src=3,dst=2,val=37 \
-numa dist,src=3,dst=4,val=36 \
-numa dist,src=4,dst=0,val=36 \
-numa dist,src=4,dst=1,val=36 \
-numa dist,src=4,dst=2,val=36 \
-numa dist,src=4,dst=3,val=36 \
...
[ prateek: Moved common functions to include/linux/sched/topology.h,
reuse the common bits for s390 and ppc, commit message ]
Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
Fixes: ccf74128d6 ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84, f55dac1daf
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Valentin Schneider <vschneid@redhat.com> # x86
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc
Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
When a CPU chooses to call push_dl_task and picks a task to push to
another CPU's runqueue then it will call find_lock_later_rq method
which would take a double lock on both CPUs' runqueues. If one of the
locks aren't readily available, it may lead to dropping the current
runqueue lock and reacquiring both the locks at once. During this window
it is possible that the task is already migrated and is running on some
other CPU. These cases are already handled. However, if the task is
migrated and has already been executed and another CPU is now trying to
wake it up (ttwu) such that it is queued again on the runqeue
(on_rq is 1) and also if the task was run by the same CPU, then the
current checks will pass even though the task was migrated out and is no
longer in the pushable tasks list.
Please go through the original rt change for more details on the issue.
To fix this, after the lock is obtained inside the find_lock_later_rq,
it ensures that the task is still at the head of pushable tasks list.
Also removed some checks that are no longer needed with the addition of
this new check.
However, the new check of pushable tasks list only applies when
find_lock_later_rq is called by push_dl_task. For the other caller i.e.
dl_task_offline_migration, existing checks are used.
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250408045021.3283624-1-harshit@nutanix.com
Pull x86 fixes from Borislav Petkov:
- Convert the SSB mitigation to the attack vector controls which got
forgotten at the time
- Prevent the CPUID topology hierarchy detection on AMD from
overwriting the correct initial APIC ID
- Fix the case of a machine shipping without microcode in the BIOS, in
the AMD microcode loader
- Correct the Pentium 4 model range which has a constant TSC
* tag 'x86_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Add attack vector controls for SSB
x86/cpu/topology: Use initial APIC ID from XTOPOLOGY leaf on AMD/HYGON
x86/microcode/AMD: Handle the case of no BIOS microcode
x86/cpu/intel: Fix the constant_tsc model check for Pentium 4
Pull scheduler fixes from Borislav Petkov:
- Fix a stall on the CPU offline path due to mis-counting a deadline
server task twice as part of the runqueue's running tasks count
- Fix a realtime tasks starvation case where failure to enqueue a timer
whose expiration time is already in the past would cause repeated
attempts to re-enqueue a deadline server task which leads to starving
the former, realtime one
- Prevent a delayed deadline server task stop from breaking the
per-runqueue bandwidth tracking
- Have a function checking whether the deadline server task has
stopped, return the correct value
* tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Don't count nr_running for dl_server proxy tasks
sched/deadline: Fix RT task potential starvation when expiry time passed
sched/deadline: Always stop dl-server before changing parameters
sched/deadline: Fix dl_server_stopped()
Pull irq fixes from Borislav Petkov:
- Remove unnecessary and noisy WARN_ONs in gic-v5's init path
- Avoid a kmemleak false positive for the gic-v5's L2 IST table entries
- Fix a retval check in mvebu-gicp's probe function
- Fix a wrong conversion to guards in atmel-aic[5] irqchip
* tag 'irq_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v5: Remove undue WARN_ON()s in the IRS affinity parsing
irqchip/gic-v5: Fix kmemleak L2 IST table entries false positives
irqchip/mvebu-gicp: Fix an IS_ERR() vs NULL check in probe()
irqchip/atmel-aic[5]: Fix incorrect lock guard conversion
Pull hardening fixes from Kees Cook:
- ARM: stacktrace: include asm/sections.h in asm/stacktrace.h (Arnd
Bergmann)
- ubsan: Fix incorrect hand-side used in handle (Junhui Pei)
- hardening: Require clang 20.1.0 for __counted_by (Nathan Chancellor)
* tag 'hardening-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
hardening: Require clang 20.1.0 for __counted_by
ARM: stacktrace: include asm/sections.h in asm/stacktrace.h
ubsan: Fix incorrect hand-side used in handle
Pull gpio fixes from Bartosz Golaszewski:
- fix an off-by-one bug in interrupt handling in gpio-timberdale
- update MAINTAINERS
* tag 'gpio-fixes-for-v6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: Change Altera-PIO driver maintainer
gpio: timberdale: fix off-by-one in IRQ type boundary check
Pull arm64 fixes from Catalin Marinas:
- CFI failure due to kpti_ng_pgd_alloc() signature mismatch
- Underallocation bug in the SVE ptrace kselftest
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
kselftest/arm64: Don't open code SVE_PT_SIZE() in fp-ptrace
arm64: mm: Fix CFI failure due to kpti_ng_pgd_alloc function signature
In fp-trace when allocating a buffer to write SVE register data we open
code the addition of the header size to the VL depeendent register data
size, which lead to an underallocation bug when we cut'n'pasted the code
for FPSIMD format writes. Use the SVE_PT_SIZE() macro that the kernel
UAPI provides for this.
Fixes: b84d2b2795 ("kselftest/arm64: Test FPSIMD format data writes via NT_ARM_SVE in fp-ptrace")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250812-arm64-fp-trace-macro-v1-1-317cfff986a5@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Seen during KPTI initialization:
CFI failure at create_kpti_ng_temp_pgd+0x124/0xce8 (target: kpti_ng_pgd_alloc+0x0/0x14; expected type: 0xd61b88b6)
The call site is alloc_init_pud() at arch/arm64/mm/mmu.c:
pud_phys = pgtable_alloc(TABLE_PUD);
alloc_init_pud() has the prototype:
static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
int flags)
where the pgtable_alloc() prototype is declared.
The target (kpti_ng_pgd_alloc) is used in arch/arm64/kernel/cpufeature.c:
create_kpti_ng_temp_pgd(kpti_ng_temp_pgd, __pa(alloc), KPTI_NG_TEMP_VA,
PAGE_SIZE, PAGE_KERNEL, kpti_ng_pgd_alloc, 0);
which is an alias for __create_pgd_mapping_locked() with prototype:
extern __alias(__create_pgd_mapping_locked)
void create_kpti_ng_temp_pgd(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt,
phys_addr_t size, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(enum pgtable_type),
int flags);
__create_pgd_mapping_locked() passes the function pointer down:
__create_pgd_mapping_locked() -> alloc_init_p4d() -> alloc_init_pud()
But the target function (kpti_ng_pgd_alloc) has the wrong signature:
static phys_addr_t __init kpti_ng_pgd_alloc(int shift);
The "int" should be "enum pgtable_type".
To make "enum pgtable_type" available to cpufeature.c, move
enum pgtable_type definition from arch/arm64/mm/mmu.c to
arch/arm64/include/asm/mmu.h.
Adjust kpti_ng_pgd_alloc to use "enum pgtable_type" instead of "int".
The function behavior remains identical (parameter is unused).
Fixes: c64f46ee13 ("arm64: mm: use enum to identify pgtable level instead of *_SHIFT")
Cc: <stable@vger.kernel.org> # 6.16.x
Signed-off-by: Kees Cook <kees@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250829190721.it.373-kees@kernel.org
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Correctly handle 'invariant' system registers for protected VMs
- Improved handling of VNCR data aborts, including external aborts
- Fixes for handling of FEAT_RAS for NV guests, providing a sane
fault context during SEA injection and preventing the use of
RASv1p1 fault injection hardware
- Ensure that page table destruction when a VM is destroyed gives an
opportunity to reschedule
- Large fix to KVM's infrastructure for managing guest context loaded
on the CPU, addressing issues where the output of AT emulation
doesn't get reflected to the guest
- Fix AT S12 emulation to actually perform stage-2 translation when
necessary
- Avoid attempting vLPI irqbypass when GICv4 has been explicitly
disabled for a VM
- Minor KVM + selftest fixes
RISC-V:
- Fix pte settings within kvm_riscv_gstage_ioremap()
- Fix comments in kvm_riscv_check_vcpu_requests()
- Fix stack overrun when setting vlenb via ONE_REG
x86:
- Use array_index_nospec() to sanitize the target vCPU ID when
handling PV IPIs and yields as the ID is guest-controlled.
- Drop a superfluous cpumask_empty() check when reclaiming SEV
memory, as the common case, by far, is that at least one CPU will
have entered the VM, and wbnoinvd_on_cpus_mask() will naturally
handle the rare case where the set of have_run_cpus is empty.
Selftests (not KVM):
- Rename the is_signed_type() macro in kselftest_harness.h to
is_signed_var() to fix a collision with linux/overflow.h. The
collision generates compiler warnings due to the two macros having
different meaning"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits)
KVM: arm64: nv: Fix ATS12 handling of single-stage translation
KVM: arm64: Remove __vcpu_{read,write}_sys_reg_{from,to}_cpu()
KVM: arm64: Fix vcpu_{read,write}_sys_reg() accessors
KVM: arm64: Simplify sysreg access on exception delivery
KVM: arm64: Check for SYSREGS_ON_CPU before accessing the 32bit state
RISC-V: KVM: fix stack overrun when loading vlenb
RISC-V: KVM: Correct kvm_riscv_check_vcpu_requests() comment
RISC-V: KVM: Fix pte settings within kvm_riscv_gstage_ioremap()
KVM: arm64: selftests: Sync ID_AA64MMFR3_EL1 in set_id_regs
KVM: arm64: Get rid of ARM64_FEATURE_MASK()
KVM: arm64: Make ID_AA64PFR1_EL1.RAS_frac writable
KVM: arm64: Make ID_AA64PFR0_EL1.RAS writable
KVM: arm64: Ignore HCR_EL2.FIEN set by L1 guest's EL2
KVM: arm64: Handle RASv1p1 registers
arm64: Add capability denoting FEAT_RASv1p1
KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
KVM: arm64: Split kvm_pgtable_stage2_destroy()
selftests: harness: Rename is_signed_type() to avoid collision with overflow.h
KVM: SEV: don't check have_run_cpus in sev_writeback_caches()
KVM: arm64: Correctly populate FAR_EL2 on nested SEA injection
...
After an innocuous change in -next that modified a structure that
contains __counted_by, clang-19 start crashing when building certain
files in drivers/gpu/drm/xe. When assertions are enabled, the more
descriptive failure is:
clang: clang/lib/AST/RecordLayoutBuilder.cpp:3335: const ASTRecordLayout &clang::ASTContext::getASTRecordLayout(const RecordDecl *) const: Assertion `D && "Cannot get layout of forward declarations!"' failed.
According to a reverse bisect, a tangential change to the LLVM IR
generation phase of clang during the LLVM 20 development cycle [1]
resolves this problem. Bump the version of clang that enables
CONFIG_CC_HAS_COUNTED_BY to 20.1.0 to ensure that this issue cannot be
hit.
Link: 160fb1121c [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Justin Stitt <justinstitt@google.com>
Link: https://lore.kernel.org/r/20250807-fix-counted_by-clang-19-v1-1-902c86c1d515@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
KVM/arm64 changes for 6.17, take #2
- Correctly handle 'invariant' system registers for protected VMs
- Improved handling of VNCR data aborts, including external aborts
- Fixes for handling of FEAT_RAS for NV guests, providing a sane
fault context during SEA injection and preventing the use of
RASv1p1 fault injection hardware
- Ensure that page table destruction when a VM is destroyed gives an
opportunity to reschedule
- Large fix to KVM's infrastructure for managing guest context loaded
on the CPU, addressing issues where the output of AT emulation
doesn't get reflected to the guest
- Fix AT S12 emulation to actually perform stage-2 translation when
necessary
- Avoid attempting vLPI irqbypass when GICv4 has been explicitly
disabled for a VM
- Minor KVM + selftest fixes
KVM/riscv fixes for 6.17, take #1
- Fix pte settings within kvm_riscv_gstage_ioremap()
- Fix comments in kvm_riscv_check_vcpu_requests()
- Fix stack overrun when setting vlenb via ONE_REG
Pull EFI fixes from Ard Biesheuvel:
- Assorted fixes for the OP-TEE based pseudo-EFI variable store
- Fix for an OOB access when looking up the same non-existing efivarfs
entry multiple times in parallel
* tag 'efi-fixes-for-v6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efivarfs: Fix slab-out-of-bounds in efivarfs_d_compare
efi: stmm: Drop unneeded null pointer check
efi: stmm: Drop unused EFI error from setup_mm_hdr arguments
efi: stmm: Do not return EFI_OUT_OF_RESOURCES on internal errors
efi: stmm: Fix incorrect buffer allocation method
Pull smb client fixes from Steve French:
- Fix possible refcount leak in compound operations
- Fix remap_file_range() return code mapping, found by generic/157
* tag 'v6.17-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
fs/smb: Fix inconsistent refcnt update
smb3 client: fix return code mapping of remap_file_range
Pull xfs fixes from Carlos Maiolino:
"The highlight I'd like to point here is related to the XFS_RT
Kconfig, which has been updated to be enabled by default now if
CONFIG_BLK_DEV_ZONED is enabled.
This also contains a few fixes for zoned devices support in XFS,
specially related to swapon requests in inodes belonging to the zoned
FS.
A null-ptr dereference fix in the xattr data, due to a mishandling of
medium errors generated by block devices is also included"
* tag 'xfs-fixes-6.17-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: do not propagate ENODATA disk errors into xattr code
xfs: reject swapon for inodes on a zoned file system earlier
xfs: kick off inodegc when failing to reserve zoned blocks
xfs: remove xfs_last_used_zone
xfs: Default XFS_RT to Y if CONFIG_BLK_DEV_ZONED is enabled
Pull HID fixes from Jiri Kosina:
- fixes for memory corruption in intel-thc-hid, hid-multitouch,
hid-mcp2221 and hid-asus (Aaron Ma, Qasim Ijaz, Arnaud Lecomte)
- power management/resume fix for intel-ish-hid (Zhang Lixu)
- driver reinitialization fix for intel-thc-hid (Even Xu)
- ensure that battery level status is reported as soon as possible,
which is required at least for some Android use-cases (José Expósito)
- quite a few new device ID additions and device-specific quirks
* tag 'hid-for-linus-2025082901' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: quirks: add support for Legion Go dual dinput modes
HID: elecom: add support for ELECOM M-DT2DRBK
HID: logitech: Add ids for G PRO 2 LIGHTSPEED
HID: input: report battery status changes immediately
HID: input: rename hidinput_set_battery_charge_status()
HID: intel-thc-hid: Intel-quicki2c: Enhance driver re-install flow
HID: hid-ntrig: fix unable to handle page fault in ntrig_report_version()
HID: asus: fix UAF via HID_CLAIMED_INPUT validation
hid: fix I2C read buffer overflow in raw_event() for mcp2221
HID: wacom: Add a new Art Pen 2
HID: multitouch: fix slab out-of-bounds access in mt_report_fixup()
HID: Kconfig: Fix spelling mistake "enthropy" -> "entropy"
HID: intel-ish-hid: Increase ISHTP resume ack timeout to 300ms
HID: intel-thc-hid: intel-thc: Fix incorrect pointer arithmetic in I2C regs save
HID: intel-thc-hid: intel-quicki2c: Fix ACPI dsd ICRS/ISUB length
Pull regulator fix from Mark Brown:
"One simple fix for the pm8008 driver for poor error handling,
switching to use a helper which does the right thing in the
affected case"
* tag 'regulator-fix-v6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: pm8008: fix probe failure due to negative voltage selector
Pull ata fixes from Damien Le Moal:
- Fix the type of return values to be signed in the ahci_xgen driver
(Qianfeng)
- Add the mask_port_ext module parameter to the ahci driver.
This is to allow a user to ignore ports that are advertized as
external (hotplug capable) in favor of lower link power management
policies instead of the default max_performance for these ports.
This is useful to allow e.g. laptops to go into low power states when
hooked up to docking station with sata slots, connected with an
external port for hotplug (me)
* tag 'ata-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci_xgene: Use int type for 'rc' to store error codes
ata: ahci: Allow ignoring the external/hotplug capability of ports
Pull drm fixes from Dave Airlie:
"Weekly fixes, feels a bit big.
The major piece is msm fixes, then the usual amdgpu/xe along with some
mediatek and nouveau fixes and a tegra revert.
gpuvm:
- fix some typos
xe:
- Fix user-fence race issue
- Couple xe_vm fixes
- Don't trigger rebind on initial dma-buf validation
- Fix a build issue related to basename() posix vs gnu discrepancy
amdgpu:
- pin buffers while vmapping
- UserQ fixes
- Revert CSA fix
- SR-IOV fix
nouveau:
- fix linear modifier
- remove some dead code
msm:
- Core/GPU:
- fix comment doc warning in gpuvm
- fix build with KMS disabled
- fix pgtable setup/teardown race
- global fault counter fix
- various error path fixes
- GPU devcoredump snapshot fixes
- handle in-place VM_BIND remaps to solve turnip vm update race
- skip re-emitting IBs for unusable VMs
- Don't use %pK through printk
- moved display snapshot init earlier, fixing a crash
- DPU:
- Fixed crash in virtual plane checking code
- Fixed mode comparison in virtual plane checking code
- DSI:
- Adjusted width of resulution-related registers
- Fixed locking issue on 14nm PLLs
- UBWC (per Bjorn's ack)
- Added UBWC configuration for several missing platforms (fixing
regression)
mediatek:
- Add error handling for old state CRTC in atomic_disable
- Fix DSI host and panel bridge pre-enable order
- Fix device/node reference count leaks in mtk_drm_get_all_drm_priv
- mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
tegra:
- revert dma-buf change"
* tag 'drm-fixes-2025-08-29' of https://gitlab.freedesktop.org/drm/kernel: (56 commits)
drm/mediatek: mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
drm/amdgpu/userq: fix error handling of invalid doorbell
drm/amdgpu: update firmware version checks for user queue support
drm/amd/amdgpu: disable hwmon power1_cap* for gfx 11.0.3 on vf mode
Revert "drm/amdgpu: fix incorrect vm flags to map bo"
drm/amdgpu/gfx12: set MQD as appriopriate for queue types
drm/amdgpu/gfx11: set MQD as appriopriate for queue types
drm/xe: switch to local xbasename() helper
drm/xe: Don't trigger rebind on initial dma-buf validation
drm/xe/vm: Clear the scratch_pt pointer on error
drm/xe/vm: Don't pin the vm_resv during validation
drm/xe/xe_sync: avoid race during ufence signaling
Revert "drm/tegra: Use dma_buf from GEM object instance"
soc: qcom: use no-UBWC config for MSM8956/76
soc: qcom: add configuration for MSM8929
soc: qcom: ubwc: add more missing platforms
soc: qcom: ubwc: use no-uwbc config for MSM8917
drm/msm/dpu: Add a null ptr check for dpu_encoder_needs_modeset
dt-bindings: display/msm: qcom,mdp5: drop lut clock
drm/gpuvm: fix various typos in .c and .h gpuvm file
...
Pull block fixes from Jens Axboe:
- Fix a lockdep spotted issue on recursive locking for zoned writes, in
case of errors
- Update bcache MAINTAINERS entry address for Coly
- Fix for a ublk release issue, with selftests
- Fix for a regression introduced in this cycle, where it assumed
q->rq_qos was always set if the bio flag indicated that
- Fix for a regression introduced in this cycle, where loop retrieving
block device sizes got broken
* tag 'block-6.17-20250828' of git://git.kernel.dk/linux:
bcache: change maintainer's email address
ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device
ublk: avoid ublk_io_release() called after ublk char dev is closed
block: validate QoS before calling __rq_qos_done_bio()
blk-zoned: Fix a lockdep complaint about recursive locking
loop: fix zero sized loop for block special file
Pull io_uring fixes from Jens Axboe:
- Use the proper type for min_t() in getting the min of the leftover
bytes and the buffer length.
- As good practice, use READ_ONCE() consistently for reading ring
provided buffer lengths. Additionally, stop looping for incremental
commits if a zero sized buffer is hit, as no further progress can be
made at that point.
* tag 'io_uring-6.17-20250828' of git://git.kernel.dk/linux:
io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths
io_uring/kbuf: fix signedness in this_len calculation
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth.
Current release - regressions:
- ipv4: fix regression in local-broadcast routes
- vsock: fix error-handling regression introduced in v6.17-rc1
Previous releases - regressions:
- bluetooth:
- mark connection as closed during suspend disconnect
- fix set_local_name race condition
- eth:
- ice: fix NULL pointer dereference on reset
- mlx5: fix memory leak in hws_pool_buddy_init error path
- bnxt_en: fix stats context reservation logic
- hv: fix loss of receive events from host during channel open
Previous releases - always broken:
- page_pool: fix incorrect mp_ops error handling
- sctp: initialize more fields in sctp_v6_from_sk()
- eth:
- octeontx2-vf: fix max packet length errors
- idpf: fix Tx flow scheduling to avoid Tx timeouts
- bnxt_en: fix memory corruption during ifdown
- ice: fix incorrect counter for buffer allocation failures
- mlx5: fix lockdep assertion on sync reset unload event
- fbnic: fixup rtnl_lock and devl_lock handling
- xgmac: do not enable RX FIFO overflow interrupts
- phy: mscc: fix when PTP clock is register and unregister
Misc:
- add Telit Cinterion LE910C4-WWX new compositions"
* tag 'net-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (60 commits)
net: ipv4: fix regression in local-broadcast routes
net: macb: Disable clocks once
fbnic: Move phylink resume out of service_task and into open/close
fbnic: Fixup rtnl_lock and devl_lock handling related to mailbox code
net: rose: fix a typo in rose_clear_routes()
l2tp: do not use sock_hold() in pppol2tp_session_get_sock()
sctp: initialize more fields in sctp_v6_from_sk()
MAINTAINERS: rmnet: Update email addresses
net: rose: include node references in rose_neigh refcount
net: rose: convert 'use' field to refcount_t
net: rose: split remove and free operations in rose_remove_neigh()
net: hv_netvsc: fix loss of early receive events from host during channel open.
net: stmmac: Set CIC bit only for TX queues with COE
net: stmmac: xgmac: Correct supported speed modes
net: stmmac: xgmac: Do not enable RX FIFO Overflow interrupts
net/mlx5e: Set local Xoff after FW update
net/mlx5e: Update and set Xon/Xoff upon port speed set
net/mlx5e: Update and set Xon/Xoff upon MTU set
net/mlx5: Prevent flow steering mode changes in switchdev mode
net/mlx5: Nack sync reset when SFs are present
...
Mediatek DRM Fixes - 20250829
1. Add error handling for old state CRTC in atomic_disable
2. Fix DSI host and panel bridge pre-enable order
3. Fix device/node reference count leaks in mtk_drm_get_all_drm_priv
4. mtk_hdmi: Fix inverted parameters in some regmap_update_bits calls
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Chun-Kuang Hu <chunkuang.hu@kernel.org>
Link: https://lore.kernel.org/r/20250828234116.4960-1-chunkuang.hu@kernel.org
Pull power management fix from Rafael Wysocki:
"Add missing locking annotations to two recently introduced
list_for_each_entry_rcu() loops in the core device suspend/resume
code (Johannes Berg)"
* tag 'pm-6.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: annotate RCU list iterations
Pull dma-mapping fixes from Marek Szyprowski:
- another small fix for arm64 systems with memory encryption (Shanker
Donthineni)
- fix for arm32 systems with non-standard CMA configuration (Oreoluwa
Babatunde)
* tag 'dma-mapping-6.17-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma/pool: Ensure DMA_DIRECT_REMAP allocations are decrypted
of: reserved_mem: Restructure call site for dma_contiguous_early_fixup()
Pull memblock fixes from Mike Rapoport:
- printk cleanups in memblock and numa_memblks
- update kernel-doc for MEMBLOCK_RSRV_NOINIT to be more accurate and
detailed
* tag 'fixes-2025-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
memblock: fix kernel-doc for MEMBLOCK_RSRV_NOINIT
mm: numa,memblock: Use SZ_1M macro to denote bytes to MB conversion
mm/numa_memblks: Use pr_debug instead of printk(KERN_DEBUG)
Pull powerpc fixes from Madhavan Srinivasan:
- Merge two CONFIG_POWERPC64_CPU entries in Kconfig.cputype
- Replace extra-y to always-y in Makefile
- Cleanup to use dev_fwnode helper
- Fix misleading comment in kvmppc_prepare_to_enter()
- misc cleanup and fixes
Thanks to Amit Machhiwal, Andrew Donnellan, Christophe Leroy, Gautam
Menghani, Jiri Slaby (SUSE), Masahiro Yamada, Shrikanth Hegde, Stephen
Rothwell, Venkat Rao Bagalkote, and Xichao Zhao
* tag 'powerpc-6.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/boot/install.sh: Fix shellcheck warnings
powerpc/prom_init: Fix shellcheck warnings
powerpc/kvm: Fix ifdef to remove build warning
powerpc: unify two CONFIG_POWERPC64_CPU entries in the same choice block
powerpc: use always-y instead of extra-y in Makefiles
powerpc/64: Drop unnecessary 'rc' variable
powerpc: Use dev_fwnode()
KVM: PPC: Fix misleading interrupts comment in kvmppc_prepare_to_enter()