Pull bpf updates from Alexei Starovoitov:
- Add BPF uprobe session support (Jiri Olsa)
- Optimize uprobe performance (Andrii Nakryiko)
- Add bpf_fastcall support to helpers and kfuncs (Eduard Zingerman)
- Avoid calling free_htab_elem() under hash map bucket lock (Hou Tao)
- Prevent tailcall infinite loop caused by freplace (Leon Hwang)
- Mark raw_tracepoint arguments as nullable (Kumar Kartikeya Dwivedi)
- Introduce uptr support in the task local storage map (Martin KaFai
Lau)
- Stringify errno log messages in libbpf (Mykyta Yatsenko)
- Add kmem_cache BPF iterator for perf's lock profiling (Namhyung Kim)
- Support BPF objects of either endianness in libbpf (Tony Ambardar)
- Add ksym to struct_ops trampoline to fix stack trace (Xu Kuohai)
- Introduce private stack for eligible BPF programs (Yonghong Song)
- Migrate samples/bpf tests to selftests/bpf test_progs (Daniel T. Lee)
- Migrate test_sock to selftests/bpf test_progs (Jordan Rife)
* tag 'bpf-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (152 commits)
libbpf: Change hash_combine parameters from long to unsigned long
selftests/bpf: Fix build error with llvm 19
libbpf: Fix memory leak in bpf_program__attach_uprobe_multi
bpf: use common instruction history across all states
bpf: Add necessary migrate_disable to range_tree.
bpf: Do not alloc arena on unsupported arches
selftests/bpf: Set test path for token/obj_priv_implicit_token_envvar
selftests/bpf: Add a test for arena range tree algorithm
bpf: Introduce range_tree data structure and use it in bpf arena
samples/bpf: Remove unused variable in xdp2skb_meta_kern.c
samples/bpf: Remove unused variables in tc_l2_redirect_kern.c
bpftool: Cast variable `var` to long long
bpf, x86: Propagate tailcall info only for subprogs
bpf: Add kernel symbol for struct_ops trampoline
bpf: Use function pointers count as struct_ops links count
bpf: Remove unused member rcu from bpf_struct_ops_map
selftests/bpf: Add struct_ops prog private stack tests
bpf: Support private stack for struct_ops progs
selftests/bpf: Add tracing prog private stack tests
bpf, x86: Support private stack in jit
...
Pull kgdb updates from Daniel Thompson:
"A relatively modest collection of changes:
- Adopt kstrtoint() and kstrtol() instead of the simple_strtoXX
family for better error checking of user input.
- Align the print behavour when breakpoints are enabled and disabled
by adopting the current behaviour of breakpoint disable for both.
- Remove some of the (rather odd and user hostile) hex fallbacks and
require kdb users to prefix with 0x instead.
- Tidy up (and fix) control code handling in kdb's keyboard code.
This makes the control code handling at the keyboard behave the
same way as it does via the UART.
- Switch my own entry in MAINTAINERS to my @kernel.org address"
* tag 'kgdb-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: fix ctrl+e/a/f/b/d/p/n broken in keyboard mode
MAINTAINERS: Use Daniel Thompson's korg address for kgdb work
kdb: Fix breakpoint enable to be silent if already enabled
kdb: Remove fallback interpretation of arbitrary numbers as hex
trace: kdb: Replace simple_strtoul with kstrtoul in kdb_ftdump
kdb: Replace the use of simple_strto with safer kstrto in kdb_main
Pull ftrace updates from Steven Rostedt:
- Restructure the function graph shadow stack to prepare it for use
with kretprobes
With the goal of merging the shadow stack logic of function graph and
kretprobes, some more restructuring of the function shadow stack is
required.
Move out function graph specific fields from the fgraph
infrastructure and store it on the new stack variables that can pass
data from the entry callback to the exit callback.
Hopefully, with this change, the merge of kretprobes to use fgraph
shadow stacks will be ready by the next merge window.
- Make shadow stack 4k instead of using PAGE_SIZE.
Some architectures have very large PAGE_SIZE values which make its
use for shadow stacks waste a lot of memory.
- Give shadow stacks its own kmem cache.
When function graph is started, every task on the system gets a
shadow stack. In the future, shadow stacks may not be 4K in size.
Have it have its own kmem cache so that whatever size it becomes will
still be efficient in allocations.
- Initialize profiler graph ops as it will be needed for new updates to
fgraph
- Convert to use guard(mutex) for several ftrace and fgraph functions
- Add more comments and documentation
- Show function return address in function graph tracer
Add an option to show the caller of a function at each entry of the
function graph tracer, similar to what the function tracer does.
- Abstract out ftrace_regs from being used directly like pt_regs
ftrace_regs was created to store a partial pt_regs. It holds only the
registers and stack information to get to the function arguments and
return values. On several archs, it is simply a wrapper around
pt_regs. But some users would access ftrace_regs directly to get the
pt_regs which will not work on all archs. Make ftrace_regs an
abstract structure that requires all access to its fields be through
accessor functions.
- Show how long it takes to do function code modifications
When code modification for function hooks happen, it always had the
time recorded in how long it took to do the conversion. But this
value was never exported. Recently the code was touched due to new
ROX modification handling that caused a large slow down in doing the
modifications and had a significant impact on boot times.
Expose the timings in the dyn_ftrace_total_info file. This file was
created a while ago to show information about memory usage and such
to implement dynamic function tracing. It's also an appropriate file
to store the timings of this modification as well. This will make it
easier to see the impact of changes to code modification on boot up
timings.
- Other clean ups and small fixes
* tag 'ftrace-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (22 commits)
ftrace: Show timings of how long nop patching took
ftrace: Use guard to take ftrace_lock in ftrace_graph_set_hash()
ftrace: Use guard to take the ftrace_lock in release_probe()
ftrace: Use guard to lock ftrace_lock in cache_mod()
ftrace: Use guard for match_records()
fgraph: Use guard(mutex)(&ftrace_lock) for unregister_ftrace_graph()
fgraph: Give ret_stack its own kmem cache
fgraph: Separate size of ret_stack from PAGE_SIZE
ftrace: Rename ftrace_regs_return_value to ftrace_regs_get_return_value
selftests/ftrace: Fix check of return value in fgraph-retval.tc test
ftrace: Use arch_ftrace_regs() for ftrace_regs_*() macros
ftrace: Consolidate ftrace_regs accessor functions for archs using pt_regs
ftrace: Make ftrace_regs abstract from direct use
fgragh: No need to invoke the function call_filter_check_discard()
fgraph: Simplify return address printing in function graph tracer
function_graph: Remove unnecessary initialization in ftrace_graph_ret_addr()
function_graph: Support recording and printing the function return address
ftrace: Have calltime be saved in the fgraph storage
ftrace: Use a running sleeptime instead of saving on shadow stack
fgraph: Use fgraph data to store subtime for profiler
...
Pull performance events updates from Ingo Molnar:
"Uprobes:
- Add BPF session support (Jiri Olsa)
- Switch to RCU Tasks Trace flavor for better performance (Andrii
Nakryiko)
- Massively increase uretprobe SMP scalability by SRCU-protecting
the uretprobe lifetime (Andrii Nakryiko)
- Kill xol_area->slot_count (Oleg Nesterov)
Core facilities:
- Implement targeted high-frequency profiling by adding the ability
for an event to "pause" or "resume" AUX area tracing (Adrian
Hunter)
VM profiling/sampling:
- Correct perf sampling with guest VMs (Colton Lewis)
New hardware support:
- x86/intel: Add PMU support for Intel ArrowLake-H CPUs (Dapeng Mi)
Misc fixes and enhancements:
- x86/intel/pt: Fix buffer full but size is 0 case (Adrian Hunter)
- x86/amd: Warn only on new bits set (Breno Leitao)
- x86/amd/uncore: Avoid a false positive warning about snprintf
truncation in amd_uncore_umc_ctx_init (Jean Delvare)
- uprobes: Re-order struct uprobe_task to save some space
(Christophe JAILLET)
- x86/rapl: Move the pmu allocation out of CPU hotplug (Kan Liang)
- x86/rapl: Clean up cpumask and hotplug (Kan Liang)
- uprobes: Deuglify xol_get_insn_slot/xol_free_insn_slot paths (Oleg
Nesterov)"
* tag 'perf-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
perf/core: Correct perf sampling with guest VMs
perf/x86: Refactor misc flag assignments
perf/powerpc: Use perf_arch_instruction_pointer()
perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()
perf/arm: Drop unused functions
uprobes: Re-order struct uprobe_task to save some space
perf/x86/amd/uncore: Avoid a false positive warning about snprintf truncation in amd_uncore_umc_ctx_init
perf/x86/intel: Do not enable large PEBS for events with aux actions or aux sampling
perf/x86/intel/pt: Add support for pause / resume
perf/core: Add aux_pause, aux_resume, aux_start_paused
perf/x86/intel/pt: Fix buffer full but size is 0 case
uprobes: SRCU-protect uretprobe lifetime (with timeout)
uprobes: allow put_uprobe() from non-sleepable softirq context
perf/x86/rapl: Clean up cpumask and hotplug
perf/x86/rapl: Move the pmu allocation out of CPU hotplug
uprobe: Add support for session consumer
uprobe: Add data pointer to consumer handlers
perf/x86/amd: Warn only on new bits set
uprobes: fold xol_take_insn_slot() into xol_get_insn_slot()
uprobes: kill xol_area->slot_count
...
Pull ring buffer fixes from Steven Rostedt:
- Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU
hotplug"
A crash that happened on cpu hotplug was actually caused by the
incorrect ref counting that was fixed by commit 2cf9733891
("ring-buffer: Fix refcount setting of boot mapped buffers"). The
removal of calling cpu hotplug callbacks on memory mapped buffers was
not an issue even though the tests at the time pointed toward it. But
in fact, there's a check in that code that tests to see if the
buffers are already allocated or not, and will not allocate them
again if they are. Not calling the cpu hotplug callbacks ended up not
initializing the non boot CPU buffers.
Simply remove that change.
- Clear all CPU buffers when starting tracing in a boot mapped buffer
To properly process events from a previous boot, the address space
needs to be accounted for due to KASLR and the events in the buffer
are updated accordingly when read. This also requires that when the
buffer has tracing enabled again in the current boot that the buffers
are reset so that events from the previous boot do not interact with
the events of the current boot and cause confusing due to not having
the proper meta data.
It was found that if a CPU is taken offline, that its per CPU buffer
is not reset when tracing starts. This allows for events to be from
both the previous boot and the current boot to be in the buffer at
the same time. Clear all CPU buffers when tracing is started in a
boot mapped buffer.
* tag 'trace-ringbuffer-v6.12-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/ring-buffer: Clear all memory mapped CPU ring buffers on first recording
Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug"
The events of a memory mapped ring buffer from the previous boot should
not be mixed in with events from the current boot. There's meta data that
is used to handle KASLR so that function names can be shown properly.
Also, since the timestamps of the previous boot have no meaning to the
timestamps of the current boot, having them intermingled in a buffer can
also cause confusion because there could possibly be events in the future.
When a trace is activated the meta data is reset so that the pointers of
are now processed for the new address space. The trace buffers are reset
when tracing starts for the first time. The problem here is that the reset
only happens on online CPUs. If a CPU is offline, it does not get reset.
To demonstrate the issue, a previous boot had tracing enabled in the boot
mapped ring buffer on reboot. On the following boot, tracing has not been
started yet so the function trace from the previous boot is still visible.
# trace-cmd show -B boot_mapped -c 3 | tail
<idle>-0 [003] d.h2. 156.462395: __rcu_read_lock <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: vmx_emergency_disable_virtualization_cpu <-cpu_emergency_disable_virtualization
<idle>-0 [003] d.h2. 156.462396: __rcu_read_unlock <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: stop_this_cpu <-__sysvec_reboot
<idle>-0 [003] d.h2. 156.462397: set_cpu_online <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462397: disable_local_APIC <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462398: clear_local_APIC <-disable_local_APIC
<idle>-0 [003] d.h2. 156.462574: mcheck_cpu_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: mce_intel_feature_clear <-stop_this_cpu
<idle>-0 [003] d.h2. 156.462575: lmce_supported <-mce_intel_feature_clear
Now, if CPU 3 is taken offline, and tracing is started on the memory
mapped ring buffer, the events from the previous boot in the CPU 3 ring
buffer is not reset. Now those events are using the meta data from the
current boot and produces just hex values.
# echo 0 > /sys/devices/system/cpu/cpu3/online
# trace-cmd start -B boot_mapped -p function
# trace-cmd show -B boot_mapped -c 3 | tail
<idle>-0 [003] d.h2. 156.462395: 0xffffffff9a1e3194 <-0xffffffff9a0f655e
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a0a1d24 <-0xffffffff9a0f656f
<idle>-0 [003] d.h2. 156.462396: 0xffffffff9a1e6bc4 <-0xffffffff9a0f7323
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0d12b4 <-0xffffffff9a0f732a
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a1458d4 <-0xffffffff9a0d12e2
<idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0faed4 <-0xffffffff9a0d12e7
<idle>-0 [003] d.h2. 156.462398: 0xffffffff9a0faaf4 <-0xffffffff9a0faef2
<idle>-0 [003] d.h2. 156.462574: 0xffffffff9a0e3444 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e4964 <-0xffffffff9a0d12ef
<idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e3fb0 <-0xffffffff9a0e496f
Reset all CPUs when starting a boot mapped ring buffer for the first time,
and not just the online CPUs.
Fixes: 7a1d1e4b96 ("tracing/ring-buffer: Add last_boot_info file to boot instance")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A crash happened when testing cpu hotplug with respect to the memory
mapped ring buffers. It was assumed that the hot plug code was adding a
per CPU buffer that was already created that caused the crash. The real
problem was due to ref counting and was fixed by commit 2cf9733891
("ring-buffer: Fix refcount setting of boot mapped buffers").
When a per CPU buffer is created, it will not be created again even with
CPU hotplug, so the fix to not use CPU hotplug was a red herring. In fact,
it caused only the boot CPU buffer to be created, leaving the other CPU
per CPU buffers disabled.
Revert that change as it was not the culprit of the fix it was intended to
be.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241113230839.6c03640f@gandalf.local.home
Fixes: 912da2c384 ("ring-buffer: Do not have boot mapped buffers hook to CPU hotplug")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cross-merge bpf fixes after downstream PR.
In particular to bring the fix in
commit aa30eb3260 ("bpf: Force checkpoint when jmp history is too long").
The follow up verifier work depends on it.
And the fix in
commit 6801cf7890 ("selftests/bpf: Use -4095 as the bad address for bits iterator").
It's fixing instability of BPF CI on s390 arch.
No conflicts.
Adjacent changes in:
Auto-merging arch/Kconfig
Auto-merging kernel/bpf/helpers.c
Auto-merging kernel/bpf/memalloc.c
Auto-merging kernel/bpf/verifier.c
Auto-merging mm/slab_common.c
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Adding support to attach BPF program for entry and return probe
of the same function. This is common use case which at the moment
requires to create two uprobe multi links.
Adding new BPF_TRACE_UPROBE_SESSION attach type that instructs
kernel to attach single link program to both entry and exit probe.
It's possible to control execution of the BPF program on return
probe simply by returning zero or non zero from the entry BPF
program execution to execute or not the BPF program on return
probe respectively.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org
Since the beginning of ftrace, the code that did the patching had its
timings saved on how long it took to complete. But this information was
never exposed. It was used for debugging and exposing it was always
something that was on the TODO list. Now it's time to expose it. There's
even a file that is where it should go!
Also include how long patching modules took as a separate value.
# cat /sys/kernel/tracing/dyn_ftrace_total_info
57680 pages:231 groups: 9
ftrace boot update time = 14024666 (ns)
ftrace module total update time = 126070 (ns)
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/20241017113105.1edfa943@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The ret_stack (shadow stack used by function graph infrastructure) is
created for every task on the system when function graph is enabled. Give
it its own kmem_cache. This will make it easier to see how much memory is
being used specifically for function graph shadow stacks.
In the future, this size may change and may not be a power of two. Having
its own cache can also keep it from fragmenting memory.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/20241026063210.7d4910a7@rorschach.local.home
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
In order to modify the code that allocates the shadow stacks, merge the
changes that fixed the CPU hotplug shadow stack allocations and build on
top of that.
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull ftrace fixes from Steven Rostedt:
- Fix missing mutex unlock in error path of register_ftrace_graph()
A previous fix added a return on an error path and forgot to unlock
the mutex. Instead of dealing with error paths, use guard(mutex) as
the mutex is just released at the exit of the function anyway. Other
functions in this file should be updated with this, but that's a
cleanup and not a fix.
- Change cpuhp setup name to be consistent with other cpuhp states
The same fix that the above patch fixes added a cpuhp_setup_state()
call with the name of "fgraph_idle_init". I was informed that it
should instead be something like: "fgraph:online". Update that too.
* tag 'ftrace-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fgraph: Change the name of cpuhp state to "fgraph:online"
fgraph: Fix missing unlock in register_ftrace_graph()
Pull bpf fixes from Daniel Borkmann:
- Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap
link file descriptors (Hou Tao)
- Fix BPF arm64 JIT's address emission with tag-based KASAN enabled
reserving not enough size (Peter Collingbourne)
- Fix BPF verifier do_misc_fixups patching for inlining of the
bpf_get_branch_snapshot BPF helper (Andrii Nakryiko)
- Fix a BPF verifier bug and reject BPF program write attempts into
read-only marked BPF maps (Daniel Borkmann)
- Fix perf_event_detach_bpf_prog error handling by removing an invalid
check which would skip BPF program release (Jiri Olsa)
- Fix memory leak when parsing mount options for the BPF filesystem
(Hou Tao)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Check validity of link->type in bpf_link_show_fdinfo()
bpf: Add the missing BPF_LINK_TYPE invocation for sockmap
bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()
bpf,perf: Fix perf_event_detach_bpf_prog error handling
selftests/bpf: Add test for passing in uninit mtu_len
selftests/bpf: Add test for writes to .rodata
bpf: Remove MEM_UNINIT from skb/xdp MTU helpers
bpf: Fix overloading of MEM_UNINIT's meaning
bpf: Add MEM_WRITE attribute
bpf: Preserve param->string when parsing mount options
bpf, arm64: Fix address emission with tag-based KASAN enabled
strlen() returns a string length excluding the null byte. If the string
length equals to the maximum buffer length, the buffer will have no
space for the NULL terminating character.
This commit checks this condition and returns failure for it.
Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/
Fixes: dec65d79fd ("tracing/probe: Check event name length correctly")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
When creating a trace_probe we would set nr_args prior to truncating the
arguments to MAX_TRACE_ARGS. However, we would only initialize arguments
up to the limit.
This caused invalid memory access when attempting to set up probes with
more than 128 fetchargs.
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:__set_print_fmt+0x134/0x330
Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return
an error when there are too many arguments instead of silently
truncating.
Link: https://lore.kernel.org/all/20240930202656.292869-1-mikel@mikelr.com/
Fixes: 035ba76014 ("tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init")
Signed-off-by: Mikel Rychliski <mikel@mikelr.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Add a MEM_WRITE attribute for BPF helper functions which can be used in
bpf_func_proto to annotate an argument type in order to let the verifier
know that the helper writes into the memory passed as an argument. In
the past MEM_UNINIT has been (ab)used for this function, but the latter
merely tells the verifier that the passed memory can be uninitialized.
There have been bugs with overloading the latter but aside from that
there are also cases where the passed memory is read + written which
currently cannot be expressed, see also 4b3786a6c5 ("bpf: Zero former
ARG_PTR_TO_{LONG,INT} args in case of error").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Implement bpf_send_signal_task kfunc that is similar to
bpf_send_signal_thread and bpf_send_signal helpers but can be used to
send signals to other threads and processes. It also supports sending a
cookie with the signal similar to sigqueue().
If the receiving process establishes a handler for the signal using the
SA_SIGINFO flag to sigaction(), then it can obtain this cookie via the
si_value field of the siginfo_t structure passed as the second argument
to the handler.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241016084136.10305-2-puranjay@kernel.org
Pull scheduling fixes from Borislav Petkov:
- Add PREEMPT_RT maintainers
- Fix another aspect of delayed dequeued tasks wrt determining their
state, i.e., whether they're runnable or blocked
- Handle delayed dequeued tasks and their migration wrt PSI properly
- Fix the situation where a delayed dequeue task gets enqueued into a
new class, which should not happen
- Fix a case where memory allocation would happen while the runqueue
lock is held, which is a no-no
- Do not over-schedule when tasks with shorter slices preempt the
currently running task
- Make sure delayed to deque entities are properly handled before
unthrottling
- Other smaller cleanups and improvements
* tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Add an entry for PREEMPT_RT.
sched/fair: Fix external p->on_rq users
sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
sched/core: Dequeue PSI signals for blocked tasks that are delayed
sched: Fix delayed_dequeue vs switched_from_fair()
sched/core: Disable page allocation in task_tick_mm_cid()
sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
sched: Fix sched_delayed vs cfs_bandwidth
Pull ftrace fixes from Steven Rostedt:
"A couple of fixes to function graph infrastructure:
- Fix allocation of idle shadow stack allocation during hotplug
If function graph tracing is started when a CPU is offline, if it
were come online during the trace then the idle task that
represents the CPU will not get a shadow stack allocated for it.
This means all function graph hooks that happen while that idle
task is running (including in interrupt mode) will have all its
events dropped.
Switch over to the CPU hotplug mechanism that will have any newly
brought on line CPU get a callback that can allocate the shadow
stack for its idle task.
- Fix allocation size of the ret_stack_list array
When function graph tracing converted over to allowing more than
one user at a time, it had to convert its shadow stack from an
array of ret_stack structures to an array of unsigned longs. The
shadow stacks are allocated in batches of 32 at a time and assigned
to every running task. The batch is held by the ret_stack_list
array.
But when the conversion happened, instead of allocating an array of
32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE).
This ret_stack_list gets passed to a function that iterates over
what it believes is its size defined by the
FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32).
Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise
this would have been an array overflow. This still should be fixed
and the ret_stack_list should be allocated to the size it is
expected to be as someday it may end up being bigger than
SHADOW_STACK_SIZE"
* tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fgraph: Allocate ret_stack_list with proper size
fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks
The ret_stack_list is an array of ret_stack shadow stacks for the function
graph usage. When the first function graph is enabled, all tasks in the
system get a shadow stack. The ret_stack_list is a 32 element array of
pointers to these shadow stacks. It allocates the shadow stack in batches
(32 stacks at a time), assigns them to running tasks, and continues until
all tasks are covered.
When the function graph shadow stack changed from an array of
ftrace_ret_stack structures to an array of longs, the allocation of
ret_stack_list went from allocating an array of 32 elements to just a
block defined by SHADOW_STACK_SIZE. Luckily, that's defined as PAGE_SIZE
and is much more than enough to hold 32 pointers. But it is way overkill
for the amount needed to allocate.
Change the allocation of ret_stack_list back to a kcalloc() of
FTRACE_RETSTACK_ALLOC_SIZE pointers.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241018215212.23f13f40@rorschach
Fixes: 42675b723b ("function_graph: Convert ret_stack to a series of longs")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The function graph infrastructure allocates a shadow stack for every task
when enabled. This includes the idle tasks. The first time the function
graph is invoked, the shadow stacks are created and never freed until the
task exits. This includes the idle tasks.
Only the idle tasks that were for online CPUs had their shadow stacks
created when function graph tracing started. If function graph tracing is
enabled and a CPU comes online, the idle task representing that CPU will
not have its shadow stack created, and all function graph tracing for that
idle task will be silently dropped.
Instead, use the CPU hotplug mechanism to allocate the idle shadow stacks.
This will include idle tasks for CPUs that come online during tracing.
This issue can be reproduced by:
# cd /sys/kernel/tracing
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > set_ftrace_pid
# echo function_graph > current_tracer
# echo 1 > options/funcgraph-proc
# echo 1 > /sys/devices/system/cpu/cpu1
# grep '<idle>' per_cpu/cpu1/trace | head
Before, nothing would show up.
After:
1) <idle>-0 | 0.811 us | __enqueue_entity();
1) <idle>-0 | 5.626 us | } /* enqueue_entity */
1) <idle>-0 | | dl_server_update_idle_time() {
1) <idle>-0 | | dl_scaled_delta_exec() {
1) <idle>-0 | 0.450 us | arch_scale_cpu_capacity();
1) <idle>-0 | 1.242 us | }
1) <idle>-0 | 1.908 us | }
1) <idle>-0 | | dl_server_start() {
1) <idle>-0 | | enqueue_dl_entity() {
1) <idle>-0 | | task_contending() {
Note, if tracing stops and restarts, the old way would then initialize
the onlined CPUs.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20241018214300.6df82178@rorschach
Fixes: 868baf07b1 ("ftrace: Fix memory leak with function graph and cpu hotplug")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull bpf fixes from Daniel Borkmann:
- Fix BPF verifier to not affect subreg_def marks in its range
propagation (Eduard Zingerman)
- Fix a truncation bug in the BPF verifier's handling of
coerce_reg_to_size_sx (Dimitar Kanaliev)
- Fix the BPF verifier's delta propagation between linked registers
under 32-bit addition (Daniel Borkmann)
- Fix a NULL pointer dereference in BPF devmap due to missing rxq
information (Florian Kauer)
- Fix a memory leak in bpf_core_apply (Jiri Olsa)
- Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for
arrays of nested structs (Hou Tao)
- Fix build ID fetching where memory areas backing the file were
created with memfd_secret (Andrii Nakryiko)
- Fix BPF task iterator tid filtering which was incorrectly using pid
instead of tid (Jordan Rome)
- Several fixes for BPF sockmap and BPF sockhash redirection in
combination with vsocks (Michal Luczaj)
- Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri)
- Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility
of an infinite BPF tailcall (Pu Lehui)
- Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot
be resolved (Thomas Weißschuh)
- Fix a bug in kfunc BTF caching for modules where the wrong BTF object
was returned (Toke Høiland-Jørgensen)
- Fix a BPF selftest compilation error in cgroup-related tests with
musl libc (Tony Ambardar)
- Several fixes to BPF link info dumps to fill missing fields (Tyrone
Wu)
- Add BPF selftests for kfuncs from multiple modules, checking that the
correct kfuncs are called (Simon Sundberg)
- Ensure that internal and user-facing bpf_redirect flags don't overlap
(Toke Høiland-Jørgensen)
- Switch to use kvzmalloc to allocate BPF verifier environment (Rik van
Riel)
- Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat
under RT (Wander Lairson Costa)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits)
lib/buildid: Handle memfd_secret() files in build_id_parse()
selftests/bpf: Add test case for delta propagation
bpf: Fix print_reg_state's constant scalar dump
bpf: Fix incorrect delta propagation between linked registers
bpf: Properly test iter/task tid filtering
bpf: Fix iter/task tid filtering
riscv, bpf: Make BPF_CMPXCHG fully ordered
bpf, vsock: Drop static vsock_bpf_prot initialization
vsock: Update msg_count on read_skb()
vsock: Update rx_bytes on read_skb()
bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock
selftests/bpf: Add asserts for netfilter link info
bpf: Fix link info netfilter flags to populate defrag flag
selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx()
selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx()
bpf: Fix truncation bug in coerce_reg_to_size_sx()
selftests/bpf: Assert link info uprobe_multi count & path_size if unset
bpf: Fix unpopulated path_size when uprobe_multi fields unset
selftests/bpf: Fix cross-compiling urandom_read
selftests/bpf: Add test for kfunc module order
...
Conflicts:
kernel/sched/ext.c
There's a context conflict between this upstream commit:
3fdb9ebcec sched_ext: Start schedulers with consistent p->scx.slice values
... and this fix in sched/urgent:
98442f0ccd sched: Fix delayed_dequeue vs switched_from_fair()
Resolve it.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A ring buffer which has its buffered mapped at boot up to fixed memory
should not be freed. Other buffers can be. The ref counting setup was
wrong for both. It made the not mapped buffers ref count have zero, and the
boot mapped buffer a ref count of 1. But an normally allocated buffer
should be 1, where it can be removed.
Keep the ref count of a normal boot buffer with its setup ref count (do
not decrement it), and increment the fixed memory boot mapped buffer's ref
count.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241011165224.33dd2624@gandalf.local.home
Fixes: e645535a95 ("tracing: Add option to use memmapped memory for trace boot instance")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Sean noted that ever since commit 152e11f6df ("sched/fair: Implement
delayed dequeue") KVM's preemption notifiers have started
mis-classifying preemption vs blocking.
Notably p->on_rq is no longer sufficient to determine if a task is
runnable or blocked -- the aforementioned commit introduces tasks that
remain on the runqueue even through they will not run again, and
should be considered blocked for many cases.
Add the task_is_runnable() helper to classify things and audit all
external users of the p->on_rq state. Also add a few comments.
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
Previously when retrieving `bpf_link_info.uprobe_multi` with `path` and
`path_size` fields unset, the `path_size` field is not populated
(remains 0). This behavior was inconsistent with how other input/output
string buffer fields work, as the field should be populated in cases
when:
- both buffer and length are set (currently works as expected)
- both buffer and length are unset (not working as expected)
This patch now fills the `path_size` field when `path` and `path_size`
are unset.
Fixes: e56fdbfb06 ("bpf: Add link_info support for uprobe multi link")
Signed-off-by: Tyrone Wu <wudevelops@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241011000803.681190-1-wudevelops@gmail.com
The boot mapped ring buffer has its buffer mapped at a fixed location
found at boot up. It is not dynamic. It cannot grow or be expanded when
new CPUs come online.
Do not hook fixed memory mapped ring buffers to the CPU hotplug callback,
otherwise it can cause a crash when it tries to add the buffer to the
memory that is already fully occupied.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241008143242.25e20801@gandalf.local.home
Fixes: be68d63a13 ("ring-buffer: Add ring_buffer_alloc_range()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>