Commit Graph

45236 Commits

Author SHA1 Message Date
Peter Oberparleiter
c1558bc57b gcov: add support for GCC 14
Using gcov on kernels compiled with GCC 14 results in truncated 16-byte
long .gcda files with no usable data.  To fix this, update GCOV_COUNTERS
to match the value defined by GCC 14.

Tested with GCC versions 14.1.0 and 13.2.0.

Link: https://lkml.kernel.org/r/20240610092743.1609845-1-oberpar@linux.ibm.com
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reported-by: Allison Henderson <allison.henderson@oracle.com>
Reported-by: Chuck Lever III <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-15 10:43:06 -07:00
Oleg Nesterov
7fea700e04 zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
kernel_wait4() doesn't sleep and returns -EINTR if there is no
eligible child and signal_pending() is true.

That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
return false and avoid a busy-wait loop.

Link: https://lkml.kernel.org/r/20240608120616.GB7947@redhat.com
Fixes: 12db8b6900 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Rachel Menge <rachelmenge@linux.microsoft.com>
Closes: https://lore.kernel.org/all/1386cd49-36d0-4a5c-85e9-bc42056a5a38@linux.microsoft.com/
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Tested-by: Wei Fu <fuweid89@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Allen Pais <apais@linux.microsoft.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joel Granados <j.granados@samsung.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-15 10:43:06 -07:00
Kees Cook
51104c19d8 kunit: test: Add vm_mmap() allocation resource manager
For tests that need to allocate using vm_mmap() (e.g. usercopy and
execve), provide the interface to have the allocation tracked by KUnit
itself. This requires bringing up a placeholder userspace mm.

This combines my earlier attempt at this with Mark Rutland's version[1].

Normally alloc_mm() and arch_pick_mmap_layout() aren't exported for
modules, so export these only for KUnit testing.

Link: https://lore.kernel.org/lkml/20230321122514.1743889-2-mark.rutland@arm.com/ [1]
Co-developed-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2024-06-14 19:31:33 -06:00
Jakub Kicinski
c64da10adb Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2024-06-14

We've added 8 non-merge commits during the last 2 day(s) which contain
a total of 9 files changed, 92 insertions(+), 11 deletions(-).

The main changes are:

1) Silence a syzkaller splat under CONFIG_DEBUG_NET=y in pskb_pull_reason()
   triggered via __bpf_try_make_writable(), from Florian Westphal.

2) Fix removal of kfuncs during linking phase which then throws a kernel
   build warning via resolve_btfids about unresolved symbols,
   from Tony Ambardar.

3) Fix a UML x86_64 compilation failure from BPF as pcpu_hot symbol
   is not available on User Mode Linux, from Maciej Żenczykowski.

4) Fix a register corruption in reg_set_min_max triggering an invariant
   violation in BPF verifier, from Daniel Borkmann.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Harden __bpf_kfunc tag against linker kfunc removal
  compiler_types.h: Define __retain for __attribute__((__retain__))
  bpf: Avoid splat in pskb_pull_reason
  bpf: fix UML x86_64 compile failure
  selftests/bpf: Add test coverage for reg_set_min_max handling
  bpf: Reduce stack consumption in check_stack_write_fixed_off
  bpf: Fix reg_set_min_max corruption of fake_reg
  MAINTAINERS: mailmap: Update Stanislav's email address
====================

Link: https://lore.kernel.org/r/20240614203223.26500-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14 17:57:10 -07:00
Alexei Starovoitov
98d7ca374b bpf: Track delta between "linked" registers.
Compilers can generate the code
  r1 = r2
  r1 += 0x1
  if r2 < 1000 goto ...
  use knowledge of r2 range in subsequent r1 operations

So remember constant delta between r2 and r1 and update r1 after 'if' condition.

Unfortunately LLVM still uses this pattern for loops with 'can_loop' construct:
for (i = 0; i < 1000 && can_loop; i++)

The "undo" pass was introduced in LLVM
https://reviews.llvm.org/D121937
to prevent this optimization, but it cannot cover all cases.
Instead of fighting middle end optimizer in BPF backend teach the verifier
about this pattern.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240613013815.953-3-alexei.starovoitov@gmail.com
2024-06-14 21:52:39 +02:00
Steven Rostedt (Google)
63a8dfb889 function_graph: Add READ_ONCE() when accessing fgraph_array[]
In function_graph_enter() there's a loop that looks at fgraph_array[]
elements which are fgraph_ops. It first tests if it is a fgraph_stub op,
and if so skips it, as that's just there as a place holder. Then it checks
the fgraph_ops filters to see if the ops wants to trace the current
function.

But if the compiler reloads the fgraph_array[] after the check against
fgraph_stub, it could race with the fgraph_array[] being updated with the
fgraph_stub. That would cause the stub to be processed. But the stub has a
null "func_hash" field which will cause a NULL pointer dereference.

Add a READ_ONCE() so that the gops that is compared against the
fgraph_stub is also the gops that is processed later.

Link: https://lore.kernel.org/all/CA+G9fYsSVJQZH=nM=1cjTc94PgSnMF9y65BnOv6XSoCG_b6wmw@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240613095223.1f07e3a4@rorschach.local.home

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: cc60ee813b ("function_graph: Use static_call and branch to optimize entry function")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-06-14 12:37:08 -04:00
Vadim Fedorenko
65d6d61d25 bpf: crypto: make state and IV dynptr nullable
Some ciphers do not require state and IV buffer, but with current
implementation 0-sized dynptr is always needed. With adjustment to
verifier we can provide NULL instead of 0-sized dynptr. Make crypto
kfuncs ready for this.

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-3-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 16:33:04 -07:00
Vadim Fedorenko
a90797993a bpf: verifier: make kfuncs args nullalble
Some arguments to kfuncs might be NULL in some cases. But currently it's
not possible to pass NULL to any BTF structures because the check for
the suffix is located after all type checks. Move it to earlier place
to allow nullable args.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Link: https://lore.kernel.org/r/20240613211817.1551967-2-vadfed@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 16:33:04 -07:00
Jakub Kicinski
4c7d3d79c7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts, no adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-13 13:13:46 -07:00
GUO Zihua
9a95c5bfbf ima: Avoid blocking in RCU read-side critical section
A panic happens in ima_match_policy:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
PGD 42f873067 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 5 PID: 1286325 Comm: kubeletmonit.sh
Kdump: loaded Tainted: P
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
               BIOS 0.0.0 02/06/2015
RIP: 0010:ima_match_policy+0x84/0x450
Code: 49 89 fc 41 89 cf 31 ed 89 44 24 14 eb 1c 44 39
      7b 18 74 26 41 83 ff 05 74 20 48 8b 1b 48 3b 1d
      f2 b9 f4 00 0f 84 9c 01 00 00 <44> 85 73 10 74 ea
      44 8b 6b 14 41 f6 c5 01 75 d4 41 f6 c5 02 74 0f
RSP: 0018:ff71570009e07a80 EFLAGS: 00010207
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000200
RDX: ffffffffad8dc7c0 RSI: 0000000024924925 RDI: ff3e27850dea2000
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffffabfce739
R10: ff3e27810cc42400 R11: 0000000000000000 R12: ff3e2781825ef970
R13: 00000000ff3e2785 R14: 000000000000000c R15: 0000000000000001
FS:  00007f5195b51740(0000)
GS:ff3e278b12d40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 0000000626d24002 CR4: 0000000000361ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 ima_get_action+0x22/0x30
 process_measurement+0xb0/0x830
 ? page_add_file_rmap+0x15/0x170
 ? alloc_set_pte+0x269/0x4c0
 ? prep_new_page+0x81/0x140
 ? simple_xattr_get+0x75/0xa0
 ? selinux_file_open+0x9d/0xf0
 ima_file_check+0x64/0x90
 path_openat+0x571/0x1720
 do_filp_open+0x9b/0x110
 ? page_counter_try_charge+0x57/0xc0
 ? files_cgroup_alloc_fd+0x38/0x60
 ? __alloc_fd+0xd4/0x250
 ? do_sys_open+0x1bd/0x250
 do_sys_open+0x1bd/0x250
 do_syscall_64+0x5d/0x1d0
 entry_SYSCALL_64_after_hwframe+0x65/0xca

Commit c7423dbdbc ("ima: Handle -ESTALE returned by
ima_filter_rule_match()") introduced call to ima_lsm_copy_rule within a
RCU read-side critical section which contains kmalloc with GFP_KERNEL.
This implies a possible sleep and violates limitations of RCU read-side
critical sections on non-PREEMPT systems.

Sleeping within RCU read-side critical section might cause
synchronize_rcu() returning early and break RCU protection, allowing a
UAF to happen.

The root cause of this issue could be described as follows:
|	Thread A	|	Thread B	|
|			|ima_match_policy	|
|			|  rcu_read_lock	|
|ima_lsm_update_rule	|			|
|  synchronize_rcu	|			|
|			|    kmalloc(GFP_KERNEL)|
|			|      sleep		|
==> synchronize_rcu returns early
|  kfree(entry)		|			|
|			|    entry = entry->next|
==> UAF happens and entry now becomes NULL (or could be anything).
|			|    entry->action	|
==> Accessing entry might cause panic.

To fix this issue, we are converting all kmalloc that is called within
RCU read-side critical section to use GFP_ATOMIC.

Fixes: c7423dbdbc ("ima: Handle -ESTALE returned by ima_filter_rule_match()")
Cc: stable@vger.kernel.org
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Acked-by: John Johansen <john.johansen@canonical.com>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: fixed missing comment, long lines, !CONFIG_IMA_LSM_RULES case]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-06-13 14:26:50 -04:00
Maciej Żenczykowski
b99a95bc56 bpf: fix UML x86_64 compile failure
pcpu_hot (defined in arch/x86) is not available on user mode linux (ARCH=um)

Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Fixes: 1ae6921009 ("bpf: inline bpf_get_smp_processor_id() helper")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Link: https://lore.kernel.org/r/20240613173146.2524647-1-maze@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 11:24:45 -07:00
Daniel Xu
78746f93e9 bpf: Fix bpf_dynptr documentation comments
The function argument names were changed but the doc comment was not.
Fix htmldocs build warning by updating doc comments.

Fixes: cce4c40b96 ("bpf: treewide: Align kfunc signatures to prog point-of-view")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/d0b0eb05f91e12e5795966153b11998d3fc1d433.1718295425.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 11:18:43 -07:00
Daniel Borkmann
e73cd1cfc2 bpf: Reduce stack consumption in check_stack_write_fixed_off
The fake_reg moved into env->fake_reg given it consumes a lot of stack
space (120 bytes). Migrate the fake_reg in check_stack_write_fixed_off()
as well now that we have it.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20240613115310.25383-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 11:16:01 -07:00
Daniel Borkmann
9242480126 bpf: Fix reg_set_min_max corruption of fake_reg
Juan reported that after doing some changes to buzzer [0] and implementing
a new fuzzing strategy guided by coverage, they noticed the following in
one of the probes:

  [...]
  13: (79) r6 = *(u64 *)(r0 +0)         ; R0=map_value(ks=4,vs=8) R6_w=scalar()
  14: (b7) r0 = 0                       ; R0_w=0
  15: (b4) w0 = -1                      ; R0_w=0xffffffff
  16: (74) w0 >>= 1                     ; R0_w=0x7fffffff
  17: (5c) w6 &= w0                     ; R0_w=0x7fffffff R6_w=scalar(smin=smin32=0,smax=umax=umax32=0x7fffffff,var_off=(0x0; 0x7fffffff))
  18: (44) w6 |= 2                      ; R6_w=scalar(smin=umin=smin32=umin32=2,smax=umax=umax32=0x7fffffff,var_off=(0x2; 0x7ffffffd))
  19: (56) if w6 != 0x7ffffffd goto pc+1
  REG INVARIANTS VIOLATION (true_reg2): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0)
  REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0x7fffffff, 0x7ffffffd] s64=[0x7fffffff, 0x7ffffffd] u32=[0x7fffffff, 0x7ffffffd] s32=[0x7fffffff, 0x7ffffffd] var_off=(0x7fffffff, 0x0)
  REG INVARIANTS VIOLATION (false_reg2): const tnum out of sync with range bounds u64=[0x0, 0xffffffffffffffff] s64=[0x8000000000000000, 0x7fffffffffffffff] u32=[0x0, 0xffffffff] s32=[0x80000000, 0x7fffffff] var_off=(0x7fffffff, 0x0)
  19: R6_w=0x7fffffff
  20: (95) exit

  from 19 to 21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
  21: R0=0x7fffffff R6=scalar(smin=umin=smin32=umin32=2,smax=umax=smax32=umax32=0x7ffffffe,var_off=(0x2; 0x7ffffffd)) R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
  21: (14) w6 -= 2147483632             ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=14,var_off=(0x2; 0xfffffffd))
  22: (76) if w6 s>= 0xe goto pc+1      ; R6_w=scalar(smin=umin=umin32=2,smax=umax=0xffffffff,smin32=0x80000012,smax32=13,var_off=(0x2; 0xfffffffd))
  23: (95) exit

  from 22 to 24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
  24: R0=0x7fffffff R6_w=14 R7=map_ptr(ks=4,vs=8) R9=ctx() R10=fp0 fp-24=map_ptr(ks=4,vs=8) fp-40=mmmmmmmm
  24: (14) w6 -= 14                     ; R6_w=0
  [...]

What can be seen here is a register invariant violation on line 19. After
the binary-or in line 18, the verifier knows that bit 2 is set but knows
nothing about the rest of the content which was loaded from a map value,
meaning, range is [2,0x7fffffff] with var_off=(0x2; 0x7ffffffd). When in
line 19 the verifier analyzes the branch, it splits the register states
in reg_set_min_max() into the registers of the true branch (true_reg1,
true_reg2) and the registers of the false branch (false_reg1, false_reg2).

Since the test is w6 != 0x7ffffffd, the src_reg is a known constant.
Internally, the verifier creates a "fake" register initialized as scalar
to the value of 0x7ffffffd, and then passes it onto reg_set_min_max(). Now,
for line 19, it is mathematically impossible to take the false branch of
this program, yet the verifier analyzes it. It is impossible because the
second bit of r6 will be set due to the prior or operation and the
constant in the condition has that bit unset (hex(fd) == binary(1111 1101).

When the verifier first analyzes the false / fall-through branch, it will
compute an intersection between the var_off of r6 and of the constant. This
is because the verifier creates a "fake" register initialized to the value
of the constant. The intersection result later refines both registers in
regs_refine_cond_op():

  [...]
  t = tnum_intersect(tnum_subreg(reg1->var_off), tnum_subreg(reg2->var_off));
  reg1->var_off = tnum_with_subreg(reg1->var_off, t);
  reg2->var_off = tnum_with_subreg(reg2->var_off, t);
  [...]

Since the verifier is analyzing the false branch of the conditional jump,
reg1 is equal to false_reg1 and reg2 is equal to false_reg2, i.e. the reg2
is the "fake" register that was meant to hold a constant value. The resulting
var_off of the intersection says that both registers now hold a known value
of var_off=(0x7fffffff, 0x0) or in other words: this operation manages to
make the verifier think that the "constant" value that was passed in the
jump operation now holds a different value.

Normally this would not be an issue since it should not influence the true
branch, however, false_reg2 and true_reg2 are pointers to the same "fake"
register. Meaning, the false branch can influence the results of the true
branch. In line 24, the verifier assumes R6_w=0, but the actual runtime
value in this case is 1. The fix is simply not passing in the same "fake"
register location as inputs to reg_set_min_max(), but instead making a
copy. Moving the fake_reg into the env also reduces stack consumption by
120 bytes. With this, the verifier successfully rejects invalid accesses
from the test program.

  [0] https://github.com/google/buzzer

Fixes: 67420501e8 ("bpf: generalize reg_set_min_max() to handle non-const register comparisons")
Reported-by: Juan José López Jaimez <jjlopezjaimez@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20240613115310.25383-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-13 11:16:01 -07:00
Joel Granados
2f7c624892 locking: Remove superfluous sentinel element from kern_lockdep_table
This commit is part of a greater effort to remove all
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-06-13 10:50:52 +02:00
Daniel Xu
cce4c40b96 bpf: treewide: Align kfunc signatures to prog point-of-view
Previously, kfunc declarations in bpf_kfuncs.h (and others) used "user
facing" types for kfuncs prototypes while the actual kfunc definitions
used "kernel facing" types. More specifically: bpf_dynptr vs
bpf_dynptr_kern, __sk_buff vs sk_buff, and xdp_md vs xdp_buff.

It wasn't an issue before, as the verifier allows aliased types.
However, since we are now generating kfunc prototypes in vmlinux.h (in
addition to keeping bpf_kfuncs.h around), this conflict creates
compilation errors.

Fix this conflict by using "user facing" types in kfunc definitions.
This results in more casts, but otherwise has no additional runtime
cost.

Note, similar to 5b268d1ebc ("bpf: Have bpf_rdonly_cast() take a const
pointer"), we also make kfuncs take const arguments where appropriate in
order to make the kfunc more permissive.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/b58346a63a0e66bc9b7504da751b526b0b189a67.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-12 11:01:31 -07:00
Daniel Xu
ec209ad863 bpf: verifier: Relax caller requirements for kfunc projection type args
Currently, if a kfunc accepts a projection type as an argument (eg
struct __sk_buff *), the caller must exactly provide exactly the same
type with provable provenance.

However in practice, kfuncs that accept projection types _must_ cast to
the underlying type before use b/c projection type layouts are
completely made up. Thus, it is ok to relax the verifier rules around
implicit conversions.

We will use this functionality in the next commit when we align kfuncs
to user-facing types.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/e2c025cb09ccfd4af1ec9e18284dc3cecff7514d.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-12 11:01:31 -07:00
Daniel Xu
2b8dd87332 bpf: Make bpf_session_cookie() kfunc return long *
We will soon be generating kfunc prototypes from BTF. As part of that,
we need to align the manual signatures in bpf_kfuncs.h with the actual
kfunc definitions. There is currently a conflicting signature for
bpf_session_cookie() w.r.t. return type.

The original intent was to return long * and not __u64 *. You can see
evidence of that intent in a3a5113393 ("selftests/bpf: Add kprobe
session cookie test").

Fix conflict by changing kfunc definition.

Fixes: 5c919acef8 ("bpf: Add support for kprobe session cookie")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/7043e1c251ab33151d6e3830f8ea1902ed2604ac.1718207789.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-06-12 11:01:31 -07:00
Masami Hiramatsu (Google)
3eddb03196 tracing/kprobe: Remove cleanup code unrelated to selftest
This cleanup all kprobe events code is not related to the selftest
itself, and it can fail by the reason unrelated to this test.
If the test is successful, the generated events are cleaned up.
And if not, we cannot guarantee that the kprobe events will work
correctly. So, anyway, there is no need to clean it up.

Link: https://lore.kernel.org/all/171811265627.85078.16897867213512435822.stgit@devnote2/

Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-12 08:46:00 +09:00
Masami Hiramatsu (Google)
41051daa38 tracing/kprobe: Integrate test warnings into WARN_ONCE
Cleanup the redundant WARN_ON_ONCE(cond) + pr_warn(msg) into
WARN_ONCE(cond, msg). Also add some WARN_ONCE() for hitcount check.
These WARN_ONCE() errors makes it easy to handle errors from ktest.

Link: https://lore.kernel.org/all/171811264685.85078.8068819097047430463.stgit@devnote2/

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-12 08:44:44 +09:00
Jiri Olsa
ff474a78ce uprobe: Add uretprobe syscall to speed up return probe
Adding uretprobe syscall instead of trap to speed up return probe.

At the moment the uretprobe setup/path is:

  - install entry uprobe

  - when the uprobe is hit, it overwrites probed function's return address
    on stack with address of the trampoline that contains breakpoint
    instruction

  - the breakpoint trap code handles the uretprobe consumers execution and
    jumps back to original return address

This patch replaces the above trampoline's breakpoint instruction with new
ureprobe syscall call. This syscall does exactly the same job as the trap
with some more extra work:

  - syscall trampoline must save original value for rax/r11/rcx registers
    on stack - rax is set to syscall number and r11/rcx are changed and
    used by syscall instruction

  - the syscall code reads the original values of those registers and
    restore those values in task's pt_regs area

  - only caller from trampoline exposed in '[uprobes]' is allowed,
    the process will receive SIGILL signal otherwise

Even with some extra work, using the uretprobes syscall shows speed
improvement (compared to using standard breakpoint):

  On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)

  current:
    uretprobe-nop  :    1.498 ± 0.000M/s
    uretprobe-push :    1.448 ± 0.001M/s
    uretprobe-ret  :    0.816 ± 0.001M/s

  with the fix:
    uretprobe-nop  :    1.969 ± 0.002M/s  < 31% speed up
    uretprobe-push :    1.910 ± 0.000M/s  < 31% speed up
    uretprobe-ret  :    0.934 ± 0.000M/s  < 14% speed up

  On Amd (AMD Ryzen 7 5700U)

  current:
    uretprobe-nop  :    0.778 ± 0.001M/s
    uretprobe-push :    0.744 ± 0.001M/s
    uretprobe-ret  :    0.540 ± 0.001M/s

  with the fix:
    uretprobe-nop  :    0.860 ± 0.001M/s  < 10% speed up
    uretprobe-push :    0.818 ± 0.001M/s  < 10% speed up
    uretprobe-ret  :    0.578 ± 0.000M/s  <  7% speed up

The performance test spawns a thread that runs loop which triggers
uprobe with attached bpf program that increments the counter that
gets printed in results above.

The uprobe (and uretprobe) kind is determined by which instruction
is being patched with breakpoint instruction. That's also important
for uretprobes, because uprobe is installed for each uretprobe.

The performance test is part of bpf selftests:
  tools/testing/selftests/bpf/run_bench_uprobes.sh

Note at the moment uretprobe syscall is supported only for native
64-bit process, compat process still uses standard breakpoint.

Note that when shadow stack is enabled the uretprobe syscall returns
via iret, which is slower than return via sysret, but won't cause the
shadow stack violation.

Link: https://lore.kernel.org/all/20240611112158.40795-4-jolsa@kernel.org/

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-06-12 08:44:28 +09:00
Jiri Olsa
190fec72df uprobe: Wire up uretprobe system call
Wiring up uretprobe system call, which comes in following changes.
We need to do the wiring before, because the uretprobe implementation
needs the syscall number.

Note at the moment uretprobe syscall is supported only for native
64-bit process.

Link: https://lore.kernel.org/all/20240611112158.40795-3-jolsa@kernel.org/

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-06-12 08:44:27 +09:00
Masami Hiramatsu (Google)
3572bd5689 tracing: Build event generation tests only as modules
The kprobes and synth event generation test modules add events and lock
(get a reference) those event file reference in module init function,
and unlock and delete it in module exit function. This is because those
are designed for playing as modules.

If we make those modules as built-in, those events are left locked in the
kernel, and never be removed. This causes kprobe event self-test failure
as below.

[   97.349708] ------------[ cut here ]------------
[   97.353453] WARNING: CPU: 3 PID: 1 at kernel/trace/trace_kprobe.c:2133 kprobe_trace_self_tests_init+0x3f1/0x480
[   97.357106] Modules linked in:
[   97.358488] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 6.9.0-g699646734ab5-dirty #14
[   97.361556] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   97.363880] RIP: 0010:kprobe_trace_self_tests_init+0x3f1/0x480
[   97.365538] Code: a8 24 08 82 e9 ae fd ff ff 90 0f 0b 90 48 c7 c7 e5 aa 0b 82 e9 ee fc ff ff 90 0f 0b 90 48 c7 c7 2d 61 06 82 e9 8e fd ff ff 90 <0f> 0b 90 48 c7 c7 33 0b 0c 82 89 c6 e8 6e 03 1f ff 41 ff c7 e9 90
[   97.370429] RSP: 0000:ffffc90000013b50 EFLAGS: 00010286
[   97.371852] RAX: 00000000fffffff0 RBX: ffff888005919c00 RCX: 0000000000000000
[   97.373829] RDX: ffff888003f40000 RSI: ffffffff8236a598 RDI: ffff888003f40a68
[   97.375715] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[   97.377675] R10: ffffffff811c9ae5 R11: ffffffff8120c4e0 R12: 0000000000000000
[   97.379591] R13: 0000000000000001 R14: 0000000000000015 R15: 0000000000000000
[   97.381536] FS:  0000000000000000(0000) GS:ffff88807dcc0000(0000) knlGS:0000000000000000
[   97.383813] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.385449] CR2: 0000000000000000 CR3: 0000000002244000 CR4: 00000000000006b0
[   97.387347] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   97.389277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   97.391196] Call Trace:
[   97.391967]  <TASK>
[   97.392647]  ? __warn+0xcc/0x180
[   97.393640]  ? kprobe_trace_self_tests_init+0x3f1/0x480
[   97.395181]  ? report_bug+0xbd/0x150
[   97.396234]  ? handle_bug+0x3e/0x60
[   97.397311]  ? exc_invalid_op+0x1a/0x50
[   97.398434]  ? asm_exc_invalid_op+0x1a/0x20
[   97.399652]  ? trace_kprobe_is_busy+0x20/0x20
[   97.400904]  ? tracing_reset_all_online_cpus+0x15/0x90
[   97.402304]  ? kprobe_trace_self_tests_init+0x3f1/0x480
[   97.403773]  ? init_kprobe_trace+0x50/0x50
[   97.404972]  do_one_initcall+0x112/0x240
[   97.406113]  do_initcall_level+0x95/0xb0
[   97.407286]  ? kernel_init+0x1a/0x1a0
[   97.408401]  do_initcalls+0x3f/0x70
[   97.409452]  kernel_init_freeable+0x16f/0x1e0
[   97.410662]  ? rest_init+0x1f0/0x1f0
[   97.411738]  kernel_init+0x1a/0x1a0
[   97.412788]  ret_from_fork+0x39/0x50
[   97.413817]  ? rest_init+0x1f0/0x1f0
[   97.414844]  ret_from_fork_asm+0x11/0x20
[   97.416285]  </TASK>
[   97.417134] irq event stamp: 13437323
[   97.418376] hardirqs last  enabled at (13437337): [<ffffffff8110bc0c>] console_unlock+0x11c/0x150
[   97.421285] hardirqs last disabled at (13437370): [<ffffffff8110bbf1>] console_unlock+0x101/0x150
[   97.423838] softirqs last  enabled at (13437366): [<ffffffff8108e17f>] handle_softirqs+0x23f/0x2a0
[   97.426450] softirqs last disabled at (13437393): [<ffffffff8108e346>] __irq_exit_rcu+0x66/0xd0
[   97.428850] ---[ end trace 0000000000000000 ]---

And also, since we can not cleanup dynamic_event file, ftracetest are
failed too.

To avoid these issues, build these tests only as modules.

Link: https://lore.kernel.org/all/171811263754.85078.5877446624311852525.stgit@devnote2/

Fixes: 9fe41efaca ("tracing: Add synth event generation test module")
Fixes: 64836248dd ("tracing: Add kprobe event command generation test module")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-12 08:43:12 +09:00
Marilene A Garcia
9b5a45eb63 ftrace: Add missing kerneldoc parameters to unregister_ftrace_direct()
Add the description to the parameters addr and free_filters
of the function unregister_ftrace_direct().

Link: https://lore.kernel.org/linux-trace-kernel/20240606132520.1397567-1-marilene.agarcia@gmail.com

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com>
Signed-off-by: Marilene A Garcia <marilene.agarcia@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-11 11:18:24 -04:00
Steven Rostedt (Google)
5f7fb89a11 function_graph: Everyone uses HAVE_FUNCTION_GRAPH_RET_ADDR_PTR, remove it
All architectures that implement function graph also implements
HAVE_FUNCTION_GRAPH_RET_ADDR_PTR. Remove it, as it is no longer a
differentiator.

Link: https://lore.kernel.org/linux-trace-kernel/20240611031737.982047614@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-11 11:18:24 -04:00
Steven Rostedt (Google)
29c1c24a27 function_graph: Fix up ftrace_graph_ret_addr()
Yang Li sent a patch to fix the kerneldoc of ftrace_graph_ret_addr().
While reviewing it, I realized that the comments in the entire function
header needed a rewrite. When doing that, I realized that @idx parameter
was being ignored. Every time this was called by the unwinder, it would
start the loop at the top of the shadow stack and look for the matching
stack pointer. When it found it, it would return it. When the unwinder
asked for the next function, it would search from the beginning again.

In reality, it should start from where it left off. That was the reason
for the @idx parameter in the first place. The first time the unwinder
calls this function, the @idx pointer would contain zero. That would mean
to start from the top of the stack. The function was supposed to update
the @idx with the index where it found the return address, so that the
next time the unwinder calls this function it doesn't have to search
through the previous addresses it found (making it O(n^2)!).

This speeds up the unwinder's use of ftrace_graph_ret_addr() by an order
of magnitude.

Link: https://lore.kernel.org/linux-trace-kernel/20240610181746.656e3759@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20240611031737.821995106@goodmis.org

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reported-by: Yang Li <yang.lee@linux.alibaba.com>
Fixes: 7aa1eaef9f ("function_graph: Allow multiple users to attach to function graph")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-11 11:18:24 -04:00
Thomas Gleixner
9bc2ff871f jump_label: Simplify and clarify static_key_fast_inc_cpus_locked()
Make the code more obvious and add proper comments to avoid future head
scratching.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240610124406.548322963@linutronix.de
2024-06-11 11:25:24 +02:00
Thomas Gleixner
695ef79646 jump_label: Clarify condition in static_key_fast_inc_not_disabled()
The second part of

      if (v <= 0 || (v + 1) < 0)

is not immediately obvious that it acts as overflow protection.

Check explicitely for v == INT_MAX instead and add a proper comment how
this is used at the call sites.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240610124406.484973160@linutronix.de
2024-06-11 11:25:23 +02:00
Thomas Gleixner
83ab38ef0a jump_label: Fix concurrency issues in static_key_slow_dec()
The commit which tried to fix the concurrency issues of concurrent
static_key_slow_inc() failed to fix the equivalent issues
vs. static_key_slow_dec():

CPU0                     CPU1

static_key_slow_dec()
  static_key_slow_try_dec()

	key->enabled == 1
	val = atomic_fetch_add_unless(&key->enabled, -1, 1);
	if (val == 1)
	     return false;

  jump_label_lock();
  if (atomic_dec_and_test(&key->enabled)) {
     --> key->enabled == 0
   __jump_label_update()

			 static_key_slow_dec()
			   static_key_slow_try_dec()

			     key->enabled == 0
			     val = atomic_fetch_add_unless(&key->enabled, -1, 1);

			      --> key->enabled == -1 <- FAIL

There is another bug in that code, when there is a concurrent
static_key_slow_inc() which enables the key as that sets key->enabled to -1
so on the other CPU

	val = atomic_fetch_add_unless(&key->enabled, -1, 1);

will succeed and decrement to -2, which is invalid.

Cure all of this by replacing the atomic_fetch_add_unless() with a
atomic_try_cmpxchg() loop similar to static_key_fast_inc_not_disabled().

[peterz: add WARN_ON_ONCE for the -1 race]
Fixes: 4c5ea0a9cd ("locking/static_key: Fix concurrent static_key_slow_inc()")
Reported-by: Yue Sun <samsun1006219@gmail.com>
Reported-by: Xingwei Lee <xrivendell7@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240610124406.422897838@linutronix.de
2024-06-11 11:25:23 +02:00
Jakub Kicinski
b1156532bc Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2024-06-06

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 50 files changed, 1887 insertions(+), 527 deletions(-).

The main changes are:

1) Add a user space notification mechanism via epoll when a struct_ops
   object is getting detached/unregistered, from Kui-Feng Lee.

2) Big batch of BPF selftest refactoring for sockmap and BPF congctl
   tests, from Geliang Tang.

3) Add BTF field (type and string fields, right now) iterator support
   to libbpf instead of using existing callback-based approaches,
   from Andrii Nakryiko.

4) Extend BPF selftests for the latter with a new btf_field_iter
   selftest, from Alan Maguire.

5) Add new kfuncs for a generic, open-coded bits iterator,
   from Yafang Shao.

6) Fix BPF selftests' kallsyms_find() helper under kernels configured
   with CONFIG_LTO_CLANG_THIN, from Yonghong Song.

7) Remove a bunch of unused structs in BPF selftests,
   from David Alan Gilbert.

8) Convert test_sockmap section names into names understood by libbpf
   so it can deduce program type and attach type, from Jakub Sitnicki.

9) Extend libbpf with the ability to configure log verbosity
   via LIBBPF_LOG_LEVEL environment variable, from Mykyta Yatsenko.

10) Fix BPF selftests with regards to bpf_cookie and find_vma flakiness
    in nested VMs, from Song Liu.

11) Extend riscv32/64 JITs to introduce shift/add helpers to generate Zba
    optimization, from Xiao Wang.

12) Enable BPF programs to declare arrays and struct fields with kptr,
    bpf_rb_root, and bpf_list_head, from Kui-Feng Lee.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  selftests/bpf: Drop useless arguments of do_test in bpf_tcp_ca
  selftests/bpf: Use start_test in test_dctcp in bpf_tcp_ca
  selftests/bpf: Use start_test in test_dctcp_fallback in bpf_tcp_ca
  selftests/bpf: Add start_test helper in bpf_tcp_ca
  selftests/bpf: Use connect_to_fd_opts in do_test in bpf_tcp_ca
  libbpf: Auto-attach struct_ops BPF maps in BPF skeleton
  selftests/bpf: Add btf_field_iter selftests
  selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT
  libbpf: Remove callback-based type/string BTF field visitor helpers
  bpftool: Use BTF field iterator in btfgen
  libbpf: Make use of BTF field iterator in BTF handling code
  libbpf: Make use of BTF field iterator in BPF linker code
  libbpf: Add BTF field iterator
  selftests/bpf: Ignore .llvm.<hash> suffix in kallsyms_find()
  selftests/bpf: Fix bpf_cookie and find_vma in nested VM
  selftests/bpf: Test global bpf_list_head arrays.
  selftests/bpf: Test global bpf_rb_root arrays and fields in nested struct types.
  selftests/bpf: Test kptr arrays and kptrs in nested struct fields.
  bpf: limit the number of levels of a nested struct type.
  bpf: look into the types of the fields of a struct type recursively.
  ...
====================

Link: https://lore.kernel.org/r/20240606223146.23020-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10 18:02:14 -07:00
Steven Rostedt (Google)
4267fda4af function_graph: Make fgraph_update_pid_func() a stub for !DYNAMIC_FTRACE
When CONFIG_DYNAMIC_FTRACE is not set, the function
fgraph_update_pid_func() doesn't do anything. Currently, most of its logic
is within a "#ifdef CONFIG_DYNAMIC_FTRACE" block, but its variables were
declared outside that, and when DYNAMIC_FTRACE is not set, it produces
unused variable warnings.

Instead, just place it (and the helper function fgraph_pid_func()) within
the #ifdef block and have the header file use a empty stub function for
when DYNAMIC_FTRACE is not defined.

Link: https://lore.kernel.org/linux-trace-kernel/20240607094833.6a787d73@rorschach.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406071806.BRjaC5FF-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-10 18:08:23 -04:00
Julia Lawall
37c2277fad workqueue: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
Since SLOB was removed, it is not necessary to use call_rcu
when the callback only performs kmem_cache_free. Use
kfree_rcu() directly.

The changes were done using the following Coccinelle semantic patch.
This semantic patch is designed to ignore cases where the callback
function is used in another way.

// <smpl>
@r@
expression e;
local idexpression e2;
identifier cb,f;
position p;
@@

(
call_rcu(...,e2)
|
call_rcu(&e->f,cb@p)
)

@r1@
type T;
identifier x,r.cb;
@@

 cb(...) {
(
   kmem_cache_free(...);
|
   T x = ...;
   kmem_cache_free(...,x);
|
   T x;
   x = ...;
   kmem_cache_free(...,x);
)
 }

@s depends on r1@
position p != r.p;
identifier r.cb;
@@

 cb@p

@script:ocaml@
cb << r.cb;
p << s.p;
@@

Printf.eprintf "Other use of %s at %s:%d\n"
   cb (List.hd p).file (List.hd p).line

@depends on r1 && !s@
expression e;
identifier r.cb,f;
position r.p;
@@

- call_rcu(&e->f,cb@p)
+ kfree_rcu(e,f)

@r1a depends on !s@
type T;
identifier x,r.cb;
@@

- cb(...) {
(
-  kmem_cache_free(...);
|
-  T x = ...;
-  kmem_cache_free(...,x);
|
-  T x;
-  x = ...;
-  kmem_cache_free(...,x);
)
- }
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-10 10:30:48 -10:00
Oleg Nesterov
07c54cc598 tick/nohz_full: Don't abuse smp_call_function_single() in tick_setup_device()
After the recent commit 5097cbcb38 ("sched/isolation: Prevent boot crash
when the boot CPU is nohz_full") the kernel no longer crashes, but there is
another problem.

In this case tick_setup_device() calls tick_take_do_timer_from_boot() to
update tick_do_timer_cpu and this triggers the WARN_ON_ONCE(irqs_disabled)
in smp_call_function_single().

Kill tick_take_do_timer_from_boot() and just use WRITE_ONCE(), the new
comment explains why this is safe (thanks Thomas!).

Fixes: 08ae95f4fd ("nohz_full: Allow the boot CPU to be nohz_full")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240528122019.GA28794@redhat.com
Link: https://lore.kernel.org/all/20240522151742.GA10400@redhat.com
2024-06-10 20:18:13 +02:00
Linus Torvalds
7cec2e16cb Merge tag 'perf-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fix from Ingo Molnar:
 "Fix race between perf_event_free_task() and perf_event_release_kernel()
  that can result in missed wakeups and hung tasks"

* tag 'perf-urgent-2024-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix missing wakeup when waiting for context reference
2024-06-08 09:26:59 -07:00
Wenchao Hao
7ccc215187 workqueue: Clean code in alloc_and_link_pwqs()
wq->flags would not change, so it's not necessary to check if WQ_BH
is set in loop for_each_possible_cpu(), move define and set of pools
out of loop to simpliy the code.

Signed-off-by: Wenchao Hao <haowenchao22@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-06-07 06:27:04 -10:00
Steven Rostedt (Google)
2f6b884dfc function_graph: Rename BYTE_NUMBER to CHAR_NUMBER in selftests
The function_graph selftests checks various size variables to pass from
the entry of the function to the exit. It tests 1, 2, 4 and 8 byte words.
The 1 byte macro was called BYTE_NUMBER but that is used in the sh
architecture: arch/sh/include/asm/bitops-op32.h

Just rename the macro to CHAR_NUMBER.

Link: https://lore.kernel.org/linux-trace-kernel/20240606081846.4cb82dc4@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 47c3c70aa3 ("function_graph: Add selftest for passing local variables")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406061744.rZDXfRrG-lkp@intel.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:19 -04:00
Jiapeng Chong
9a2a3aab73 fgraph: Remove some unused functions
These functions are defined in the fgraph.c file, but not
called elsewhere, so delete these unused functions.

kernel/trace/fgraph.c:273:1: warning: unused function 'set_bitmap_bits'.
kernel/trace/fgraph.c:259:19: warning: unused function 'get_fgraph_type'.

Link: https://lore.kernel.org/linux-trace-kernel/20240606021053.27783-1-jiapeng.chong@linux.alibaba.com

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9289
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:19 -04:00
Tatsuya S
6c1f7f0aca ftrace: Hide one more entry in stack trace when ftrace_pid is enabled
On setting set_ftrace_pid, a extra entry generated by ftrace_pid_func()
is shown on stack trace(CONFIG_UNWINDER_FRAME_POINTER=y).

        [004] .....    68.459382: <stack trace>
 => 0xffffffffa00090af
 => ksys_read
 => __x64_sys_read
 => x64_sys_call
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

To resolve this issue, increment skip count
in function_stack_trace_call() if pids are set.

Link: https://lore.kernel.org/linux-trace-kernel/20240528032604.6813-3-tatsuya.s2862@gmail.com

Signed-off-by: Tatsuya S <tatsuya.s2862@gmail.com>
[ Rebased to current tree ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:18 -04:00
Steven Rostedt (Google)
4057fd2cdd function_graph: Do not update pid func if CONFIG_DYNAMIC_FTRACE not enabled
The ftrace subops is only defined if CONFIG_DYNAMIC_FTRACE is enabled. If
it is not, function tracing is extremely limited, and the subops in the
ftrace_ops structure is not defined (and will fail to compile). If
DYNAMIC_FTRACE is not enabled, then function graph filtering will not
work (as it shouldn't).

Link: https://lore.kernel.org/linux-trace-kernel/20240605202709.096020676@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: df3ec5da6a ("function_graph: Add pid tracing back to function graph tracer")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406051855.9VIYXbTB-lkp@intel.com/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:18 -04:00
Steven Rostedt (Google)
0c4d8cbb2c function_graph: Make fgraph_do_direct static key static
The static branch key "fgraph_do_direct" was not declared static but is
only used in one file. Change it to a static variable.

Link: https://lore.kernel.org/linux-trace-kernel/20240605202708.936515302@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: cc60ee813b ("function_graph: Use static_call and branch to optimize entry function")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406051711.dS1sQZ9n-lkp@intel.com/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:18 -04:00
Steven Rostedt (Google)
86b49970e0 ftrace: Fix prototypes for ftrace_startup/shutdown_subops()
The ftrace_startup_subops() was in the wrong header, and both functions
were not defined on !CONFIG_DYNAMIC_FTRACE.

Link: https://lore.kernel.org/linux-trace-kernel/20240605202708.773583114@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 5fccc7552c ("ftrace: Add subops logic to allow one ops to manage many")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406051524.a12JqLqx-lkp@intel.com/
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:18 -04:00
Steven Rostedt (Google)
0ddef5d601 ftrace: Assign RCU list variable with rcu_assign_ptr()
Use rcu_assign_ptr() to assign the list pointer as it is marked as RCU,
and this quiets the sparse warning:

   kernel/trace/ftrace.c:313:23: warning: incorrect type in assignment (different address spaces)
   kernel/trace/ftrace.c:313:23:    expected struct ftrace_ops [noderef] __rcu *
   kernel/trace/ftrace.c:313:23:    got struct ftrace_ops *

Link: https://lore.kernel.org/linux-trace-kernel/20240605202708.613471310@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:18 -04:00
Steven Rostedt (Google)
1f51ba905e ftrace: Assign ftrace_list_end to ftrace_ops_list type cast to RCU
Use a type cast to convert ftrace_list_end to RCU when assigning
ftrace_ops_list. This will quiet the sparse warning:

 kernel/trace/ftrace.c:125:59: warning: incorrect type in initializer (different address spaces)
 kernel/trace/ftrace.c:125:59:    expected struct ftrace_ops [noderef] __rcu *[addressable] [toplevel] ftrace_ops_list
 kernel/trace/ftrace.c:125:59:    got struct ftrace_ops *

Link: https://lore.kernel.org/linux-trace-kernel/20240605202708.450784356@goodmis.org

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:18 -04:00
Steven Rostedt (Google)
d66bb33479 ftrace: Add comments to ftrace_hash_move() and friends
Describe what ftrace_hash_move() does and add some more comments to some
other functions to make it easier to understand.

Link: https://lore.kernel.org/linux-trace-kernel/20240605180409.179520305@goodmis.org

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:22:03 -04:00
Steven Rostedt (Google)
1a88c07167 ftrace: Convert "inc" parameter to bool in ftrace_hash_rec_update_modify()
The parameter "inc" in the function ftrace_hash_rec_update_modify() is
boolean. Change it to be such.

Also add documentation to what the function does.

Link: https://lore.kernel.org/linux-trace-kernel/20240605180409.021080462@goodmis.org

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:21:28 -04:00
Steven Rostedt (Google)
da73f6d490 ftrace: Add comments to ftrace_hash_rec_disable/enable()
Add comments to describe what the functions ftrace_hash_rec_disable() and
ftrace_hash_rec_enable() do. Also change the passing of the "inc" variable
to __ftrace_hash_rec_update() to a boolean value as that is what it is
supposed to take.

Link: https://lore.kernel.org/linux-trace-kernel/20240605180408.857333430@goodmis.org

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:21:25 -04:00
Steven Rostedt (Google)
07bbe0833e ftrace: Remove "filter_hash" parameter from __ftrace_hash_rec_update()
While adding comments to the function __ftrace_hash_rec_update() and
trying to describe in detail what the parameter for "filter_hash" does, I
realized that it basically does exactly the same thing (but differently)
if it is set or not!

If it is set, the idea was the ops->filter_hash was being updated, and the
code should focus on the functions that are in the ops->filter_hash and
add them. But it still had to pay attention to the functions in the
ops->notrace_hash, to ignore them.

If it was cleared, it focused on the ops->notrace_hash, and would add
functions that were not in the ops->notrace_hash but would still keep
functions in the "ops->filter_hash". Basically doing the same thing.

In reality, the __ftrace_hash_rec_update() only needs to either remove the
functions associated to the give ops (if "inc" is set) or remove them (if
"inc" is cleared). It has to pay attention to both the filter_hash and
notrace_hash regardless.

Remove the "filter_hash" parameter from __filter_hash_rec_update() and
comment the function for what it really is doing.

Link: https://lore.kernel.org/linux-trace-kernel/20240605180408.691995506@goodmis.org

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-06-06 15:20:27 -04:00
Jakub Kicinski
62b5bf58b9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.

No conflicts.

Adjacent changes:

drivers/net/ethernet/pensando/ionic/ionic_txrx.c
  d9c0420999 ("ionic: Mark error paths in the data path as unlikely")
  491aee894a ("ionic: fix kernel panic in XDP_TX action")

net/ipv6/ip6_fib.c
  b4cb4a1391 ("net: use unrcu_pointer() helper")
  b01e1c0307 ("ipv6: fix possible race in __fib6_drop_pcpu_from()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06 12:06:56 -07:00
Frederic Weisbecker
399ced9594 rcu/tasks: Fix stale task snaphot for Tasks Trace
When RCU-TASKS-TRACE pre-gp takes a snapshot of the current task running
on all online CPUs, no explicit ordering synchronizes properly with a
context switch.  This lack of ordering can permit the new task to miss
pre-grace-period update-side accesses.  The following diagram, courtesy
of Paul, shows the possible bad scenario:

        CPU 0                                           CPU 1
        -----                                           -----

        // Pre-GP update side access
        WRITE_ONCE(*X, 1);
        smp_mb();
        r0 = rq->curr;
                                                        RCU_INIT_POINTER(rq->curr, TASK_B)
                                                        spin_unlock(rq)
                                                        rcu_read_lock_trace()
                                                        r1 = X;
        /* ignore TASK_B */

Either r0==TASK_B or r1==1 is needed but neither is guaranteed.

One possible solution to solve this is to wait for an RCU grace period
at the beginning of the RCU-tasks-trace grace period before taking the
current tasks snaphot. However this would introduce large additional
latencies to RCU-tasks-trace grace periods.

Another solution is to lock the target runqueue while taking the current
task snapshot. This ensures that the update side sees the latest context
switch and subsequent context switches will see the pre-grace-period
update side accesses.

This commit therefore adds runqueue locking to cpu_curr_snapshot().

Fixes: e386b67257 ("rcu-tasks: Eliminate RCU Tasks Trace IPIs to online CPUs")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-06 11:50:04 -07:00
Jeff Johnson
b9f147cdc2 rcutorture: Add missing MODULE_DESCRIPTION() macros
Fix the following 'make W=1' warnings:

WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcutorture.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/rcuscale.o
WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/rcu/refscale.o

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-06-06 11:44:42 -07:00