mirror of
https://github.com/torvalds/linux.git
synced 2025-12-07 20:06:24 +00:00
7aca0ac4792e6cb0f35ef97bfcb39b1663a92fb7
40451 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
40adaf51cb |
tracing/eprobe: Fix eprobe filter to make a filter correctly
Since the eprobe filter was defined based on the eprobe's trace event
itself, it doesn't work correctly. Use the original trace event of
the eprobe when making the filter so that the filter works correctly.
Without this fix:
# echo 'e syscalls/sys_enter_openat \
flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
# echo 1 > events/eprobes/sys_enter_openat/enable
[ 114.551550] event trace: Could not enable event sys_enter_openat
-bash: echo: write error: Invalid argument
With this fix:
# echo 'e syscalls/sys_enter_openat \
flags_rename=$flags:u32 if flags < 1000' >> dynamic_events
# echo 1 > events/eprobes/sys_enter_openat/enable
# tail trace
cat-241 [000] ...1. 266.498449: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
cat-242 [000] ...1. 266.977640: sys_enter_openat: (syscalls.sys_enter_openat) flags_rename=0
Link: https://lore.kernel.org/all/166823166395.1385292.8931770640212414483.stgit@devnote3/
Fixes:
|
||
|
|
342a4a2f99 |
tracing/eprobe: Fix warning in filter creation
The filter pointer (filterp) passed to create_filter() function must be a
pointer that references a NULL pointer, otherwise, we get a warning when
adding a filter option to the event probe:
root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core sched/sched_stat_runtime \
runtime=$runtime:u32 if cpu < 4' >> dynamic_events
[ 5034.340439] ------------[ cut here ]------------
[ 5034.341258] WARNING: CPU: 0 PID: 223 at kernel/trace/trace_events_filter.c:1939 create_filter+0x1db/0x250
[...] stripped
[ 5034.345518] RIP: 0010:create_filter+0x1db/0x250
[...] stripped
[ 5034.351604] Call Trace:
[ 5034.351803] <TASK>
[ 5034.351959] ? process_preds+0x1b40/0x1b40
[ 5034.352241] ? rcu_read_lock_bh_held+0xd0/0xd0
[ 5034.352604] ? kasan_set_track+0x29/0x40
[ 5034.352904] ? kasan_save_alloc_info+0x1f/0x30
[ 5034.353264] create_event_filter+0x38/0x50
[ 5034.353573] __trace_eprobe_create+0x16f4/0x1d20
[ 5034.353964] ? eprobe_dyn_event_release+0x360/0x360
[ 5034.354363] ? mark_held_locks+0xa6/0xf0
[ 5034.354684] ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 5034.355105] ? trace_hardirqs_on+0x41/0x120
[ 5034.355417] ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 5034.355751] ? __create_object+0x5b7/0xcf0
[ 5034.356027] ? lock_is_held_type+0xaf/0x120
[ 5034.356362] ? rcu_read_lock_bh_held+0xb0/0xd0
[ 5034.356716] ? rcu_read_lock_bh_held+0xd0/0xd0
[ 5034.357084] ? kasan_set_track+0x29/0x40
[ 5034.357411] ? kasan_save_alloc_info+0x1f/0x30
[ 5034.357715] ? __kasan_kmalloc+0xb8/0xc0
[ 5034.357985] ? write_comp_data+0x2f/0x90
[ 5034.358302] ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.358691] ? argv_split+0x381/0x460
[ 5034.358949] ? write_comp_data+0x2f/0x90
[ 5034.359240] ? eprobe_dyn_event_release+0x360/0x360
[ 5034.359620] trace_probe_create+0xf6/0x110
[ 5034.359940] ? trace_probe_match_command_args+0x240/0x240
[ 5034.360376] eprobe_dyn_event_create+0x21/0x30
[ 5034.360709] create_dyn_event+0xf3/0x1a0
[ 5034.360983] trace_parse_run_command+0x1a9/0x2e0
[ 5034.361297] ? dyn_event_release+0x500/0x500
[ 5034.361591] dyn_event_write+0x39/0x50
[ 5034.361851] vfs_write+0x311/0xe50
[ 5034.362091] ? dyn_event_seq_next+0x40/0x40
[ 5034.362376] ? kernel_write+0x5b0/0x5b0
[ 5034.362637] ? write_comp_data+0x2f/0x90
[ 5034.362937] ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.363258] ? ftrace_syscall_enter+0x544/0x840
[ 5034.363563] ? write_comp_data+0x2f/0x90
[ 5034.363837] ? __sanitizer_cov_trace_pc+0x25/0x60
[ 5034.364156] ? write_comp_data+0x2f/0x90
[ 5034.364468] ? write_comp_data+0x2f/0x90
[ 5034.364770] ksys_write+0x158/0x2a0
[ 5034.365022] ? __ia32_sys_read+0xc0/0xc0
[ 5034.365344] __x64_sys_write+0x7c/0xc0
[ 5034.365669] ? syscall_enter_from_user_mode+0x53/0x70
[ 5034.366084] do_syscall_64+0x60/0x90
[ 5034.366356] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 5034.366767] RIP: 0033:0x7ff0b43938f3
[...] stripped
[ 5034.371892] </TASK>
[ 5034.374720] ---[ end trace 0000000000000000 ]---
Link: https://lore.kernel.org/all/20221108202148.1020111-1-rafaelmendsr@gmail.com/
Fixes:
|
||
|
|
5dd7caf0bd |
kprobes: Skip clearing aggrprobe's post_handler in kprobe-on-ftrace case
In __unregister_kprobe_top(), if the currently unregistered probe has
post_handler but other child probes of the aggrprobe do not have
post_handler, the post_handler of the aggrprobe is cleared. If this is
a ftrace-based probe, there is a problem. In later calls to
disarm_kprobe(), we will use kprobe_ftrace_ops because post_handler is
NULL. But we're armed with kprobe_ipmodify_ops. This triggers a WARN in
__disarm_kprobe_ftrace() and may even cause use-after-free:
Failed to disarm kprobe-ftrace at kernel_clone+0x0/0x3c0 (error -2)
WARNING: CPU: 5 PID: 137 at kernel/kprobes.c:1135 __disarm_kprobe_ftrace.isra.21+0xcf/0xe0
Modules linked in: testKprobe_007(-)
CPU: 5 PID: 137 Comm: rmmod Not tainted 6.1.0-rc4-dirty #18
[...]
Call Trace:
<TASK>
__disable_kprobe+0xcd/0xe0
__unregister_kprobe_top+0x12/0x150
? mutex_lock+0xe/0x30
unregister_kprobes.part.23+0x31/0xa0
unregister_kprobe+0x32/0x40
__x64_sys_delete_module+0x15e/0x260
? do_user_addr_fault+0x2cd/0x6b0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
[...]
For the kprobe-on-ftrace case, we keep the post_handler setting to
identify this aggrprobe armed with kprobe_ipmodify_ops. This way we
can disarm it correctly.
Link: https://lore.kernel.org/all/20221112070000.35299-1-lihuafei1@huawei.com/
Fixes:
|
||
|
|
0a1ebe35cb |
rethook: fix a potential memleak in rethook_alloc()
In rethook_alloc(), the variable rh is not freed or passed out
if handler is NULL, which could lead to a memleak, fix it.
Link: https://lore.kernel.org/all/20221110104438.88099-1-yiyang13@huawei.com/
[Masami: Add "rethook:" tag to the title.]
Fixes:
|
||
|
|
d1776c0202 |
tracing/eprobe: Fix memory leak of filter string
The filter string doesn't get freed when a dynamic event is deleted. If a
filter is set, then memory is leaked:
root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core \
sched/sched_stat_runtime runtime=$runtime:u32 if cpu < 4' >> dynamic_events
root@localhost:/sys/kernel/tracing# echo "-:egroup/stat_runtime_4core" >> dynamic_events
root@localhost:/sys/kernel/tracing# echo scan > /sys/kernel/debug/kmemleak
[ 224.416373] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
root@localhost:/sys/kernel/tracing# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810156f1b8 (size 8):
comm "bash", pid 224, jiffies 4294935612 (age 55.800s)
hex dump (first 8 bytes):
63 70 75 20 3c 20 34 00 cpu < 4.
backtrace:
[<000000009f880725>] __kmem_cache_alloc_node+0x18e/0x720
[<0000000042492946>] __kmalloc+0x57/0x240
[<0000000034ea7995>] __trace_eprobe_create+0x1214/0x1d30
[<00000000d70ef730>] trace_probe_create+0xf6/0x110
[<00000000915c7b16>] eprobe_dyn_event_create+0x21/0x30
[<000000000d894386>] create_dyn_event+0xf3/0x1a0
[<00000000e9af57d5>] trace_parse_run_command+0x1a9/0x2e0
[<0000000080777f18>] dyn_event_write+0x39/0x50
[<0000000089f0ec73>] vfs_write+0x311/0xe50
[<000000003da1bdda>] ksys_write+0x158/0x2a0
[<00000000bb1e616e>] __x64_sys_write+0x7c/0xc0
[<00000000e8aef1f7>] do_syscall_64+0x60/0x90
[<00000000fe7fe8ba>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Additionally, in __trace_eprobe_create() function, if an error occurs after
the call to trace_eprobe_parse_filter(), which allocates the filter string,
then memory is also leaked. That can be reproduced by creating the same
event probe twice:
root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core \
sched/sched_stat_runtime runtime=$runtime:u32 if cpu < 4' >> dynamic_events
root@localhost:/sys/kernel/tracing# echo 'e:egroup/stat_runtime_4core \
sched/sched_stat_runtime runtime=$runtime:u32 if cpu < 4' >> dynamic_events
-bash: echo: write error: File exists
root@localhost:/sys/kernel/tracing# echo scan > /sys/kernel/debug/kmemleak
[ 207.871584] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
root@localhost:/sys/kernel/tracing# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8881020d17a8 (size 8):
comm "bash", pid 223, jiffies 4294938308 (age 31.000s)
hex dump (first 8 bytes):
63 70 75 20 3c 20 34 00 cpu < 4.
backtrace:
[<000000000e4f5f31>] __kmem_cache_alloc_node+0x18e/0x720
[<0000000024f0534b>] __kmalloc+0x57/0x240
[<000000002930a28e>] __trace_eprobe_create+0x1214/0x1d30
[<0000000028387903>] trace_probe_create+0xf6/0x110
[<00000000a80d6a9f>] eprobe_dyn_event_create+0x21/0x30
[<000000007168698c>] create_dyn_event+0xf3/0x1a0
[<00000000f036bf6a>] trace_parse_run_command+0x1a9/0x2e0
[<00000000014bde8b>] dyn_event_write+0x39/0x50
[<0000000078a097f7>] vfs_write+0x311/0xe50
[<00000000996cb208>] ksys_write+0x158/0x2a0
[<00000000a3c2acb0>] __x64_sys_write+0x7c/0xc0
[<0000000006b5d698>] do_syscall_64+0x60/0x90
[<00000000780e8ecf>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fix both issues by releasing the filter string in
trace_event_probe_cleanup().
Link: https://lore.kernel.org/all/20221108235738.1021467-1-rafaelmendsr@gmail.com/
Fixes:
|
||
|
|
22ea4ca963 |
tracing: kprobe: Fix potential null-ptr-deref on trace_array in kprobe_event_gen_test_exit()
When test_gen_kprobe_cmd() failed after kprobe_event_gen_cmd_end(), it
will goto delete, which will call kprobe_event_delete() and release the
corresponding resource. However, the trace_array in gen_kretprobe_test
will point to the invalid resource. Set gen_kretprobe_test to NULL
after called kprobe_event_delete() to prevent null-ptr-deref.
BUG: kernel NULL pointer dereference, address: 0000000000000070
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 246 Comm: modprobe Tainted: G W
6.1.0-rc1-00174-g9522dc5c87da-dirty #248
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:__ftrace_set_clr_event_nolock+0x53/0x1b0
Code: e8 82 26 fc ff 49 8b 1e c7 44 24 0c ea ff ff ff 49 39 de 0f 84 3c
01 00 00 c7 44 24 18 00 00 00 00 e8 61 26 fc ff 48 8b 6b 10 <44> 8b 65
70 4c 8b 6d 18 41 f7 c4 00 02 00 00 75 2f
RSP: 0018:ffffc9000159fe00 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88810971d268 RCX: 0000000000000000
RDX: ffff8881080be600 RSI: ffffffff811b48ff RDI: ffff88810971d058
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
R10: ffffc9000159fe58 R11: 0000000000000001 R12: ffffffffa0001064
R13: ffffffffa000106c R14: ffff88810971d238 R15: 0000000000000000
FS: 00007f89eeff6540(0000) GS:ffff88813b600000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000070 CR3: 000000010599e004 CR4: 0000000000330ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__ftrace_set_clr_event+0x3e/0x60
trace_array_set_clr_event+0x35/0x50
? 0xffffffffa0000000
kprobe_event_gen_test_exit+0xcd/0x10b [kprobe_event_gen_test]
__x64_sys_delete_module+0x206/0x380
? lockdep_hardirqs_on_prepare+0xd8/0x190
? syscall_enter_from_user_mode+0x1c/0x50
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89eeb061b7
Link: https://lore.kernel.org/all/20221108015130.28326-3-shangxiaojing@huawei.com/
Fixes:
|
||
|
|
e0d75267f5 |
tracing: kprobe: Fix potential null-ptr-deref on trace_event_file in kprobe_event_gen_test_exit()
When trace_get_event_file() failed, gen_kretprobe_test will be assigned
as the error code. If module kprobe_event_gen_test is removed now, the
null pointer dereference will happen in kprobe_event_gen_test_exit().
Check if gen_kprobe_test or gen_kretprobe_test is error code or NULL
before dereference them.
BUG: kernel NULL pointer dereference, address: 0000000000000012
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 3 PID: 2210 Comm: modprobe Not tainted
6.1.0-rc1-00171-g2159299a3b74-dirty #217
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:kprobe_event_gen_test_exit+0x1c/0xb5 [kprobe_event_gen_test]
Code: Unable to access opcode bytes at 0xffffffff9ffffff2.
RSP: 0018:ffffc900015bfeb8 EFLAGS: 00010246
RAX: ffffffffffffffea RBX: ffffffffa0002080 RCX: 0000000000000000
RDX: ffffffffa0001054 RSI: ffffffffa0001064 RDI: ffffffffdfc6349c
RBP: ffffffffa0000000 R08: 0000000000000004 R09: 00000000001e95c0
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000800
R13: ffffffffa0002420 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f56b75be540(0000) GS:ffff88813bc00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff9ffffff2 CR3: 000000010874a006 CR4: 0000000000330ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__x64_sys_delete_module+0x206/0x380
? lockdep_hardirqs_on_prepare+0xd8/0x190
? syscall_enter_from_user_mode+0x1c/0x50
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Link: https://lore.kernel.org/all/20221108015130.28326-2-shangxiaojing@huawei.com/
Fixes:
|
||
|
|
1b5f1c34d3 |
tracing: Fix wild-memory-access in register_synth_event()
In register_synth_event(), if set_synth_event_print_fmt() failed, then
both trace_remove_event_call() and unregister_trace_event() will be
called, which means the trace_event_call will call
__unregister_trace_event() twice. As the result, the second unregister
will causes the wild-memory-access.
register_synth_event
set_synth_event_print_fmt failed
trace_remove_event_call
event_remove
if call->event.funcs then
__unregister_trace_event (first call)
unregister_trace_event
__unregister_trace_event (second call)
Fix the bug by avoiding to call the second __unregister_trace_event() by
checking if the first one is called.
general protection fault, probably for non-canonical address
0xfbd59c0000000024: 0000 [#1] SMP KASAN PTI
KASAN: maybe wild-memory-access in range
[0xdead000000000120-0xdead000000000127]
CPU: 0 PID: 3807 Comm: modprobe Not tainted
6.1.0-rc1-00186-g76f33a7eedb4 #299
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
RIP: 0010:unregister_trace_event+0x6e/0x280
Code: 00 fc ff df 4c 89 ea 48 c1 ea 03 80 3c 02 00 0f 85 0e 02 00 00 48
b8 00 00 00 00 00 fc ff df 4c 8b 63 08 4c 89 e2 48 c1 ea 03 <80> 3c 02
00 0f 85 e2 01 00 00 49 89 2c 24 48 85 ed 74 28 e8 7a 9b
RSP: 0018:ffff88810413f370 EFLAGS: 00010a06
RAX: dffffc0000000000 RBX: ffff888105d050b0 RCX: 0000000000000000
RDX: 1bd5a00000000024 RSI: ffff888119e276e0 RDI: ffffffff835a8b20
RBP: dead000000000100 R08: 0000000000000000 R09: fffffbfff0913481
R10: ffffffff8489a407 R11: fffffbfff0913480 R12: dead000000000122
R13: ffff888105d050b8 R14: 0000000000000000 R15: ffff888105d05028
FS: 00007f7823e8d540(0000) GS:ffff888119e00000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7823e7ebec CR3: 000000010a058002 CR4: 0000000000330ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__create_synth_event+0x1e37/0x1eb0
create_or_delete_synth_event+0x110/0x250
synth_event_run_command+0x2f/0x110
test_gen_synth_cmd+0x170/0x2eb [synth_event_gen_test]
synth_event_gen_test_init+0x76/0x9bc [synth_event_gen_test]
do_one_initcall+0xdb/0x480
do_init_module+0x1cf/0x680
load_module+0x6a50/0x70a0
__do_sys_finit_module+0x12f/0x1c0
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Link: https://lkml.kernel.org/r/20221117012346.22647-3-shangxiaojing@huawei.com
Fixes:
|
||
|
|
a4527fef9a |
tracing: Fix memory leak in test_gen_synth_cmd() and test_empty_synth_event()
test_gen_synth_cmd() only free buf in fail path, hence buf will leak
when there is no failure. Add kfree(buf) to prevent the memleak. The
same reason and solution in test_empty_synth_event().
unreferenced object 0xffff8881127de000 (size 2048):
comm "modprobe", pid 247, jiffies 4294972316 (age 78.756s)
hex dump (first 32 bytes):
20 67 65 6e 5f 73 79 6e 74 68 5f 74 65 73 74 20 gen_synth_test
20 70 69 64 5f 74 20 6e 65 78 74 5f 70 69 64 5f pid_t next_pid_
backtrace:
[<000000004254801a>] kmalloc_trace+0x26/0x100
[<0000000039eb1cf5>] 0xffffffffa00083cd
[<000000000e8c3bc8>] 0xffffffffa00086ba
[<00000000c293d1ea>] do_one_initcall+0xdb/0x480
[<00000000aa189e6d>] do_init_module+0x1cf/0x680
[<00000000d513222b>] load_module+0x6a50/0x70a0
[<000000001fd4d529>] __do_sys_finit_module+0x12f/0x1c0
[<00000000b36c4c0f>] do_syscall_64+0x3f/0x90
[<00000000bbf20cf3>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
unreferenced object 0xffff8881127df000 (size 2048):
comm "modprobe", pid 247, jiffies 4294972324 (age 78.728s)
hex dump (first 32 bytes):
20 65 6d 70 74 79 5f 73 79 6e 74 68 5f 74 65 73 empty_synth_tes
74 20 20 70 69 64 5f 74 20 6e 65 78 74 5f 70 69 t pid_t next_pi
backtrace:
[<000000004254801a>] kmalloc_trace+0x26/0x100
[<00000000d4db9a3d>] 0xffffffffa0008071
[<00000000c31354a5>] 0xffffffffa00086ce
[<00000000c293d1ea>] do_one_initcall+0xdb/0x480
[<00000000aa189e6d>] do_init_module+0x1cf/0x680
[<00000000d513222b>] load_module+0x6a50/0x70a0
[<000000001fd4d529>] __do_sys_finit_module+0x12f/0x1c0
[<00000000b36c4c0f>] do_syscall_64+0x3f/0x90
[<00000000bbf20cf3>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Link: https://lkml.kernel.org/r/20221117012346.22647-2-shangxiaojing@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <zanussi@kernel.org>
Cc: <fengguang.wu@intel.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
19ba6c8af9 |
ftrace: Fix null pointer dereference in ftrace_add_mod()
The @ftrace_mod is allocated by kzalloc(), so both the members {prev,next}
of @ftrace_mode->list are NULL, it's not a valid state to call list_del().
If kstrdup() for @ftrace_mod->{func|module} fails, it goes to @out_free
tag and calls free_ftrace_mod() to destroy @ftrace_mod, then list_del()
will write prev->next and next->prev, where null pointer dereference
happens.
BUG: kernel NULL pointer dereference, address: 0000000000000008
Oops: 0002 [#1] PREEMPT SMP NOPTI
Call Trace:
<TASK>
ftrace_mod_callback+0x20d/0x220
? do_filp_open+0xd9/0x140
ftrace_process_regex.isra.51+0xbf/0x130
ftrace_regex_write.isra.52.part.53+0x6e/0x90
vfs_write+0xee/0x3a0
? __audit_filter_op+0xb1/0x100
? auditd_test_task+0x38/0x50
ksys_write+0xa5/0xe0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Kernel panic - not syncing: Fatal exception
So call INIT_LIST_HEAD() to initialize the list member to fix this issue.
Link: https://lkml.kernel.org/r/20221116015207.30858-1-xiujianfeng@huawei.com
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
56f4ca0a79 |
ring_buffer: Do not deactivate non-existant pages
rb_head_page_deactivate() expects cpu_buffer to contain a valid list of
->pages, so verify that the list is actually present before calling it.
Found by Linux Verification Center (linuxtesting.org) with the SVACE
static analysis tool.
Link: https://lkml.kernel.org/r/20221114143129.3534443-1-d-tatianin@yandex-team.ru
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
bcea02b096 |
ftrace: Optimize the allocation for mcount entries
If we can't allocate this size, try something smaller with half of the
size. Its order should be decreased by one instead of divided by two.
Link: https://lkml.kernel.org/r/20221109094434.84046-3-wangwensheng4@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
08948caebe |
ftrace: Fix the possible incorrect kernel message
If the number of mcount entries is an integer multiple of
ENTRIES_PER_PAGE, the page count showing on the console would be wrong.
Link: https://lkml.kernel.org/r/20221109094434.84046-2-wangwensheng4@huawei.com
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
3af43ba4c6 |
bpf: Pass map file to .map_update_batch directly
Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get the target map file, then it invokes generic_map_update_batch() to do batch update. generic_map_update_batch() will get the target map file by using fdget(batch.map_fd) again and pass it to bpf_map_update_value(). The problem is map file returned by the second fdget() may be NULL or a totally different file compared by map file in bpf_map_do_batch(). The reason is that the first fdget() only guarantees the liveness of struct file instead of file descriptor and the file description may be released by concurrent close() through pick_file(). It doesn't incur any problem as for now, because maps with batch update support don't use map file in .map_fd_get_ptr() ops. But it is better to fix the potential access of an invalid map file. Using __bpf_map_get() again in generic_map_update_batch() can not fix the problem, because batch.map_fd may be closed and reopened, and the returned map file may be different with map file got in bpf_map_do_batch(), so just passing the map file directly to .map_update_batch() in bpf_map_do_batch(). Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com |
||
|
|
649e72070c |
tracing: Fix memory leak in tracing_read_pipe()
kmemleak reports this issue:
unreferenced object 0xffff888105a18900 (size 128):
comm "test_progs", pid 18933, jiffies 4336275356 (age 22801.766s)
hex dump (first 32 bytes):
25 73 00 90 81 88 ff ff 26 05 00 00 42 01 58 04 %s......&...B.X.
03 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000560143a1>] __kmalloc_node_track_caller+0x4a/0x140
[<000000006af00822>] krealloc+0x8d/0xf0
[<00000000c309be6a>] trace_iter_expand_format+0x99/0x150
[<000000005a53bdb6>] trace_check_vprintf+0x1e0/0x11d0
[<0000000065629d9d>] trace_event_printf+0xb6/0xf0
[<000000009a690dc7>] trace_raw_output_bpf_trace_printk+0x89/0xc0
[<00000000d22db172>] print_trace_line+0x73c/0x1480
[<00000000cdba76ba>] tracing_read_pipe+0x45c/0x9f0
[<0000000015b58459>] vfs_read+0x17b/0x7c0
[<000000004aeee8ed>] ksys_read+0xed/0x1c0
[<0000000063d3d898>] do_syscall_64+0x3b/0x90
[<00000000a06dda7f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
iter->fmt alloced in
tracing_read_pipe() -> .. ->trace_iter_expand_format(), but not
freed, to fix, add free in tracing_release_pipe()
Link: https://lkml.kernel.org/r/1667819090-4643-1-git-send-email-wangyufen@huawei.com
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
31029a8b2c |
ring-buffer: Include dropped pages in counting dirty patches
The function ring_buffer_nr_dirty_pages() was created to find out how many
pages are filled in the ring buffer. There's two running counters. One is
incremented whenever a new page is touched (pages_touched) and the other
is whenever a page is read (pages_read). The dirty count is the number
touched minus the number read. This is used to determine if a blocked task
should be woken up if the percentage of the ring buffer it is waiting for
is hit.
The problem is that it does not take into account dropped pages (when the
new writes overwrite pages that were not read). And then the dirty pages
will always be greater than the percentage.
This makes the "buffer_percent" file inaccurate, as the number of dirty
pages end up always being larger than the percentage, event when it's not
and this causes user space to be woken up more than it wants to be.
Add a new counter to keep track of lost pages, and include that in the
accounting of dirty pages so that it is actually accurate.
Link: https://lkml.kernel.org/r/20221021123013.55fb6055@gandalf.local.home
Fixes:
|
||
|
|
42fb0a1e84 |
tracing/ring-buffer: Have polling block on watermark
Currently the way polling works on the ring buffer is broken. It will
return immediately if there's any data in the ring buffer whereas a read
will block until the watermark (defined by the tracefs buffer_percent file)
is hit.
That is, a select() or poll() will return as if there's data available,
but then the following read will block. This is broken for the way
select()s and poll()s are supposed to work.
Have the polling on the ring buffer also block the same way reads and
splice does on the ring buffer.
Link: https://lkml.kernel.org/r/20221020231427.41be3f26@gandalf.local.home
Cc: Linux Trace Kernel <linux-trace-kernel@vger.kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Primiano Tucci <primiano@google.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
befae75856 |
bpf: propagate nullness information for reg to reg comparisons
Propagate nullness information for branches of register to register equality compare instructions. The following rules are used: - suppose register A maybe null - suppose register B is not null - for JNE A, B, ... - A is not null in the false branch - for JEQ A, B, ... - A is not null in the true branch E.g. for program like below: r6 = skb->sk; r7 = sk_fullsock(r6); r0 = sk_fullsock(r6); if (r0 == 0) return 0; (a) if (r0 != r7) return 0; (b) *r7->type; (c) return 0; It is safe to dereference r7 at point (c), because of (a) and (b). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221115224859.2452988-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
32637e3300 |
bpf: Expand map key argument of bpf_redirect_map to u64
For queueing packets in XDP we want to add a new redirect map type with support for 64-bit indexes. To prepare fore this, expand the width of the 'key' argument to the bpf_redirect_map() helper. Since BPF registers are always 64-bit, this should be safe to do after the fact. Acked-by: Song Liu <song@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
47df8a2f78 |
bpf, perf: Use subprog name when reporting subprog ksymbol
Since commit |
||
|
|
6728aea721 |
bpf: Refactor btf_struct_access
Instead of having to pass multiple arguments that describe the register, pass the bpf_reg_state into the btf_struct_access callback. Currently, all call sites simply reuse the btf and btf_id of the reg they want to check the access of. The only exception to this pattern is the callsite in check_ptr_to_map_access, hence for that case create a dummy reg to simulate PTR_TO_BTF_ID access. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
894f2a8b16 |
bpf: Rename MEM_ALLOC to MEM_RINGBUF
Currently, verifier uses MEM_ALLOC type tag to specially tag memory returned from bpf_ringbuf_reserve helper. However, this is currently only used for this purpose and there is an implicit assumption that it only refers to ringbuf memory (e.g. the check for ARG_PTR_TO_ALLOC_MEM in check_func_arg_reg_off). Hence, rename MEM_ALLOC to MEM_RINGBUF to indicate this special relationship and instead open the use of MEM_ALLOC for more generic allocations made for user types. Also, since ARG_PTR_TO_ALLOC_MEM_OR_NULL is unused, simply drop it. Finally, update selftests using 'alloc_' verifier string to 'ringbuf_'. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
2de2669b4e |
bpf: Rename RET_PTR_TO_ALLOC_MEM
Currently, the verifier has two return types, RET_PTR_TO_ALLOC_MEM, and
RET_PTR_TO_ALLOC_MEM_OR_NULL, however the former is confusingly named to
imply that it carries MEM_ALLOC, while only the latter does. This causes
confusion during code review leading to conclusions like that the return
value of RET_PTR_TO_DYNPTR_MEM_OR_NULL (which is RET_PTR_TO_ALLOC_MEM |
PTR_MAYBE_NULL) may be consumable by bpf_ringbuf_{submit,commit}.
Rename it to make it clear MEM_ALLOC needs to be tacked on top of
RET_PTR_TO_MEM.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
f0c5941ff5 |
bpf: Support bpf_list_head in map values
Add the support on the map side to parse, recognize, verify, and build
metadata table for a new special field of the type struct bpf_list_head.
To parameterize the bpf_list_head for a certain value type and the
list_node member it will accept in that value type, we use BTF
declaration tags.
The definition of bpf_list_head in a map value will be done as follows:
struct foo {
struct bpf_list_node node;
int data;
};
struct map_value {
struct bpf_list_head head __contains(foo, node);
};
Then, the bpf_list_head only allows adding to the list 'head' using the
bpf_list_node 'node' for the type struct foo.
The 'contains' annotation is a BTF declaration tag composed of four
parts, "contains:name:node" where the name is then used to look up the
type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
the lookup. The node defines name of the member in this type that has
the type struct bpf_list_node, which is actually used for linking into
the linked list. For now, 'kind' part is hardcoded as struct.
This allows building intrusive linked lists in BPF, using container_of
to obtain pointer to entry, while being completely type safe from the
perspective of the verifier. The verifier knows exactly the type of the
nodes, and knows that list helpers return that type at some fixed offset
where the bpf_list_node member used for this list exists. The verifier
also uses this information to disallow adding types that are not
accepted by a certain list.
For now, no elements can be added to such lists. Support for that is
coming in future patches, hence draining and freeing items is done with
a TODO that will be resolved in a future patch.
Note that the bpf_list_head_free function moves the list out to a local
variable under the lock and releases it, doing the actual draining of
the list items outside the lock. While this helps with not holding the
lock for too long pessimizing other concurrent list operations, it is
also necessary for deadlock prevention: unless every function called in
the critical section would be notrace, a fentry/fexit program could
attach and call bpf_map_update_elem again on the map, leading to the
same lock being acquired if the key matches and lead to a deadlock.
While this requires some special effort on part of the BPF programmer to
trigger and is highly unlikely to occur in practice, it is always better
if we can avoid such a condition.
While notrace would prevent this, doing the draining outside the lock
has advantages of its own, hence it is used to also fix the deadlock
related problem.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
2d57725257 |
bpf: Remove BPF_MAP_OFF_ARR_MAX
In
|
||
|
|
91dabf33ae |
sched: Fix race in task_call_func()
There is a very narrow race between schedule() and task_call_func().
CPU0 CPU1
__schedule()
rq_lock();
prev_state = READ_ONCE(prev->__state);
if (... && prev_state) {
deactivate_tasl(rq, prev, ...)
prev->on_rq = 0;
task_call_func()
raw_spin_lock_irqsave(p->pi_lock);
state = READ_ONCE(p->__state);
smp_rmb();
if (... || p->on_rq) // false!!!
rq = __task_rq_lock()
ret = func();
next = pick_next_task();
rq = context_switch(prev, next)
prepare_lock_switch()
spin_release(&__rq_lockp(rq)->dep_map...)
So while the task is on it's way out, it still holds rq->lock for a
little while, and right then task_call_func() comes in and figures it
doesn't need rq->lock anymore (because the task is already dequeued --
but still running there) and then the __set_task_frozen() thing observes
it's holding rq->lock and yells murder.
Avoid this by waiting for p->on_cpu to get cleared, which guarantees
the task is fully finished on the old CPU.
( While arguably the fixes tag is 'wrong' -- none of the previous
task_call_func() users appears to care for this case. )
Fixes:
|
||
|
|
448dca8c88 |
rseq: Use pr_warn_once() when deprecated/unknown ABI flags are encountered
These commits use WARN_ON_ONCE() and kill the offending processes when deprecated and unknown flags are encountered: commit |
||
|
|
f4c4ca70de |
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Andrii Nakryiko says: ==================== bpf-next 2022-11-11 We've added 49 non-merge commits during the last 9 day(s) which contain a total of 68 files changed, 3592 insertions(+), 1371 deletions(-). The main changes are: 1) Veristat tool improvements to support custom filtering, sorting, and replay of results, from Andrii Nakryiko. 2) BPF verifier precision tracking fixes and improvements, from Andrii Nakryiko. 3) Lots of new BPF documentation for various BPF maps, from Dave Tucker, Donald Hunter, Maryam Tahhan, Bagas Sanjaya. 4) BTF dedup improvements and libbpf's hashmap interface clean ups, from Eduard Zingerman. 5) Fix veth driver panic if XDP program is attached before veth_open, from John Fastabend. 6) BPF verifier clean ups and fixes in preparation for follow up features, from Kumar Kartikeya Dwivedi. 7) Add access to hwtstamp field from BPF sockops programs, from Martin KaFai Lau. 8) Various fixes for BPF selftests and samples, from Artem Savkov, Domenico Cerasuolo, Kang Minchul, Rong Tao, Yang Jihong. 9) Fix redirection to tunneling device logic, preventing skb->len == 0, from Stanislav Fomichev. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits) selftests/bpf: fix veristat's singular file-or-prog filter selftests/bpf: Test skops->skb_hwtstamp selftests/bpf: Fix incorrect ASSERT in the tcp_hdr_options test bpf: Add hwtstamp field for the sockops prog selftests/bpf: Fix xdp_synproxy compilation failure in 32-bit arch bpf, docs: Document BPF_MAP_TYPE_ARRAY docs/bpf: Document BPF map types QUEUE and STACK docs/bpf: Document BPF ARRAY_OF_MAPS and HASH_OF_MAPS docs/bpf: Document BPF_MAP_TYPE_CPUMAP map docs/bpf: Document BPF_MAP_TYPE_LPM_TRIE map libbpf: Hashmap.h update to fix build issues using LLVM14 bpf: veth driver panics when xdp prog attached before veth_open selftests: Fix test group SKIPPED result selftests/bpf: Tests for btf_dedup_resolve_fwds libbpf: Resolve unambigous forward declarations libbpf: Hashmap interface update to allow both long and void* keys/values samples/bpf: Fix sockex3 error: Missing BPF prog type selftests/bpf: Fix u32 variable compared with less than zero Documentation: bpf: Escape underscore in BPF type name prefix selftests/bpf: Use consistent build-id type for liburandom_read.so ... ==================== Link: https://lore.kernel.org/r/20221111233733.1088228-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
c1754bf019 |
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Andrii Nakryiko says: ==================== bpf 2022-11-11 We've added 11 non-merge commits during the last 8 day(s) which contain a total of 11 files changed, 83 insertions(+), 74 deletions(-). The main changes are: 1) Fix strncpy_from_kernel_nofault() to prevent out-of-bounds writes, from Alban Crequy. 2) Fix for bpf_prog_test_run_skb() to prevent wrong alignment, from Baisong Zhong. 3) Switch BPF_DISPATCHER to static_call() instead of ftrace infra, with a small build fix on top, from Peter Zijlstra and Nathan Chancellor. 4) Fix memory leak in BPF verifier in some error cases, from Wang Yufen. 5) 32-bit compilation error fixes for BPF selftests, from Pu Lehui and Yang Jihong. 6) Ensure even distribution of per-CPU free list elements, from Xu Kuohai. 7) Fix copy_map_value() to track special zeroed out areas properly, from Xu Kuohai. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix offset calculation error in __copy_map_value and zero_map_value bpf: Initialize same number of free nodes for each pcpu_freelist selftests: bpf: Add a test when bpf_probe_read_kernel_str() returns EFAULT maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault() selftests/bpf: Fix test_progs compilation failure in 32-bit arch selftests/bpf: Fix casting error when cross-compiling test_verifier for 32-bit platforms bpf: Fix memory leaks in __check_func_call bpf: Add explicit cast to 'void *' for __BPF_DISPATCHER_UPDATE() bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace) bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop") bpf, test_run: Fix alignment problem in bpf_prog_test_run_skb() ==================== Link: https://lore.kernel.org/r/20221111231624.938829-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
4b45cd81f7 |
bpf: Initialize same number of free nodes for each pcpu_freelist
pcpu_freelist_populate() initializes nr_elems / num_possible_cpus() + 1
free nodes for some CPUs, and then possibly one CPU with fewer nodes,
followed by remaining cpus with 0 nodes. For example, when nr_elems == 256
and num_possible_cpus() == 32, CPU 0~27 each gets 9 free nodes, CPU 28 gets
4 free nodes, CPU 29~31 get 0 free nodes, while in fact each CPU should get
8 nodes equally.
This patch initializes nr_elems / num_possible_cpus() free nodes for each
CPU firstly, then allocates the remaining free nodes by one for each CPU
until no free nodes left.
Fixes:
|
||
|
|
161939abc8 |
docs/bpf: Document BPF_MAP_TYPE_CPUMAP map
Add documentation for BPF_MAP_TYPE_CPUMAP including kernel version introduced, usage and examples. Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221107165207.2682075-2-mtahhan@redhat.com |
||
|
|
966a9b4903 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/pch_can.c |
||
|
|
4bbf3422df |
Merge tag 'net-6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, wifi, can and bpf.
Current release - new code bugs:
- can: af_can: can_exit(): add missing dev_remove_pack() of
canxl_packet
Previous releases - regressions:
- bpf, sockmap: fix the sk->sk_forward_alloc warning
- wifi: mac80211: fix general-protection-fault in
ieee80211_subif_start_xmit()
- can: af_can: fix NULL pointer dereference in can_rx_register()
- can: dev: fix skb drop check, avoid o-o-b access
- nfnetlink: fix potential dead lock in nfnetlink_rcv_msg()
Previous releases - always broken:
- bpf: fix wrong reg type conversion in release_reference()
- gso: fix panic on frag_list with mixed head alloc types
- wifi: brcmfmac: fix buffer overflow in brcmf_fweh_event_worker()
- wifi: mac80211: set TWT Information Frame Disabled bit as 1
- eth: macsec offload related fixes, make sure to clear the keys from
memory
- tun: fix memory leaks in the use of napi_get_frags
- tun: call napi_schedule_prep() to ensure we own a napi
- tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent
- ipv6: addrlabel: fix infoleak when sending struct ifaddrlblmsg to
network
- tipc: fix a msg->req tlv length check
- sctp: clear out_curr if all frag chunks of current msg are pruned,
avoid list corruption
- mctp: fix an error handling path in mctp_init(), avoid leaks"
* tag 'net-6.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
eth: sp7021: drop free_netdev() from spl2sw_init_netdev()
MAINTAINERS: Move Vivien to CREDITS
net: macvlan: fix memory leaks of macvlan_common_newlink
ethernet: tundra: free irq when alloc ring failed in tsi108_open()
net: mv643xx_eth: disable napi when init rxq or txq failed in mv643xx_eth_open()
ethernet: s2io: disable napi when start nic failed in s2io_card_up()
net: atlantic: macsec: clear encryption keys from the stack
net: phy: mscc: macsec: clear encryption keys when freeing a flow
stmmac: dwmac-loongson: fix missing of_node_put() while module exiting
stmmac: dwmac-loongson: fix missing pci_disable_device() in loongson_dwmac_probe()
stmmac: dwmac-loongson: fix missing pci_disable_msi() while module exiting
cxgb4vf: shut down the adapter when t4vf_update_port_info() failed in cxgb4vf_open()
mctp: Fix an error handling path in mctp_init()
stmmac: intel: Update PCH PTP clock rate from 200MHz to 204.8MHz
net: cxgb3_main: disable napi when bind qsets failed in cxgb_up()
net: cpsw: disable napi in cpsw_ndo_open()
iavf: Fix VF driver counting VLAN 0 filters
ice: Fix spurious interrupt during removal of trusted VF
net/mlx5e: TC, Fix slab-out-of-bounds in parse_tc_actions
net/mlx5e: E-Switch, Fix comparing termination table instance
...
|
||
|
|
eb86559a69 |
bpf: Fix memory leaks in __check_func_call
kmemleak reports this issue:
unreferenced object 0xffff88817139d000 (size 2048):
comm "test_progs", pid 33246, jiffies 4307381979 (age 45851.820s)
hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000045f075f0>] kmalloc_trace+0x27/0xa0
[<0000000098b7c90a>] __check_func_call+0x316/0x1230
[<00000000b4c3c403>] check_helper_call+0x172e/0x4700
[<00000000aa3875b7>] do_check+0x21d8/0x45e0
[<000000001147357b>] do_check_common+0x767/0xaf0
[<00000000b5a595b4>] bpf_check+0x43e3/0x5bc0
[<0000000011e391b1>] bpf_prog_load+0xf26/0x1940
[<0000000007f765c0>] __sys_bpf+0xd2c/0x3650
[<00000000839815d6>] __x64_sys_bpf+0x75/0xc0
[<00000000946ee250>] do_syscall_64+0x3b/0x90
[<0000000000506b7f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
The root case here is: In function prepare_func_exit(), the callee is
not released in the abnormal scenario after "state->curframe--;". To
fix, move "state->curframe--;" to the very bottom of the function,
right when we free callee and reset frame[] pointer to NULL, as Andrii
suggested.
In addition, function __check_func_call() has a similar problem. In
the abnormal scenario before "state->curframe++;", the callee also
should be released by free_func_state().
Fixes:
|
||
|
|
bb88f96954 |
perf: Improve missing SIGTRAP checking
To catch missing SIGTRAP we employ a WARN in __perf_event_overflow(),
which fires if pending_sigtrap was already set: returning to user space
without consuming pending_sigtrap, and then having the event fire again
would re-enter the kernel and trigger the WARN.
This, however, seemed to miss the case where some events not associated
with progress in the user space task can fire and the interrupt handler
runs before the IRQ work meant to consume pending_sigtrap (and generate
the SIGTRAP).
syzbot gifted us this stack trace:
| WARNING: CPU: 0 PID: 3607 at kernel/events/core.c:9313 __perf_event_overflow
| Modules linked in:
| CPU: 0 PID: 3607 Comm: syz-executor100 Not tainted 6.1.0-rc2-syzkaller-00073-g88619e77b33d #0
| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
| RIP: 0010:__perf_event_overflow+0x498/0x540 kernel/events/core.c:9313
| <...>
| Call Trace:
| <TASK>
| perf_swevent_hrtimer+0x34f/0x3c0 kernel/events/core.c:10729
| __run_hrtimer kernel/time/hrtimer.c:1685 [inline]
| __hrtimer_run_queues+0x1c6/0xfb0 kernel/time/hrtimer.c:1749
| hrtimer_interrupt+0x31c/0x790 kernel/time/hrtimer.c:1811
| local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1096 [inline]
| __sysvec_apic_timer_interrupt+0x17c/0x640 arch/x86/kernel/apic/apic.c:1113
| sysvec_apic_timer_interrupt+0x40/0xc0 arch/x86/kernel/apic/apic.c:1107
| asm_sysvec_apic_timer_interrupt+0x16/0x20 arch/x86/include/asm/idtentry.h:649
| <...>
| </TASK>
In this case, syzbot produced a program with event type
PERF_TYPE_SOFTWARE and config PERF_COUNT_SW_CPU_CLOCK. The hrtimer
manages to fire again before the IRQ work got a chance to run, all while
never having returned to user space.
Improve the WARN to check for real progress in user space: approximate
this by storing a 32-bit hash of the current IP into pending_sigtrap,
and if an event fires while pending_sigtrap still matches the previous
IP, we assume no progress (false negatives are possible given we could
return to user space and trigger again on the same IP).
Fixes:
|
||
|
|
a679120edf |
bpf: Add explicit cast to 'void *' for __BPF_DISPATCHER_UPDATE()
When building with clang:
kernel/bpf/dispatcher.c:126:33: error: pointer type mismatch ('void *' and 'unsigned int (*)(const void *, const struct bpf_insn *, bpf_func_t)' (aka 'unsigned int (*)(const void *, const struct bpf_insn *, unsigned int (*)(const void *, const struct bpf_insn *))')) [-Werror,-Wpointer-type-mismatch]
__BPF_DISPATCHER_UPDATE(d, new ?: &bpf_dispatcher_nop_func);
~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~
./include/linux/bpf.h:1045:54: note: expanded from macro '__BPF_DISPATCHER_UPDATE'
__static_call_update((_d)->sc_key, (_d)->sc_tramp, (_new))
^~~~
1 error generated.
The warning is pointing out that the type of new ('void *') and
&bpf_dispatcher_nop_func are not compatible, which could have side
effects coming out of a conditional operator due to promotion rules.
Add the explicit cast to 'void *' to make it clear that this is
expected, as __BPF_DISPATCHER_UPDATE() expands to a call to
__static_call_update(), which expects a 'void *' as its final argument.
Fixes:
|
||
|
|
727ea09e99 |
Merge tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov: - Add Cooper Lake's stepping to the PEBS guest/host events isolation fixed microcode revisions checking quirk - Update Icelake and Sapphire Rapids events constraints - Use the standard energy unit for Sapphire Rapids in RAPL - Fix the hw_breakpoint test to fail more graciously on !SMP configs * tag 'perf_urgent_for_v6.1_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[] perf/x86/intel: Fix pebs event constraints for SPR perf/x86/intel: Fix pebs event constraints for ICL perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain perf/hw_breakpoint: test: Skip the test if dependencies unmet |
||
|
|
c86df29d11 |
bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)
The dispatcher function is currently abusing the ftrace __fentry__ call location for its own purposes -- this obviously gives trouble when the dispatcher and ftrace are both in use. A previous solution tried using __attribute__((patchable_function_entry())) which works, except it is GCC-8+ only, breaking the build on the earlier still supported compilers. Instead use static_call() -- which has its own annotations and does not conflict with ftrace -- to rewrite the dispatch function. By using: return static_call()(ctx, insni, bpf_func) you get a perfect forwarding tail call as function body (iow a single jmp instruction). By having the default static_call() target be bpf_dispatcher_nop_func() it retains the default behaviour (an indirect call to the argument function). Only once a dispatcher program is attached is the target rewritten to directly call the JIT'ed image. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Björn Töpel <bjorn@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Björn Töpel <bjorn@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lkml.kernel.org/r/Y1/oBlK0yFk5c/Im@hirez.programming.kicks-ass.net Link: https://lore.kernel.org/bpf/20221103120647.796772565@infradead.org |
||
|
|
18acb7fac2 |
bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")
Because __attribute__((patchable_function_entry)) is only available since GCC-8 this solution fails to build on the minimum required GCC version. Undo these changes so we might try again -- without cluttering up the patches with too many changes. This is an almost complete revert of: |
||
|
|
7a830b53c1 |
bpf: aggressively forget precise markings during state checkpointing
Exploit the property of about-to-be-checkpointed state to be able to forget all precise markings up to that point even more aggressively. We now clear all potentially inherited precise markings right before checkpointing and branching off into child state. If any of children states require precise knowledge of any SCALAR register, those will be propagated backwards later on before this state is finalized, preserving correctness. There is a single selftests BPF program change, but tremendous one: 25x reduction in number of verified instructions and states in trace_virtqueue_add_sgs. Cilium results are more modest, but happen across wider range of programs. SELFTESTS RESULTS ================= $ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results.csv ~/imprecise-aggressive-results.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ------------------- ----------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- loop6.bpf.linked1.o trace_virtqueue_add_sgs 398057 15114 -382943 (-96.20%) 8717 336 -8381 (-96.15%) ------------------- ----------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- CILIUM RESULTS ============== $ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results-cilium.csv ~/imprecise-aggressive-results-cilium.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ------------- -------------------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- bpf_host.o tail_handle_nat_fwd_ipv4 23426 23221 -205 (-0.88%) 1537 1515 -22 (-1.43%) bpf_host.o tail_handle_nat_fwd_ipv6 13009 12904 -105 (-0.81%) 719 708 -11 (-1.53%) bpf_host.o tail_nodeport_nat_ingress_ipv6 5261 5196 -65 (-1.24%) 247 243 -4 (-1.62%) bpf_host.o tail_nodeport_nat_ipv6_egress 3446 3406 -40 (-1.16%) 203 198 -5 (-2.46%) bpf_lxc.o tail_handle_nat_fwd_ipv4 23426 23221 -205 (-0.88%) 1537 1515 -22 (-1.43%) bpf_lxc.o tail_handle_nat_fwd_ipv6 13009 12904 -105 (-0.81%) 719 708 -11 (-1.53%) bpf_lxc.o tail_ipv4_ct_egress 5074 4897 -177 (-3.49%) 255 248 -7 (-2.75%) bpf_lxc.o tail_ipv4_ct_ingress 5100 4923 -177 (-3.47%) 255 248 -7 (-2.75%) bpf_lxc.o tail_ipv4_ct_ingress_policy_only 5100 4923 -177 (-3.47%) 255 248 -7 (-2.75%) bpf_lxc.o tail_ipv6_ct_egress 4558 4536 -22 (-0.48%) 188 187 -1 (-0.53%) bpf_lxc.o tail_ipv6_ct_ingress 4578 4556 -22 (-0.48%) 188 187 -1 (-0.53%) bpf_lxc.o tail_ipv6_ct_ingress_policy_only 4578 4556 -22 (-0.48%) 188 187 -1 (-0.53%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 5261 5196 -65 (-1.24%) 247 243 -4 (-1.62%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 5261 5196 -65 (-1.24%) 247 243 -4 (-1.62%) bpf_overlay.o tail_nodeport_nat_ipv6_egress 3482 3442 -40 (-1.15%) 204 201 -3 (-1.47%) bpf_xdp.o tail_nodeport_nat_egress_ipv4 17200 15619 -1581 (-9.19%) 1111 1010 -101 (-9.09%) ------------- -------------------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221104163649.121784-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
f63181b6ae |
bpf: stop setting precise in current state
Setting reg->precise to true in current state is not necessary from correctness standpoint, but it does pessimise the whole precision (or rather "imprecision", because that's what we want to keep as much as possible) tracking. Why is somewhat subtle and my best attempt to explain this is recorded in an extensive comment for __mark_chain_precise() function. Some more careful thinking and code reading is probably required still to grok this completely, unfortunately. Whiteboarding and a bunch of extra handwaiving in person would be even more helpful, but is deemed impractical in Git commit. Next patch pushes this imprecision property even further, building on top of the insights described in this patch. End results are pretty nice, we get reduction in number of total instructions and states verified due to a better states reuse, as some of the states are now more generic and permissive due to less unnecessary precise=true requirements. SELFTESTS RESULTS ================= $ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results.csv ~/imprecise-early-results.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) --------------------------------------- ---------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- bpf_iter_ksym.bpf.linked1.o dump_ksym 347 285 -62 (-17.87%) 20 19 -1 (-5.00%) pyperf600_bpf_loop.bpf.linked1.o on_event 3678 3736 +58 (+1.58%) 276 285 +9 (+3.26%) setget_sockopt.bpf.linked1.o skops_sockopt 4038 3947 -91 (-2.25%) 347 343 -4 (-1.15%) test_l4lb.bpf.linked1.o balancer_ingress 4559 2611 -1948 (-42.73%) 118 105 -13 (-11.02%) test_l4lb_noinline.bpf.linked1.o balancer_ingress 6279 6268 -11 (-0.18%) 237 236 -1 (-0.42%) test_misc_tcp_hdr_options.bpf.linked1.o misc_estab 1307 1303 -4 (-0.31%) 100 99 -1 (-1.00%) test_sk_lookup.bpf.linked1.o ctx_narrow_access 456 447 -9 (-1.97%) 39 38 -1 (-2.56%) test_sysctl_loop1.bpf.linked1.o sysctl_tcp_mem 1389 1384 -5 (-0.36%) 26 25 -1 (-3.85%) test_tc_dtime.bpf.linked1.o egress_fwdns_prio101 518 485 -33 (-6.37%) 51 46 -5 (-9.80%) test_tc_dtime.bpf.linked1.o egress_host 519 468 -51 (-9.83%) 50 44 -6 (-12.00%) test_tc_dtime.bpf.linked1.o ingress_fwdns_prio101 842 1000 +158 (+18.76%) 73 88 +15 (+20.55%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 405757 373173 -32584 (-8.03%) 25735 22882 -2853 (-11.09%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 479055 371590 -107465 (-22.43%) 29145 22207 -6938 (-23.81%) --------------------------------------- ---------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- Slight regression in test_tc_dtime.bpf.linked1.o/ingress_fwdns_prio101 is left for a follow up, there might be some more precision-related bugs in existing BPF verifier logic. CILIUM RESULTS ============== $ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results-cilium.csv ~/imprecise-early-results-cilium.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ------------- ------------------------------ --------------- --------------- ------------------ ---------------- ---------------- ------------------- bpf_host.o cil_from_host 762 556 -206 (-27.03%) 43 37 -6 (-13.95%) bpf_host.o tail_handle_nat_fwd_ipv4 23541 23426 -115 (-0.49%) 1538 1537 -1 (-0.07%) bpf_host.o tail_nodeport_nat_egress_ipv4 33592 33566 -26 (-0.08%) 2163 2161 -2 (-0.09%) bpf_lxc.o tail_handle_nat_fwd_ipv4 23541 23426 -115 (-0.49%) 1538 1537 -1 (-0.07%) bpf_overlay.o tail_nodeport_nat_egress_ipv4 33581 33543 -38 (-0.11%) 2160 2157 -3 (-0.14%) bpf_xdp.o tail_handle_nat_fwd_ipv4 21659 20920 -739 (-3.41%) 1440 1376 -64 (-4.44%) bpf_xdp.o tail_handle_nat_fwd_ipv6 17084 17039 -45 (-0.26%) 907 905 -2 (-0.22%) bpf_xdp.o tail_lb_ipv4 73442 73430 -12 (-0.02%) 4370 4369 -1 (-0.02%) bpf_xdp.o tail_lb_ipv6 152114 151895 -219 (-0.14%) 6493 6479 -14 (-0.22%) bpf_xdp.o tail_nodeport_nat_egress_ipv4 17377 17200 -177 (-1.02%) 1125 1111 -14 (-1.24%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 6405 6397 -8 (-0.12%) 309 308 -1 (-0.32%) bpf_xdp.o tail_rev_nodeport_lb4 7126 6934 -192 (-2.69%) 414 402 -12 (-2.90%) bpf_xdp.o tail_rev_nodeport_lb6 18059 17905 -154 (-0.85%) 1105 1096 -9 (-0.81%) ------------- ------------------------------ --------------- --------------- ------------------ ---------------- ---------------- ------------------- Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221104163649.121784-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
be2ef81615 |
bpf: allow precision tracking for programs with subprogs
Stop forcing precise=true for SCALAR registers when BPF program has any subprograms. Current restriction means that any BPF program, as soon as it uses subprograms, will end up not getting any of the precision tracking benefits in reduction of number of verified states. This patch keeps the fallback mark_all_scalars_precise() behavior if precise marking has to cross function frames. E.g., if subprogram requires R1 (first input arg) to be marked precise, ideally we'd need to backtrack to the parent function and keep marking R1 and its dependencies as precise. But right now we give up and force all the SCALARs in any of the current and parent states to be forced to precise=true. We can lift that restriction in the future. But this patch fixes two issues identified when trying to enable precision tracking for subprogs. First, prevent "escaping" from top-most state in a global subprog. While with entry-level BPF program we never end up requesting precision for R1-R5 registers, because R2-R5 are not initialized (and so not readable in correct BPF program), and R1 is PTR_TO_CTX, not SCALAR, and so is implicitly precise. With global subprogs, though, it's different, as global subprog a) can have up to 5 SCALAR input arguments, which might get marked as precise=true and b) it is validated in isolation from its main entry BPF program. b) means that we can end up exhausting parent state chain and still not mark all registers in reg_mask as precise, which would lead to verifier bug warning. To handle that, we need to consider two cases. First, if the very first state is not immediately "checkpointed" (i.e., stored in state lookup hashtable), it will get correct first_insn_idx and last_insn_idx instruction set during state checkpointing. As such, this case is already handled and __mark_chain_precision() already handles that by just doing nothing when we reach to the very first parent state. st->parent will be NULL and we'll just stop. Perhaps some extra check for reg_mask and stack_mask is due here, but this patch doesn't address that issue. More problematic second case is when global function's initial state is immediately checkpointed before we manage to process the very first instruction. This is happening because when there is a call to global subprog from the main program the very first subprog's instruction is marked as pruning point, so before we manage to process first instruction we have to check and checkpoint state. This patch adds a special handling for such "empty" state, which is identified by having st->last_insn_idx set to -1. In such case, we check that we are indeed validating global subprog, and with some sanity checking we mark input args as precise if requested. Note that we also initialize state->first_insn_idx with correct start insn_idx offset. For main program zero is correct value, but for any subprog it's quite confusing to not have first_insn_idx set. This doesn't have any functional impact, but helps with debugging and state printing. We also explicitly initialize state->last_insns_idx instead of relying on is_state_visited() to do this with env->prev_insns_idx, which will be -1 on the very first instruction. This concludes necessary changes to handle specifically global subprog's precision tracking. Second identified problem was missed handling of BPF helper functions that call into subprogs (e.g., bpf_loop and few others). From precision tracking and backtracking logic's standpoint those are effectively calls into subprogs and should be called as BPF_PSEUDO_CALL calls. This patch takes the least intrusive way and just checks against a short list of current BPF helpers that do call subprogs, encapsulated in is_callback_calling_function() function. But to prevent accidentally forgetting to add new BPF helpers to this "list", we also do a sanity check in __check_func_call, which has to be called for each such special BPF helper, to validate that BPF helper is indeed recognized as callback-calling one. This should catch any missed checks in the future. Adding some special flags to be added in function proto definitions seemed like an overkill in this case. With the above changes, it's possible to remove forceful setting of reg->precise to true in __mark_reg_unknown, which turns on precision tracking both inside subprogs and entry progs that have subprogs. No warnings or errors were detected across all the selftests, but also when validating with veristat against internal Meta BPF objects and Cilium objects. Further, in some BPF programs there are noticeable reduction in number of states and instructions validated due to more effective precision tracking, especially benefiting syncookie test. $ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/subprog-precise-results.csv | grep -v '+0' File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF) ---------------------------------------- -------------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- pyperf600_bpf_loop.bpf.linked1.o on_event 3966 3678 -288 (-7.26%) 306 276 -30 (-9.80%) pyperf_global.bpf.linked1.o on_event 7563 7530 -33 (-0.44%) 520 517 -3 (-0.58%) pyperf_subprogs.bpf.linked1.o on_event 36358 36934 +576 (+1.58%) 2499 2531 +32 (+1.28%) setget_sockopt.bpf.linked1.o skops_sockopt 3965 4038 +73 (+1.84%) 343 347 +4 (+1.17%) test_cls_redirect_subprogs.bpf.linked1.o cls_redirect 64965 64901 -64 (-0.10%) 4619 4612 -7 (-0.15%) test_misc_tcp_hdr_options.bpf.linked1.o misc_estab 1491 1307 -184 (-12.34%) 110 100 -10 (-9.09%) test_pkt_access.bpf.linked1.o test_pkt_access 354 349 -5 (-1.41%) 25 24 -1 (-4.00%) test_sock_fields.bpf.linked1.o egress_read_sock_fields 435 375 -60 (-13.79%) 22 20 -2 (-9.09%) test_sysctl_loop2.bpf.linked1.o sysctl_tcp_mem 1508 1501 -7 (-0.46%) 29 28 -1 (-3.45%) test_tc_dtime.bpf.linked1.o egress_fwdns_prio100 468 435 -33 (-7.05%) 45 41 -4 (-8.89%) test_tc_dtime.bpf.linked1.o ingress_fwdns_prio100 398 408 +10 (+2.51%) 42 39 -3 (-7.14%) test_tc_dtime.bpf.linked1.o ingress_fwdns_prio101 1096 842 -254 (-23.18%) 97 73 -24 (-24.74%) test_tcp_hdr_options.bpf.linked1.o estab 2758 2408 -350 (-12.69%) 208 181 -27 (-12.98%) test_urandom_usdt.bpf.linked1.o urand_read_with_sema 466 448 -18 (-3.86%) 31 28 -3 (-9.68%) test_urandom_usdt.bpf.linked1.o urand_read_without_sema 466 448 -18 (-3.86%) 31 28 -3 (-9.68%) test_urandom_usdt.bpf.linked1.o urandlib_read_with_sema 466 448 -18 (-3.86%) 31 28 -3 (-9.68%) test_urandom_usdt.bpf.linked1.o urandlib_read_without_sema 466 448 -18 (-3.86%) 31 28 -3 (-9.68%) test_xdp_noinline.bpf.linked1.o balancer_ingress_v6 4302 4294 -8 (-0.19%) 257 256 -1 (-0.39%) xdp_synproxy_kern.bpf.linked1.o syncookie_tc 583722 405757 -177965 (-30.49%) 35846 25735 -10111 (-28.21%) xdp_synproxy_kern.bpf.linked1.o syncookie_xdp 609123 479055 -130068 (-21.35%) 35452 29145 -6307 (-17.79%) ---------------------------------------- -------------------------- --------------- --------------- ------------------ ---------------- ---------------- ------------------- Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20221104163649.121784-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
529409ea92 |
bpf: propagate precision across all frames, not just the last one
When equivalent completed state is found and it has additional precision
restrictions, BPF verifier propagates precision to
currently-being-verified state chain (i.e., including parent states) so
that if some of the states in the chain are not yet completed, necessary
precision restrictions are enforced.
Unfortunately, right now this happens only for the last frame (deepest
active subprogram's frame), not all the frames. This can lead to
incorrect matching of states due to missing precision marker. Currently
this doesn't seem possible as BPF verifier forces everything to precise
when validated BPF program has any subprograms. But with the next patch
lifting this restriction, this becomes problematic.
In fact, without this fix, we'll start getting failure in one of the
existing test_verifier test cases:
#906/p precise: cross frame pruning FAIL
Unexpected success to load!
verification time 48 usec
stack depth 0+0
processed 26 insns (limit 1000000) max_states_per_insn 3 total_states 17 peak_states 17 mark_read 8
This patch adds precision propagation across all frames.
Fixes:
|
||
|
|
a3b666bfa9 |
bpf: propagate precision in ALU/ALU64 operations
When processing ALU/ALU64 operations (apart from BPF_MOV, which is
handled correctly already; and BPF_NEG and BPF_END are special and don't
have source register), if destination register is already marked
precise, this causes problem with potentially missing precision tracking
for the source register. E.g., when we have r1 >>= r5 and r1 is marked
precise, but r5 isn't, this will lead to r5 staying as imprecise. This
is due to the precision backtracking logic stopping early when it sees
r1 is already marked precise. If r1 wasn't precise, we'd keep
backtracking and would add r5 to the set of registers that need to be
marked precise. So there is a discrepancy here which can lead to invalid
and incompatible states matched due to lack of precision marking on r5.
If r1 wasn't precise, precision backtracking would correctly mark both
r1 and r5 as precise.
This is simple to fix, though. During the forward instruction simulation
pass, for arithmetic operations of `scalar <op>= scalar` form (where
<op> is ALU or ALU64 operations), if destination register is already
precise, mark source register as precise. This applies only when both
involved registers are SCALARs. `ptr += scalar` and `scalar += ptr`
cases are already handled correctly.
This does have (negative) effect on some selftest programs and few
Cilium programs. ~/baseline-tmp-results.csv are veristat results with
this patch, while ~/baseline-results.csv is without it. See post
scriptum for instructions on how to make Cilium programs testable with
veristat. Correctness has a price.
$ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/baseline-tmp-results.csv | grep -v '+0'
File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF)
----------------------- -------------------- --------------- --------------- ------------------ ---------------- ---------------- -------------------
bpf_cubic.bpf.linked1.o bpf_cubic_cong_avoid 997 1700 +703 (+70.51%) 62 90 +28 (+45.16%)
test_l4lb.bpf.linked1.o balancer_ingress 4559 5469 +910 (+19.96%) 118 126 +8 (+6.78%)
----------------------- -------------------- --------------- --------------- ------------------ ---------------- ---------------- -------------------
$ ./veristat -C -e file,prog,verdict,insns,states ~/baseline-results-cilium.csv ~/baseline-tmp-results-cilium.csv | grep -v '+0'
File Program Total insns (A) Total insns (B) Total insns (DIFF) Total states (A) Total states (B) Total states (DIFF)
------------- ------------------------------ --------------- --------------- ------------------ ---------------- ---------------- -------------------
bpf_host.o tail_nodeport_nat_ingress_ipv6 4448 5261 +813 (+18.28%) 234 247 +13 (+5.56%)
bpf_host.o tail_nodeport_nat_ipv6_egress 3396 3446 +50 (+1.47%) 201 203 +2 (+1.00%)
bpf_lxc.o tail_nodeport_nat_ingress_ipv6 4448 5261 +813 (+18.28%) 234 247 +13 (+5.56%)
bpf_overlay.o tail_nodeport_nat_ingress_ipv6 4448 5261 +813 (+18.28%) 234 247 +13 (+5.56%)
bpf_xdp.o tail_lb_ipv4 71736 73442 +1706 (+2.38%) 4295 4370 +75 (+1.75%)
------------- ------------------------------ --------------- --------------- ------------------ ---------------- ---------------- -------------------
P.S. To make Cilium ([0]) programs libbpf-compatible and thus
veristat-loadable, apply changes from topmost commit in [1], which does
minimal changes to Cilium source code, mostly around SEC() annotations
and BPF map definitions.
[0] https://github.com/cilium/cilium/
[1] https://github.com/anakryiko/cilium/commits/libbpf-friendliness
Fixes:
|
||
|
|
f71b2f6417 |
bpf: Refactor map->off_arr handling
Refactor map->off_arr handling into generic functions that can work on their own without hardcoding map specific code. The btf_fields_offs structure is now returned from btf_parse_field_offs, which can be reused later for types in program BTF. All functions like copy_map_value, zero_map_value call generic underlying functions so that they can also be reused later for copying to values allocated in programs which encode specific fields. Later, some helper functions will also require access to this btf_field_offs structure to be able to skip over special fields at runtime. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
db55911782 |
bpf: Consolidate spin_lock, timer management into btf_record
Now that kptr_off_tab has been refactored into btf_record, and can hold more than one specific field type, accomodate bpf_spin_lock and bpf_timer as well. While they don't require any more metadata than offset, having all special fields in one place allows us to share the same code for allocated user defined types and handle both map values and these allocated objects in a similar fashion. As an optimization, we still keep spin_lock_off and timer_off offsets in the btf_record structure, just to avoid having to find the btf_field struct each time their offset is needed. This is mostly needed to manipulate such objects in a map value at runtime. It's ok to hardcode just one offset as more than one field is disallowed. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
aa3496accc |
bpf: Refactor kptr_off_tab into btf_record
To prepare the BPF verifier to handle special fields in both map values and program allocated types coming from program BTF, we need to refactor the kptr_off_tab handling code into something more generic and reusable across both cases to avoid code duplication. Later patches also require passing this data to helpers at runtime, so that they can work on user defined types, initialize them, destruct them, etc. The main observation is that both map values and such allocated types point to a type in program BTF, hence they can be handled similarly. We can prepare a field metadata table for both cases and store them in struct bpf_map or struct btf depending on the use case. Hence, refactor the code into generic btf_record and btf_field member structs. The btf_record represents the fields of a specific btf_type in user BTF. The cnt indicates the number of special fields we successfully recognized, and field_mask is a bitmask of fields that were found, to enable quick determination of availability of a certain field. Subsequently, refactor the rest of the code to work with these generic types, remove assumptions about kptr and kptr_off_tab, rename variables to more meaningful names, etc. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
f2c24be55b |
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says: ==================== bpf 2022-11-04 We've added 8 non-merge commits during the last 3 day(s) which contain a total of 10 files changed, 113 insertions(+), 16 deletions(-). The main changes are: 1) Fix memory leak upon allocation failure in BPF verifier's stack state tracking, from Kees Cook. 2) Fix address leakage when BPF progs release reference to an object, from Youlin Li. 3) Fix BPF CI breakage from buggy in.h uapi header dependency, from Andrii Nakryiko. 4) Fix bpftool pin sub-command's argument parsing, from Pu Lehui. 5) Fix BPF sockmap lockdep warning by cancelling psock work outside of socket lock, from Cong Wang. 6) Follow-up for BPF sockmap to fix sk_forward_alloc accounting, from Wang Yufen. bpf-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add verifier test for release_reference() bpf: Fix wrong reg type conversion in release_reference() bpf, sock_map: Move cancel_work_sync() out of sock lock tools/headers: Pull in stddef.h to uapi to fix BPF selftests build in CI net/ipv4: Fix linux/in.h header dependencies bpftool: Fix NULL pointer dereference when pin {PROG, MAP, LINK} without FILE bpf, sockmap: Fix the sk->sk_forward_alloc warning of sk_stream_kill_queues bpf, verifier: Fix memory leak in array reallocation for stack state ==================== Link: https://lore.kernel.org/r/20221104000445.30761-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
a28ace782e |
bpf: Drop reg_type_may_be_refcounted_or_null
It is not scalable to maintain a list of types that can have non-zero ref_obj_id. It is never set for scalars anyway, so just remove the conditional on register types and print it whenever it is non-zero. Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221103191013.1236066-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
f5e477a861 |
bpf: Fix slot type check in check_stack_write_var_off
For the case where allow_ptr_leaks is false, code is checking whether
slot type is STACK_INVALID and STACK_SPILL and rejecting other cases.
This is a consequence of incorrectly checking for register type instead
of the slot type (NOT_INIT and SCALAR_VALUE respectively). Fix the
check.
Fixes:
|