linux

mirror of https://github.com/torvalds/linux.git synced 2025-12-07 20:06:24 +00:00

Author	SHA1	Message	Date
Lorenzo Stoakes	651fdda840	relay: update relay to use mmap_prepare It is relatively trivial to update this code to use the f_op->mmap_prepare hook in favour of the deprecated f_op->mmap hook, so do so. Link: https://lkml.kernel.org/r/7c9e82cdddf8b573ea3edb8cdb697363e3ccb5d7.1760959442.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Dave Young <dyoung@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicolas Pitre <nico@fluxnic.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Robin Murohy <robin.murphy@arm.com> Cc: Sumanth Korikkar <sumanthk@linux.ibm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:28:11 -08:00
Shakeel Butt	d929525c2e	memcg: net: track network throttling due to memcg memory pressure The kernel can throttle network sockets if the memory cgroup associated with the corresponding socket is under memory pressure. The throttling actions include clamping the transmit window, failing to expand receive or send buffers, aggressively prune out-of-order receive queue, FIN deferred to a retransmitted packet and more. Let's add memcg metric to track such throttling actions. At the moment memcg memory pressure is defined through vmpressure and in future it may be defined using PSI or we may add more flexible way for the users to define memory pressure, maybe through ebpf. However the potential throttling actions will remain the same, so this newly introduced metric will continue to track throttling actions irrespective of how memcg memory pressure is defined. Link: https://lkml.kernel.org/r/20251016161035.86161-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Daniel Sedlak <daniel.sedlak@cdn77.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:28:06 -08:00
Ryan Roberts	9ac09bb9fe	mm: consistently use current->mm in mm_get_unmapped_area() mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() / arch_get_unmapped_area_topdown(), both of which search current->mm for some free space. Neither take an mm_struct - they implicitly operate on current->mm. But the wrapper takes an mm_struct and uses it to decide whether to search bottom up or top down. All callers pass in current->mm for this, so everything is working consistently. But it feels like an accident waiting to happen; eventually someone will call that function with a different mm, expecting to find free space in it, but what gets returned is free space in the current mm. So let's simplify by removing the parameter and have the wrapper use current->mm to decide which end to start at. Now everything is consistent and self-documenting. Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-16 17:27:57 -08:00
Sourabh Jain	00fbff75c5	crash: fix crashkernel resource shrink When crashkernel is configured with a high reservation, shrinking its value below the low crashkernel reservation causes two issues: 1. Invalid crashkernel resource objects 2. Kernel crash if crashkernel shrinking is done twice For example, with crashkernel=200M,high, the kernel reserves 200MB of high memory and some default low memory (say 256MB). The reservation appears as: cat /proc/iomem \| grep -i crash af000000-beffffff : Crash kernel 433000000-43f7fffff : Crash kernel If crashkernel is then shrunk to 50MB (echo 52428800 > /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: af000000-beffffff : Crash kernel Instead, it should show 50MB: af000000-b21fffff : Crash kernel Further shrinking crashkernel to 40MB causes a kernel crash with the following trace (x86): BUG: kernel NULL pointer dereference, address: 0000000000000038 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI <snip...> Call Trace: <TASK> ? __die_body.cold+0x19/0x27 ? page_fault_oops+0x15a/0x2f0 ? search_module_extables+0x19/0x60 ? search_bpf_extables+0x5f/0x80 ? exc_page_fault+0x7e/0x180 ? asm_exc_page_fault+0x26/0x30 ? __release_resource+0xd/0xb0 release_resource+0x26/0x40 __crash_shrink_memory+0xe5/0x110 crash_shrink_memory+0x12a/0x190 kexec_crash_size_store+0x41/0x80 kernfs_fop_write_iter+0x141/0x1f0 vfs_write+0x294/0x460 ksys_write+0x6d/0xf0 <snip...> This happens because __crash_shrink_memory()/kernel/crash_core.c incorrectly updates the crashk_res resource object even when crashk_low_res should be updated. Fix this by ensuring the correct crashkernel resource object is updated when shrinking crashkernel memory. Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com Fixes: `16c6006af4` ("kexec: enable kexec_crash_size to support two crash kernel regions") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-15 10:52:01 -08:00
Pratyush Yadav	b05addf6f0	kho: warn and exit when unpreserved page wasn't preserved Calling __kho_unpreserve() on a pair of (pfn, end_pfn) that wasn't preserved is a bug. Currently, if that is done, the physxa or bits can be NULL. This results in a soft lockup since a NULL physxa or bits results in redoing the loop without ever making any progress. Return when physxa or bits are not found, but WARN first to loudly indicate invalid behaviour. Link: https://lkml.kernel.org/r/20251103180235.71409-3-pratyush@kernel.org Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:47 -08:00
Pratyush Yadav	7ecd2e439d	kho: fix unpreservation of higher-order vmalloc preservations kho_vmalloc_unpreserve_chunk() calls __kho_unpreserve() with end_pfn as pfn + 1. This happens to work for 0-order pages, but leaks higher order pages. For example, say order 2 pages back the allocation. During preservation, they get preserved in the order 2 bitmaps, but kho_vmalloc_unpreserve_chunk() would try to unpreserve them from the order 0 bitmaps, which should not have these bits set anyway, leaving the order 2 bitmaps untouched. This results in the pages being carried over to the next kernel. Nothing will free those pages in the next boot, leaking them. Fix this by taking the order into account when calculating the end PFN for __kho_unpreserve(). Link: https://lkml.kernel.org/r/20251103180235.71409-2-pratyush@kernel.org Fixes: `a667300bd5` ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:47 -08:00
Pratyush Yadav	0b07092d09	kho: fix out-of-bounds access of vmalloc chunk The list of pages in a vmalloc chunk is NULL-terminated. So when looping through the pages in a vmalloc chunk, both kho_restore_vmalloc() and kho_vmalloc_unpreserve_chunk() rightly make sure to stop when encountering a NULL page. But when the chunk is full, the loops do not stop and go past the bounds of chunk->phys, resulting in out-of-bounds memory access, and possibly the restoration or unpreservation of an invalid page. Fix this by making sure the processing of chunk stops at the end of the array. Link: https://lkml.kernel.org/r/20251103110159.8399-1-pratyush@kernel.org Fixes: `a667300bd5` ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:47 -08:00
Peter Oberparleiter	ec4d11fc4b	gcov: add support for GCC 15 Using gcov on kernels compiled with GCC 15 results in truncated 16-byte long .gcda files with no usable data. To fix this, update GCOV_COUNTERS to match the value defined by GCC 15. Tested with GCC 14.3.0 and GCC 15.2.0. Link: https://lkml.kernel.org/r/20251028115125.1319410-1-oberpar@linux.ibm.com Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reported-by: Matthieu Baerts <matttbe@kernel.org> Closes: https://github.com/linux-test-project/lcov/issues/445 Tested-by: Matthieu Baerts <matttbe@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:44 -08:00
Pasha Tatashin	fa759cd75b	kho: allocate metadata directly from the buddy allocator KHO allocates metadata for its preserved memory map using the slab allocator via kzalloc(). This metadata is temporary and is used by the next kernel during early boot to find preserved memory. A problem arises when KFENCE is enabled. kzalloc() calls can be randomly intercepted by kfence_alloc(), which services the allocation from a dedicated KFENCE memory pool. This pool is allocated early in boot via memblock. When booting via KHO, the memblock allocator is restricted to a "scratch area", forcing the KFENCE pool to be allocated within it. This creates a conflict, as the scratch area is expected to be ephemeral and overwriteable by a subsequent kexec. If KHO metadata is placed in this KFENCE pool, it leads to memory corruption when the next kernel is loaded. To fix this, modify KHO to allocate its metadata directly from the buddy allocator instead of slab. Link: https://lkml.kernel.org/r/20251021000852.2924827-4-pasha.tatashin@soleen.com Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Matlack <dmatlack@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:42 -08:00
Pasha Tatashin	a2fff99f92	kho: increase metadata bitmap size to PAGE_SIZE KHO memory preservation metadata is preserved in 512 byte chunks which requires their allocation from slab allocator. Slabs are not safe to be used with KHO because of kfence, and because partial slabs may lead leaks to the next kernel. Change the size to be PAGE_SIZE. The kfence specifically may cause memory corruption, where it randomly provides slab objects that can be within the scratch area. The reason for that is that kfence allocates its objects prior to KHO scratch is marked as CMA region. While this change could potentially increase metadata overhead on systems with sparsely preserved memory, this is being mitigated by ongoing work to reduce sparseness during preservation via 1G guest pages. Furthermore, this change aligns with future work on a stateless KHO, which will also use page-sized bitmaps for its radix tree metadata. Link: https://lkml.kernel.org/r/20251021000852.2924827-3-pasha.tatashin@soleen.com Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:41 -08:00
Pasha Tatashin	e38f65d317	kho: warn and fail on metadata or preserved memory in scratch area Patch series "KHO: kfence + KHO memory corruption fix", v3. This series fixes a memory corruption bug in KHO that occurs when KFENCE is enabled. The root cause is that KHO metadata, allocated via kzalloc(), can be randomly serviced by kfence_alloc(). When a kernel boots via KHO, the early memblock allocator is restricted to a "scratch area". This forces the KFENCE pool to be allocated within this scratch area, creating a conflict. If KHO metadata is subsequently placed in this pool, it gets corrupted during the next kexec operation. Google is using KHO and have had obscure crashes due to this memory corruption, with stacks all over the place. I would prefer this fix to be properly backported to stable so we can also automatically consume it once we switch to the upstream KHO. Patch 1/3 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG) that adds checks to detect and fail any operation that attempts to place KHO metadata or preserved memory within the scratch area. This serves as a validation and diagnostic tool to confirm the problem without affecting production builds. Patch 2/3 Increases bitmap to PAGE_SIZE, so buddy allocator can be used. Patch 3/3 Provides the fix by modifying KHO to allocate its metadata directly from the buddy allocator instead of slab. This bypasses the KFENCE interception entirely. This patch (of 3): It is invalid for KHO metadata or preserved memory regions to be located within the KHO scratch area, as this area is overwritten when the next kernel is loaded, and used early in boot by the next kernel. This can lead to memory corruption. Add checks to kho_preserve_* and KHO's internal metadata allocators (xa_load_or_alloc, new_chunk) to verify that the physical address of the memory does not overlap with any defined scratch region. If an overlap is detected, the operation will fail and a WARN_ON is triggered. To avoid performance overhead in production kernels, these checks are enabled only when CONFIG_KEXEC_HANDOVER_DEBUG is selected. [rppt@kernel.org: fix KEXEC_HANDOVER_DEBUG Kconfig dependency] Link: https://lkml.kernel.org/r/aQHUyyFtiNZhx8jo@kernel.org [pasha.tatashin@soleen.com: build fix] Link: https://lkml.kernel.org/r/CA+CK2bBnorfsTymKtv4rKvqGBHs=y=MjEMMRg_tE-RME6n-zUw@mail.gmail.com Link: https://lkml.kernel.org/r/20251021000852.2924827-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20251021000852.2924827-2-pasha.tatashin@soleen.com Fixes: `fc33e4b44b` ("kexec: enable KHO support for memory preservation") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Mike Rapoport <rppt@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Samiullah Khawaja <skhawaja@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-11-09 21:19:41 -08:00
Linus Torvalds	b5c0946029	Merge tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a group-throttling bug in the fair scheduler" * tag 'sched-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining	2025-11-08 08:59:05 -08:00
Linus Torvalds	133262cae9	Merge tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Ingo Molnar: "Fix a system hang caused by cpu-clock events deadlock" * tag 'perf-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix system hang caused by cpu-clock usage	2025-11-08 08:54:13 -08:00
Linus Torvalds	e6f55fe790	Merge tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix (well, cut in half) a futex performance regression on PowerPC" * tag 'locking-urgent-2025-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Optimize per-cpu reference counting	2025-11-08 08:51:22 -08:00
Linus Torvalds	5b95a50001	Merge tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Check for reader catching up in ring_buffer_map_get_reader() If the reader catches up to the writer in the memory mapped ring buffer then calling rb_get_reader_page() will return NULL as there's no pages left. But this isn't checked for before calling rb_get_reader_page() and the return of NULL causes a warning. If it is detected that the reader caught up to the writer, then simply exit the routine - Fix memory leak in histogram create_field_var() The couple of the error paths in create_field_var() did not properly clean up what was allocated. Make sure everything is freed properly on error - Fix help message of tools latency_collector The help message incorrectly stated that "-t" was the same as "--threads" whereas "--threads" is actually represented by "-e" * tag 'trace-v6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/tools: Fix incorrcet short option in usage text for --threads tracing: Fix memory leaks in create_field_var() ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up	2025-11-07 08:07:11 -08:00
Zilin Guan	80f0d631dc	tracing: Fix memory leaks in create_field_var() The function create_field_var() allocates memory for 'val' through create_hist_field() inside parse_atom(), and for 'var' through create_var(), which in turn allocates var->type and var->var.name internally. Simply calling kfree() to release these structures will result in memory leaks. Use destroy_hist_field() to properly free 'val', and explicitly release the memory of var->type and var->var.name before freeing 'var' itself. Link: https://patch.msgid.link/20251106120132.3639920-1-zilin@seu.edu.cn Fixes: `02205a6752` ("tracing: Add support for 'field variables'") Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-06 19:51:33 -05:00
Steven Rostedt	aa997d2d2a	ring-buffer: Do not warn in ring_buffer_map_get_reader() when reader catches up The function ring_buffer_map_get_reader() is a bit more strict than the other get reader functions, and except for certain situations the rb_get_reader_page() should not return NULL. If it does, it triggers a warning. This warning was triggering but after looking at why, it was because another acceptable situation was happening and it wasn't checked for. If the reader catches up to the writer and there's still data to be read on the reader page, then the rb_get_reader_page() will return NULL as there's no new page to get. In this situation, the reader page should not be updated and no warning should trigger. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Reported-by: syzbot+92a3745cea5ec6360309@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/690babec.050a0220.baf87.0064.GAE@google.com/ Link: https://lore.kernel.org/20251016132848.1b11bb37@gandalf.local.home Fixes: `117c39200d` ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2025-11-06 19:38:54 -05:00
Masami Hiramatsu (Google)	c91afa7610	tracing: tprobe-events: Fix to put tracepoint_user when disable the tprobe __unregister_trace_fprobe() checks tf->tuser to put it when removing tprobe. However, disable_trace_fprobe() does not use it and only calls unregister_fprobe(). Thus it forgets to disable tracepoint_user. If the trace_fprobe has tuser, put it for unregistering the tracepoint callbacks when disabling tprobe correctly. Link: https://lore.kernel.org/all/176244794466.155515.3971904050506100243.stgit@devnote2/ Fixes: `2867495dea` ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>	2025-11-07 07:36:20 +09:00
Masami Hiramatsu (Google)	10d9dda426	tracing: tprobe-events: Fix to register tracepoint correctly Since __tracepoint_user_init() calls tracepoint_user_register() without initializing tuser->tpoint with given tracpoint, it does not register tracepoint stub function as callback correctly, and tprobe does not work. Initializing tuser->tpoint correctly before tracepoint_user_register() so that it sets up tracepoint callback. I confirmed below example works fine again. echo "t sched_switch preempt prev_pid=prev->pid next_pid=next->pid" > /sys/kernel/tracing/dynamic_events echo 1 > /sys/kernel/tracing/events/tracepoints/sched_switch/enable cat /sys/kernel/tracing/trace_pipe Link: https://lore.kernel.org/all/176244793514.155515.6466348656998627773.stgit@devnote2/ Fixes: `2867495dea` ("tracing: tprobe-events: Register tracepoint when enable tprobe event") Reported-by: Beau Belgrave <beaub@linux.microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Tested-by: Beau Belgrave <beaub@linux.microsoft.com> Reviewed-by: Beau Belgrave <beaub@linux.microsoft.com>	2025-11-07 07:32:55 +09:00
Peter Zijlstra	4cb5ac2626	futex: Optimize per-cpu reference counting Shrikanth noted that the per-cpu reference counter was still some 10% slower than the old immutable option (which removes the reference counting entirely). Further optimize the per-cpu reference counter by: - switching from RCU to preempt; - using __this_cpu_() since we now have preempt disabled; - switching from smp_load_acquire() to READ_ONCE(). This is all safe because disabling preemption inhibits the RCU grace period exactly like rcu_read_lock(). Having preemption disabled allows using __this_cpu_() provided the only access to the variable is in task context -- which is the case here. Furthermore, since we know changing fph->state to FR_ATOMIC demands a full RCU grace period we can rely on the implied smp_mb() from that to replace the acquire barrier(). This is very similar to the percpu_down_read_internal() fast-path. The reason this is significant for PowerPC is that it uses the generic this_cpu_() implementation which relies on local_irq_disable() (the x86 implementation relies on it being a single memop instruction to be IRQ-safe). Switching to preempt_disable() and __this_cpu() avoids this IRQ state swizzling. Also, PowerPC needs LWSYNC for the ACQUIRE barrier, not having to use explicit barriers safes a bunch. Combined this reduces the performance gap by half, down to some 5%. Fixes: `760e6f7bef` ("futex: Remove support for IMMUTABLE") Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20251106092929.GR4067720@noisy.programming.kicks-ass.net	2025-11-06 12:30:54 +01:00
Aaron Lu	956dfda6a7	sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining When a cfs_rq is to be throttled, its limbo list should be empty and that's why there is a warn in tg_throttle_down() for non empty cfs_rq->throttled_limbo_list. When running a test with the following hierarchy: root / \ A* ... / \| \ ... B / \ C* where both A and C have quota settings, that warn on non empty limbo list is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu part of the cfs_rq for the sake of simpler representation). Debug showed it happened like this: Task group C is created and quota is set, so in tg_set_cfs_bandwidth(), cfs_rq_c is initialized with runtime_enabled set, runtime_remaining equals to 0 and unthrottled. Before any tasks are enqueued to cfs_rq_c, multiple throttled tasks can migrate to cfs_rq_c (e.g., due to task group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled), these throttled tasks are directly placed into cfs_rq_c's limbo list by enqueue_throttled_task(). Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these tasks. The first enqueue triggers check_enqueue_throttle(), and with zero runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it can't get more runtime and enters tg_throttle_down(), where the warning is hit due to remaining tasks in the limbo list. I think it's a chaos to trigger throttle on unthrottle path, the status of a being unthrottled cfs_rq can be in a mixed state in the end, so fix this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures cfs_rq_c has a positive runtime_remaining when initialized as unthrottled and cannot enter tg_unthrottle_up() with zero runtime_remaining. Also, update outdated comments in tg_throttle_down() since unthrottle_cfs_rq() is no longer called with zero runtime_remaining. While at it, remove a redundant assignment to se in tg_throttle_down(). Fixes: `e1fad12dcb` ("sched/fair: Switch to task based throttle model") Reviewed-By: Benjamin Segall <bsegall@google.com> Suggested-by: Benjamin Segall <bsegall@google.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Hao Jia <jiahao1@lixiang.com> Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com	2025-11-06 12:30:52 +01:00
Dapeng Mi	eb3182ef04	perf/core: Fix system hang caused by cpu-clock usage cpu-clock usage by the async-profiler tool can trigger a system hang, which got bisected back to the following commit by Octavia Togami: `18dbcbfabf` ("perf: Fix the POLL_HUP delivery breakage") causes this issue The root cause of the hang is that cpu-clock is a special type of SW event which relies on hrtimers. The __perf_event_overflow() callback is invoked from the hrtimer handler for cpu-clock events, and __perf_event_overflow() tries to call cpu_clock_event_stop() to stop the event, which calls htimer_cancel() to cancel the hrtimer. But that's a recursion into the hrtimer code from a hrtimer handler, which (unsurprisingly) deadlocks. To fix this bug, use hrtimer_try_to_cancel() instead, and set the PERF_HES_STOPPED flag, which causes perf_swevent_hrtimer() to stop the event once it sees the PERF_HES_STOPPED flag. [ mingo: Fixed the comments and improved the changelog. ] Closes: https://lore.kernel.org/all/CAHPNGSQpXEopYreir+uDDEbtXTBvBvi8c6fYXJvceqtgTPao3Q@mail.gmail.com/ Fixes: `18dbcbfabf` ("perf: Fix the POLL_HUP delivery breakage") Reported-by: Octavia Togami <octavia.togami@gmail.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Octavia Togami <octavia.togami@gmail.com> Cc: stable@vger.kernel.org Link: https://github.com/lucko/spark/issues/530 Link: https://patch.msgid.link/20251015051828.12809-1-dapeng1.mi@linux.intel.com	2025-11-03 11:04:19 +01:00
Linus Torvalds	ba36dd5ee6	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Alexei Starovoitov: - Mark migrate_disable/enable() as always_inline to avoid issues with partial inlining (Yonghong Song) - Fix powerpc stack register definition in libbpf bpf_tracing.h (Andrii Nakryiko) - Reject negative head_room in __bpf_skb_change_head (Daniel Borkmann) - Conditionally include dynptr copy kfuncs (Malin Jonsson) - Sync pending IRQ work before freeing BPF ring buffer (Noorain Eqbal) - Do not audit capability check in x86 do_jit() (Ondrej Mosnacek) - Fix arm64 JIT of BPF_ST insn when it writes into arena memory (Puranjay Mohan) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf/arm64: Fix BPF_ST into arena memory bpf: Make migrate_disable always inline to avoid partial inlining bpf: Reject negative head_room in __bpf_skb_change_head bpf: Conditionally include dynptr copy kfuncs libbpf: Fix powerpc's stack register definition in bpf_tracing.h bpf: Do not audit capability check in do_jit() bpf: Sync pending IRQ work before freeing ring buffer	2025-10-31 18:22:26 -07:00
Linus Torvalds	a5dbbb39e1	Merge tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix three regressions, two recent ones and one introduced during the 6.17 development cycle: - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one to address a performance regression (Rafael Wysocki) - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu) - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki)" * tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: sleep: Allow pm_restrict_gfp_mask() stacking cpuidle: governors: menu: Select polling state in some more cases Revert "PM: sleep: Make pm_wakeup_clear() call more clear"	2025-10-30 19:02:16 -07:00
Rafael J. Wysocki	590c5cd106	Merge branches 'pm-cpuidle' and 'pm-sleep' Merge a cpuidle fix and two fixes related to system sleep for 6.18-rc4: - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one to address a performance regression (Rafael Wysocki) - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu) - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki) * pm-cpuidle: cpuidle: governors: menu: Select polling state in some more cases * pm-sleep: PM: sleep: Allow pm_restrict_gfp_mask() stacking Revert "PM: sleep: Make pm_wakeup_clear() call more clear"	2025-10-30 20:25:18 +01:00
Rafael J. Wysocki	35e4a69b20	PM: sleep: Allow pm_restrict_gfp_mask() stacking Allow pm_restrict_gfp_mask() to be called many times in a row to avoid issues with calling dpm_suspend_start() when the GFP mask has been already restricted. Only the first invocation of pm_restrict_gfp_mask() will actually restrict the GFP mask and the subsequent calls will warn if there is a mismatch between the expected allowed GFP mask and the actual one. Moreover, if pm_restrict_gfp_mask() is called many times in a row, pm_restore_gfp_mask() needs to be called matching number of times in a row to actually restore the GFP mask. Calling it when the GFP mask has not been restricted will cause it to warn. This is necessary for the GFP mask restriction starting in hibernation_snapshot() to continue throughout the entire hibernation flow until it completes or it is aborted (either by a wakeup event or by an error). Fixes: `449c9c0253` ("PM: hibernate: Restrict GFP mask in hibernation_snapshot()") Fixes: `469d80a371` ("PM: hibernate: Fix hybrid-sleep") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251025050812.421905-1-safinaskar@gmail.com/ Link: https://lore.kernel.org/linux-pm/20251028111730.2261404-1-safinaskar@gmail.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Mario Limonciello (AMD) <superm1@kernel.org> Cc: 6.16+ <stable@vger.kernel.org> # 6.16+ Link: https://patch.msgid.link/5935682.DvuYhMxLoT@rafael.j.wysocki	2025-10-29 18:55:32 +01:00
Linus Torvalds	fd57572253	Merge tag 'sched_ext-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix scx_kick_pseqs corruption when multiple schedulers are loaded concurrently - Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() to handle systems with large CPU counts - Defer queue_balance_callback() until after ops.dispatch to fix callback ordering issues - Sync error_irq_work before freeing scx_sched to prevent use-after-free - Mark scx_bpf_dsq_move_set_[slice\|vtime]() with KF_RCU for proper RCU protection - Fix flag check for deferred callbacks * tag 'sched_ext-for-6.18-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: fix flag check for deferred callbacks sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() sched_ext: defer queue_balance_callback() until after ops.dispatch sched_ext: Sync error_irq_work before freeing scx_sched sched_ext: Mark scx_bpf_dsq_move_set_[slice\|vtime]() with KF_RCU	2025-10-27 10:52:18 -07:00
Linus Torvalds	5fee0dafba	Merge tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Borislav Petkov: - Restore the original buslock locking in a couple of places in the irq core subsystem after a rework * tag 'irq_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/manage: Add buslock back in to enable_irq() genirq/manage: Add buslock back in to __disable_irq_nosync() genirq/chip: Add buslock back in to irq_set_handler()	2025-10-26 09:54:36 -07:00
Linus Torvalds	1bc9743b64	Merge tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Make sure a CFS runqueue on a throttled hierarchy has its PELT clock throttled otherwise task movement and manipulation would lead to dangling cfs_rq references and an eventual crash * tag 'sched_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled	2025-10-26 09:42:19 -07:00
Linus Torvalds	7ea5092f52	Merge tag 'timers_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Do not create more than eight (max supported) AUX clocks sysfs hierarchies * tag 'timers_urgent_for_v6.18_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix aux clocks sysfs initialization loop bound	2025-10-26 09:40:16 -07:00
Andy Shevchenko	53abe3e1c1	sched: Remove never used code in mm_cid_get() Clang is not happy with set but unused variable (this is visible with `make W=1` build: kernel/sched/sched.h:3744:18: error: variable 'cpumask' set but not used [-Werror,-Wunused-but-set-variable] It seems like the variable was never used along with the assignment that does not have side effects as far as I can see. Remove those altogether. Fixes: `223baf9d17` ("sched: Fix performance regression introduced by mm_cid") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Breno Leitao <leitao@debian.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2025-10-24 16:55:46 -07:00
Malin Jonsson	8ce93aabbf	bpf: Conditionally include dynptr copy kfuncs Since commit `a498ee7576` ("bpf: Implement dynptr copy kfuncs"), if CONFIG_BPF_EVENTS is not enabled, but BPF_SYSCALL and DEBUG_INFO_BTF are, the build will break like so: BTFIDS vmlinux.unstripped WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_user_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_probe_read_kernel_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_task_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_str_dynptr WARN: resolve_btfids: unresolved symbol bpf_copy_from_user_dynptr make[2]: * [scripts/Makefile.vmlinux:72: vmlinux.unstripped] Error 255 make[2]: * Deleting file 'vmlinux.unstripped' make[1]: * [/repo/malin/upstream/linux/Makefile:1242: vmlinux] Error 2 make: * [Makefile:248: __sub-make] Error 2 Guard these symbols with #ifdef CONFIG_BPF_EVENTS to resolve the problem. Fixes: `a498ee7576` ("bpf: Implement dynptr copy kfuncs") Reported-by: Yong Gu <yong.g.gu@ericsson.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Malin Jonsson <malin.jonsson@est.tech> Link: https://lore.kernel.org/r/20251024151436.139131-1-malin.jonsson@est.tech Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-24 09:44:47 -07:00
Charles Keepax	ef3330b99c	genirq/manage: Add buslock back in to enable_irq() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: `bddd10c554` ("genirq/manage: Rework enable_irq()") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-4-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Charles Keepax	56363e25f7	genirq/manage: Add buslock back in to __disable_irq_nosync() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: `1b74444467` ("genirq/manage: Rework __disable_irq_nosync()") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-3-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Charles Keepax	5d7e45dd67	genirq/chip: Add buslock back in to irq_set_handler() The locking was changed from a buslock to a plain lock, but the patch description states there was no functional change. Assuming this was accidental so reverting to using the buslock. Fixes: `5cd05f3e23` ("genirq/chip: Rework irq_set_handler() variants") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251023154901.1333755-2-ckeepax@opensource.cirrus.com	2025-10-24 11:38:39 +02:00
Linus Torvalds	5121062e83	Merge tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "A couple of fixes for Runtime Verification: - A bug caused a kernel panic when reading enabled_monitors was reported. Change callback functions to always use list_head iterators and by doing so, fix the wrong pointer that was leading to the panic. - The rtapp/pagefault monitor relies on the MMU to be present (pagefaults exist) but that was not enforced via kconfig, leading to potential build errors on systems without an MMU. Add that kconfig dependency" * tag 'trace-rv-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Make rtapp/pagefault monitor depends on CONFIG_MMU rv: Fully convert enabled_monitors to use list_head as iterator	2025-10-23 16:50:25 -07:00
Samuel Wu	79816d4b9e	Revert "PM: sleep: Make pm_wakeup_clear() call more clear" This reverts commit `56a232d93c`. The above commit changed the position of pm_wakeup_clear() for the suspend call path, but other call paths with references to freeze_processes() were not updated. This means that other call paths, such as hibernate(), will not have pm_wakeup_clear() called. Suggested-by: Saravana Kannan <saravanak@google.com> Signed-off-by: Samuel Wu <wusamuel@google.com> [ rjw: Changelog edits ] Link: https://patch.msgid.link/20251022222830.634086-1-wusamuel@google.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2025-10-23 12:48:04 +02:00
Linus Torvalds	0f3ad9c610	Merge tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "17 hotfixes. 12 are cc:stable and 14 are for MM. There's a two-patch DAMON series from SeongJae Park which addresses a missed check and possible memory leak. Apart from that it's all singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2025-10-22-12-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: csky: abiv2: adapt to new folio flags field mm/damon/core: use damos_commit_quota_goal() for new goal commit mm/damon/core: fix potential memory leak by cleaning ops_filter in damon_destroy_scheme hugetlbfs: move lock assertions after early returns in huge_pmd_unshare() vmw_balloon: indicate success when effectively deflating during migration mm/damon/core: fix list_add_tail() call on damon_call() mm/mremap: correctly account old mapping after MREMAP_DONTUNMAP remap mm: prevent poison consumption when splitting THP ocfs2: clear extent cache after moving/defragmenting extents mm: don't spin in add_stack_record when gfp flags don't allow dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC mm/damon/sysfs: dealloc commit test ctx always mm/damon/sysfs: catch commit test ctx alloc failure hung_task: fix warnings caused by unaligned lock pointers	2025-10-22 14:57:35 -10:00
K Prateek Nayak	0e4a169d1a	sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled Matteo reported hitting the assert_list_leaf_cfs_rq() warning from enqueue_task_fair() post commit `fe8d238e64` ("sched/fair: Propagate load for throttled cfs_rq") which transitioned to using cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in propagate_entity_cfs_rq(). The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all its descendants should have their PELT frozen too or weird things can happen as a result of children accumulating PELT signals when the parents have their PELT clock stopped. Another side effect of this is the loss of integrity of the leaf cfs_rq list. As debugged by Aaron, consider the following hierarchy: root(#) / \ A(#) B() \| C <--- new cgroup \| D <--- new cgroup # - Already on leaf cfs_rq list - Throttled with PELT frozen The newly created cgroups don't have their "pelt_clock_throttled" signal synced with cgroup B. Next, the following series of events occur: 1. online_fair_sched_group() for cgroup D will call propagate_entity_cfs_rq(). (Same can happen if a throttled task is moved to cgroup C and enqueue_task_fair() returns early.) propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to "rq->tmp_alone_branch" since its PELT clock is not marked throttled and cfs_rq of cgroup B is not on the list. cfs_rq of cgroup B is skipped since its PELT is throttled. root cfs_rq already exists on cfs_rq leading to list_add_leaf_cfs_rq() returning early. The cfs_rq of cgroup C is left dangling on the "rq->tmp_alone_branch". 2. A new task wakes up on cgroup A. Since the whole hierarchy is already on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early without any modifications to "rq->tmp_alone_branch". The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch". !!! Splat !!! Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not enough since the new cfs_rq is not yet enqueued on the hierarchy. A dequeue on other subtree on the throttled hierarchy can freeze the PELT clock for the parent hierarchy without setting the indicators for this newly added cfs_rq which was never enqueued. Since there are no tasks on the new hierarchy, start a cfs_rq on a throttled hierarchy with its PELT clock throttled. The first enqueue, or the distribution (whichever happens first) will unfreeze the PELT clock and queue the cfs_rq on the leaf cfs_rq list. While at it, add an assert_list_leaf_cfs_rq() in propagate_entity_cfs_rq() to catch such cases in the future. Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/ Fixes: `e1fad12dcb` ("sched/fair: Switch to task based throttle model") Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Suggested-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Link: https://patch.msgid.link/20251021053522.37583-1-kprateek.nayak@amd.com	2025-10-22 15:21:52 +02:00
Noorain Eqbal	4e90776383	bpf: Sync pending IRQ work before freeing ring buffer Fix a race where irq_work can be queued in bpf_ringbuf_commit() but the ring buffer is freed before the work executes. In the syzbot reproducer, a BPF program attached to sched_switch triggers bpf_ringbuf_commit(), queuing an irq_work. If the ring buffer is freed before this work executes, the irq_work thread may accesses freed memory. Calling `irq_work_sync(&rb->work)` ensures that all pending irq_work complete before freeing the buffer. Fixes: `457f44363a` ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2617fc732430968b45d2 Tested-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com Signed-off-by: Noorain Eqbal <nooraineqbal@gmail.com> Link: https://lore.kernel.org/r/20251020180301.103366-1-nooraineqbal@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2025-10-21 09:57:48 -07:00
Linus Torvalds	6548d364a3	Merge tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix seqcount lockdep assertion failure in cgroup freezer on PREEMPT_RT. Plain seqcount_t expects preemption disabled, but PREEMPT_RT spinlocks don't disable preemption. Switch to seqcount_spinlock_t to properly associate css_set_lock with the freeze timing seqcount. - Misc changes including kernel-doc warning fix for misc_res_type enum and improved selftest diagnostics. * tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/misc: fix misc_res_type kernel-doc warning selftests: cgroup: Use values_close_report in test_cpu selftests: cgroup: add values_close_report helper cgroup: Fix seqcount lockdep assertion in cgroup freezer	2025-10-20 09:41:27 -10:00
Haofeng Li	39a9ed0fb6	timekeeping: Fix aux clocks sysfs initialization loop bound The loop in tk_aux_sysfs_init() uses `i <= MAX_AUX_CLOCKS` as the termination condition, which results in 9 iterations (i=0 to 8) when MAX_AUX_CLOCKS is defined as 8. However, the kernel is designed to support only up to 8 auxiliary clocks. This off-by-one error causes the creation of a 9th sysfs entry that exceeds the intended auxiliary clock range. Fix the loop bound to use `i < MAX_AUX_CLOCKS` to ensure exactly 8 auxiliary clock entries are created, matching the design specification. Fixes: `7b95663a3d` ("timekeeping: Provide interface to control auxiliary clocks") Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/tencent_2376993D9FC06A3616A4F981B3DE1C599607@qq.com	2025-10-20 19:56:12 +02:00
Nam Cao	3d62f95bd8	rv: Make rtapp/pagefault monitor depends on CONFIG_MMU There is no page fault without MMU. Compiling the rtapp/pagefault monitor without CONFIG_MMU fails as page fault tracepoints' definitions are not available. Make rtapp/pagefault monitor depends on CONFIG_MMU. Fixes: `9162620eb6` ("rv: Add rtapp_pagefault monitor") Signed-off-by: Nam Cao <namcao@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202509260455.6Z9Vkty4-lkp@intel.com/ Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082317.973839-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-10-20 12:47:40 +02:00
Nam Cao	103541e6a5	rv: Fully convert enabled_monitors to use list_head as iterator The callbacks in enabled_monitors_seq_ops are inconsistent. Some treat the iterator as struct rv_monitor , while others treat the iterator as struct list_head . This causes a wrong type cast and crashes the system as reported by Nathan. Convert everything to use struct list_head * as iterator. This also makes enabled_monitors consistent with available_monitors. Fixes: `de090d1cca` ("rv: Fix wrong type cast in enabled_monitors_next()") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/linux-trace-kernel/20250923002004.GA2836051@ax162/ Signed-off-by: Nam Cao <namcao@linutronix.de> Cc: stable@vger.kernel.org Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20251002082235.973099-1-namcao@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>	2025-10-20 12:47:40 +02:00
Linus Torvalds	d9043c79ba	Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Make sure the check for lost pelt idle time is done unconditionally to have correct lost idle time accounting - Stop the deadline server task before a CPU goes offline * tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix pelt lost idle time detection sched/deadline: Stop dl_server before CPU goes offline	2025-10-19 04:59:43 -10:00
Linus Torvalds	343b4b44a1	Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Make sure perf reporting works correctly in setups using overlayfs or FUSE - Move the uprobe optimization to a better location logically * tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix MMAP2 event device with backing files perf/core: Fix MMAP event path names with backing files perf/core: Fix address filter match with backing files uprobe: Move arch_uprobe_optimize right after handlers execution	2025-10-19 04:54:08 -10:00
Emil Tsalapatis	a3c4a0a42e	sched_ext: fix flag check for deferred callbacks When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance(). Fixes: `a8ad873113` ("sched_ext: defer queue_balance_callback() until after ops.dispatch") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2025-10-16 08:34:00 -10:00
Shardul Bankar	f6fddc6df3	bpf: Fix memory leak in __lookup_instance error path When __lookup_instance() allocates a func_instance structure but fails to allocate the must_write_set array, it returns an error without freeing the previously allocated func_instance. This causes a memory leak of 192 bytes (sizeof(struct func_instance)) each time this error path is triggered. Fix by freeing 'result' on must_write_set allocation failure. Fixes: `b3698c356a` ("bpf: callchain sensitive stack liveness tracking using CFG") Reported-by: BPF Runtime Fuzzer (BRF) Signed-off-by: Shardul Bankar <shardulsb08@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com	2025-10-16 10:45:17 -07:00
Marek Szyprowski	03521c892b	dma-debug: don't report false positives with DMA_BOUNCE_UNALIGNED_KMALLOC Commit `370645f41e` ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned") introduced DMA_BOUNCE_UNALIGNED_KMALLOC feature and permitted architecture specific code configure kmalloc slabs with sizes smaller than the value of dma_get_cache_alignment(). When that feature is enabled, the physical address of some small kmalloc()-ed buffers might be not aligned to the CPU cachelines, thus not really suitable for typical DMA. To properly handle that case a SWIOTLB buffer bouncing is used, so no CPU cache corruption occurs. When that happens, there is no point reporting a false-positive DMA-API warning that the buffer is not properly aligned, as this is not a client driver fault. [m.szyprowski@samsung.com: replace is_swiotlb_allocated() with is_swiotlb_active(), per Catalin] Link: https://lkml.kernel.org/r/20251010173009.3916215-1-m.szyprowski@samsung.com Link: https://lkml.kernel.org/r/20251009141508.2342138-1-m.szyprowski@samsung.com Fixes: `370645f41e` ("dma-mapping: force bouncing if the kmalloc() size is not cache-line-aligned") Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Robin Murohy <robin.murphy@arm.com> Cc: "Isaac J. Manjarres" <isaacmanjarres@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-10-15 13:24:33 -07:00
Alexei Starovoitov	5fb750e8a9	bpf: Replace bpf_map_kmalloc_node() with kmalloc_nolock() to allocate bpf_async_cb structures. The following kmemleak splat: [ 8.105530] kmemleak: Trying to color unknown object at 0xff11000100e918c0 as Black [ 8.106521] Call Trace: [ 8.106521] <TASK> [ 8.106521] dump_stack_lvl+0x4b/0x70 [ 8.106521] kvfree_call_rcu+0xcb/0x3b0 [ 8.106521] ? hrtimer_cancel+0x21/0x40 [ 8.106521] bpf_obj_free_fields+0x193/0x200 [ 8.106521] htab_map_update_elem+0x29c/0x410 [ 8.106521] bpf_prog_cfc8cd0f42c04044_overwrite_cb+0x47/0x4b [ 8.106521] bpf_prog_8c30cd7c4db2e963_overwrite_timer+0x65/0x86 [ 8.106521] bpf_prog_test_run_syscall+0xe1/0x2a0 happens due to the combination of features and fixes, but mainly due to commit `6d78b4473c` ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()") It's using __GFP_HIGH, which instructs slub/kmemleak internals to skip kmemleak_alloc_recursive() on allocation, so subsequent kfree_rcu()-> kvfree_call_rcu()->kmemleak_ignore() complains with the above splat. To fix this imbalance, replace bpf_map_kmalloc_node() with kmalloc_nolock() and kfree_rcu() with call_rcu() + kfree_nolock() to make sure that the objects allocated with kmalloc_nolock() are freed with kfree_nolock() rather than the implicit kfree() that kfree_rcu() uses internally. Note, the kmalloc_nolock() happens under bpf_spin_lock_irqsave(), so it will always fail in PREEMPT_RT. This is not an issue at the moment, since bpf_timers are disabled in PREEMPT_RT. In the future bpf_spin_lock will be replaced with state machine similar to bpf_task_work. Fixes: `6d78b4473c` ("bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/bpf/20251015000700.28988-1-alexei.starovoitov@gmail.com	2025-10-15 12:22:22 +02:00

1 2 3 4 5 ...

49594 Commits