linux

mirror of https://github.com/torvalds/linux.git synced 2025-12-07 20:06:24 +00:00

Author	SHA1	Message	Date
Lance Yang	6ce3bc990c	mm: skip mlocked THPs that are underused early in deferred_split_scan() When we stumble over a fully-mapped mlocked THP in the deferred shrinker, it does not make sense to try to detect whether it is underused, because try_to_map_unused_to_zeropage(), called while splitting the folio, will not actually replace any zeroed pages by the shared zeropage. Splitting the folio in that case does not make any sense, so let's not even scan to check if the folio is underused. Link: https://lkml.kernel.org/r/20250908090741.61519-1-lance.yang@linux.dev Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Kiryl Shutsemau <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:30 -07:00
Francois Dugast	10b9feee2d	mm/hmm: populate PFNs from PMD swap entry Once support for THP migration of zone device pages is enabled, device private swap entries will be found during the walk not only for PTEs but also for PMDs. Therefore, it is necessary to extend to PMDs the special handling which is already in place for PTEs when device private pages are owned by the caller: instead of faulting or skipping the range, the correct behavior is to use the swap entry to populate HMM PFNs. This change is a prerequisite to make use of device-private THP in drivers using drivers/gpu/drm/drm_pagemap, such as xe. Even though subsequent PFNs can be inferred when handling large order PFNs, the PFN list is still fully populated because this is currently expected by HMM users. In case this changes in the future, that is all HMM users support a sparsely populated PFN list, the for() loop can be made to skip remaining PFNs for the current order. A quick test shows the loop takes about 10 ns, roughly 20 times faster than without this optimization. Link: https://lkml.kernel.org/r/20250908091052.612303-1-francois.dugast@intel.com Signed-off-by: Francois Dugast <francois.dugast@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Leon Romanovsky <leonro@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: David Airlie <airlied@gmail.com> Cc: Christian König <christian.koenig@amd.com> Cc: Mika Penttilä <mpenttil@redhat.com> Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:29 -07:00
David Hildenbrand	7cad96ae59	mm/gup: fix handling of errors from arch_make_folio_accessible() in follow_page_pte() In case we call arch_make_folio_accessible() and it fails, we would incorrectly return a value that is "!= 0" to the caller, indicating that we pinned all requested pages and that the caller can keep going. follow_page_pte() is not supposed to return error values, but instead "0" on failure and "1" on success -- we'll clean that up separately. In case we return "!= 0", the caller will just keep going pinning more pages. If we happen to pin a page afterwards, we're in trouble, because we essentially skipped some pages in the requested range. Staring at the arch_make_folio_accessible() implementation on s390x, I assume it should actually never really fail unless something unexpected happens (BUG?). So let's not CC stable and just fix common code to do the right thing. Clean up the code a bit now that there is no reason to store the return value of arch_make_folio_accessible(). Link: https://lkml.kernel.org/r/20250908094517.303409-1-david@redhat.com Fixes: `f28d43636d` ("mm/gup/writeback: add callbacks for inaccessible pages") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:29 -07:00
Chanwon Park	e7a5f249e6	mm: re-enable kswapd when memory pressure subsides or demotion is toggled If kswapd fails to reclaim pages from a node MAX_RECLAIM_RETRIES in a row, kswapd on that node gets disabled. That is, the system won't wakeup kswapd for that node until page reclamation is observed at least once. That reclamation is mostly done by direct reclaim, which in turn enables kswapd back. However, on systems with CXL memory nodes, workloads with high anon page usage can disable kswapd indefinitely, without triggering direct reclaim. This can be reproduced with following steps: numa node 0 (32GB memory, 48 CPUs) numa node 2~5 (512GB CXL memory, 128GB each) (numa node 1 is disabled) swap space 8GB 1) Set /sys/kernel/mm/demotion_enabled to 0. 2) Set /proc/sys/kernel/numa_balancing to 0. 3) Run a process that allocates and random accesses 500GB of anon pages. 4) Let the process exit normally. During 3), free memory on node 0 gets lower than low watermark, and kswapd runs and depletes swap space. Then, kswapd fails consecutively and gets disabled. Allocation afterwards happens on CXL memory, so node 0 never gains more memory pressure to trigger direct reclaim. After 4), kswapd on node 0 remains disabled, and tasks running on that node are unable to swap. If you turn on NUMA_BALANCING_MEMORY_TIERING and demotion now, it won't work properly since kswapd is disabled. To mitigate this problem, reset kswapd_failures to 0 on following conditions: a) ZONE_BELOW_HIGH bit of a zone in hopeless node with a fallback memory node gets cleared. b) demotion_enabled is changed from false to true. Rationale for a): ZONE_BELOW_HIGH bit being cleared might be a sign that the node may be reclaimable afterwards. This won't help much if the memory-hungry process keeps running without freeing anything, but at least the node will go back to reclaimable state when the process exits. Rationale for b): When demotion_enabled is false, kswapd can only reclaim anon pages by swapping them out to swap space. If demotion_enabled is turned on, kswapd can demote anon pages to another node for reclaiming. So, the original failure count for determining reclaimability is no longer valid. Since kswapd_failures resets may be missed by ++ operation, it is changed from int to atomic_t. [akpm@linux-foundation.org: tweak whitespace] Link: https://lkml.kernel.org/r/aL6qGi69jWXfPc4D@pcw-MS-7D22 Signed-off-by: Chanwon Park <flyinrm@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:29 -07:00
Jan Kara	6028372689	readahead: add trace points Add a couple of trace points to make debugging readahead logic easier. [jack@suse.cz: v2] Link: https://lkml.kernel.org/r/20250909145849.5090-2-jack@suse.cz Link: https://lkml.kernel.org/r/20250908145533.31528-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:28 -07:00
Stanislav Fort	72797d218b	mm/memcg: v1: account event registrations and drop world-writable cgroup.event_control In cgroup v1, the legacy cgroup.event_control file is world-writable and allows unprivileged users to register unbounded events and thresholds. Each registration allocates kernel memory without capping or memcg charging, which can be abused to exhaust kernel memory in affected configurations. Make the following minimal changes: - Account allocations with __GFP_ACCOUNT in event and threshold registration. - Remove CFTYPE_WORLD_WRITABLE from cgroup.event_control to make it owner-writable. This does not affect cgroup v2. Allocations are still subject to kmem accounting being enabled, but this reduces unbounded global growth. Link: https://lkml.kernel.org/r/20250905093851.80596-1-disclosure@aisle.com Signed-off-by: Stanislav Fort <disclosure@aisle.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:26 -07:00
Kairui Song	f83938e418	mm, swap: use a single page for swap table when the size fits We have a cluster size of 512 slots. Each slot consumes 8 bytes in swap table so the swap table size of each cluster is exactly one page (4K). If that condition is true, allocate one page direct and disable the slab cache to reduce the memory usage of swap table and avoid fragmentation. Link: https://lkml.kernel.org/r/20250916160100.31545-16-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:25 -07:00
Kairui Song	07adc4cf1e	mm, swap: implement dynamic allocation of swap table Now swap table is cluster based, which means free clusters can free its table since no one should modify it. There could be speculative readers, like swap cache look up, protect them by making them RCU protected. All swap table should be filled with null entries before free, so such readers will either see a NULL pointer or a null filled table being lazy freed. On allocation, allocate the table when a cluster is used by any order. This way, we can reduce the memory usage of large swap device significantly. This idea to dynamically release unused swap cluster data was initially suggested by Chris Li while proposing the cluster swap allocator and it suits the swap table idea very well. Link: https://lkml.kernel.org/r/20250916160100.31545-15-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:25 -07:00
Kairui Song	685a17fbd3	mm, swap: remove contention workaround for swap cache Swap cluster setup will try to shuffle the clusters on initialization. It was helpful to avoid contention for the swap cache space. The cluster size (2M) was much smaller than each swap cache space (64M), so shuffling the cluster means the allocator will try to allocate swap slots that are in different swap cache spaces for each CPU, reducing the chance of two CPUs using the same swap cache space, and hence reducing the contention. Now, swap cache is managed by swap clusters, this shuffle is pointless. Just remove it, and clean up related macros. This also improves the HDD swap performance as shuffling IO is a bad idea for HDD, and now the shuffling is gone. Test have shown a ~40% performance gain for HDD [1]: Doing sequential swap in of 8G data using 8 processes with usemem, average of 3 test runs: Before: 1270.91 KB/s per process After: 1849.54 KB/s per process Link: https://lore.kernel.org/linux-mm/CAMgjq7AdauQ8=X0zeih2r21QoV=-WWj1hyBxLWRzq74n-C=-Ng@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20250916160100.31545-14-ryncsn@gmail.com Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:25 -07:00
Kairui Song	8b47299a41	mm, swap: mark swap address space ro and add context debug check Swap cache is now backed by swap table, and the address space is not holding any mutable data anymore. And swap cache is now protected by the swap cluster lock, instead of the XArray lock. All access to swap cache are wrapped by swap cache helpers. Locking is mostly handled internally by swap cache helpers, only a few __swap_cache_* helpers require the caller to lock the cluster by themselves. Worth noting that, unlike XArray, the cluster lock is not IRQ safe. The swap cache was very different compared to filemap, and now it's completely separated from filemap. Nothing wants to mark or change anything or do a writeback callback in IRQ. So explicitly document this and add a debug check to avoid further potential misuse. And mark the swap cache space as read-only to avoid any user wrongly mixing unexpected filemap helpers with swap cache. Link: https://lkml.kernel.org/r/20250916160100.31545-13-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:25 -07:00
Kairui Song	8578e0c00d	mm, swap: use the swap table for the swap cache and switch API Introduce basic swap table infrastructures, which are now just a fixed-sized flat array inside each swap cluster, with access wrappers. Each cluster contains a swap table of 512 entries. Each table entry is an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE), a folio type (pointer), or NULL. In this first step, it only supports storing a folio or shadow, and it is a drop-in replacement for the current swap cache. Convert all swap cache users to use the new sets of APIs. Chris Li has been suggesting using a new infrastructure for swap cache for better performance, and that idea combined well with the swap table as the new backing structure. Now the lock contention range is reduced to 2M clusters, which is much smaller than the 64M address_space. And we can also drop the multiple address_space design. All the internal works are done with swap_cache_get_* helpers. Swap cache lookup is still lock-less like before, and the helper's contexts are same with original swap cache helpers. They still require a pin on the swap device to prevent the backing data from being freed. Swap cache updates are now protected by the swap cluster lock instead of the XArray lock. This is mostly handled internally, but new __swap_cache_* helpers require the caller to lock the cluster. So, a few new cluster access and locking helpers are also introduced. A fully cluster-based unified swap table can be implemented on top of this to take care of all count tracking and synchronization work, with dynamic allocation. It should reduce the memory usage while making the performance even better. Link: https://lkml.kernel.org/r/20250916160100.31545-12-ryncsn@gmail.com Co-developed-by: Chris Li <chrisl@kernel.org> Signed-off-by: Chris Li <chrisl@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:24 -07:00
Kairui Song	094dc8b059	mm, swap: wrap swap cache replacement with a helper There are currently three swap cache users that are trying to replace an existing folio with a new one: huge memory splitting, migration, and shmem replacement. What they are doing is quite similar. Introduce a common helper for this. In later commits, this can be easily switched to use the swap table by updating this helper. The newly added helper also makes the swap cache API better defined, and make debugging easier by adding a few more debug checks. Migration and shmem replace are meant to clone the folio, including content, swap entry value, and flags. And splitting will adjust each sub folio's swap entry according to order, which could be non-uniform in the future. So document it clearly that it's the caller's responsibility to set up the new folio's swap entries and flags before calling the helper. The helper will just follow the new folio's entry value. This also prepares for replacing high-order folios in the swap cache. Currently, only splitting to order 0 is allowed for swap cache folios. Using the new helper, we can handle high-order folio splitting better. Link: https://lkml.kernel.org/r/20250916160100.31545-11-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:24 -07:00
Kairui Song	84a7a9823e	mm/shmem, swap: remove redundant error handling for replacing folio Shmem may replace a folio in the swap cache if the cached one doesn't fit the swapin's GFP zone. When doing so, shmem has already double checked that the swap cache folio is locked, still has the swap cache flag set, and contains the wanted swap entry. So it is impossible to fail due to an XArray mismatch. There is even a comment for that. Delete the defensive error handling path, and add a WARN_ON instead: if that happened, something has broken the basic principle of how the swap cache works, we should catch and fix that. Link: https://lkml.kernel.org/r/20250916160100.31545-10-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:24 -07:00
Kairui Song	fd8d4f862f	mm, swap: cleanup swap cache API and add kerneldoc In preparation for replacing the swap cache backend with the swap table, clean up and add proper kernel doc for all swap cache APIs. Now all swap cache APIs are well-defined with consistent names. No feature change, only renaming and documenting. Link: https://lkml.kernel.org/r/20250916160100.31545-9-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:23 -07:00
Kairui Song	0fcf8ef4fd	mm, swap: tidy up swap device and cluster info helpers swp_swap_info is the most commonly used helper for retrieving swap info. It has an internal check that may lead to a NULL return value, but almost none of its caller checks the return value, making the internal check pointless. In fact, most of these callers already ensured the entry is valid and never expect a NULL value. Tidy this up and improve the function names. If the caller can make sure the swap entry/type is valid and the device is pinned, use the new introduced __swap_entry_to_info/__swap_type_to_info instead. They have more debug sanity checks and lower overhead as they are inlined. Callers that may expect a NULL value should use swap_entry_to_info/swap_type_to_info instead. No feature change. The rearranged codes should have had no effect, or they should have been hitting NULL de-ref bugs already. Only some new sanity checks are added so potential issues may show up in debug build. The new helpers will be frequently used with swap table later when working with swap cache folios. A locked swap cache folio ensures the entries are valid and stable so these helpers are very helpful. Link: https://lkml.kernel.org/r/20250916160100.31545-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:23 -07:00
Kairui Song	4522aed4ff	mm, swap: rename and move some swap cluster definition and helpers No feature change, move cluster related definitions and helpers to mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock helpers, so they can be used outside of swap files. And while at it, add kerneldoc. Link: https://lkml.kernel.org/r/20250916160100.31545-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:23 -07:00
Kairui Song	ae38eb2105	mm, swap: always lock and check the swap cache folio before use Swap cache lookup only increases the reference count of the returned folio. That's not enough to ensure a folio is stable in the swap cache, so the folio could be removed from the swap cache at any time. The caller should always lock and check the folio before using it. We have just documented this in kerneldoc, now introduce a helper for swap cache folio verification with proper sanity checks. Also, sanitize a few current users to use this convention and the new helper for easier debugging. They were not having observable problems yet, only trivial issues like wasted CPU cycles on swapoff or reclaiming. They would fail in some other way, but it is still better to always follow this convention to make things robust and make later commits easier to do. Link: https://lkml.kernel.org/r/20250916160100.31545-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:23 -07:00
Kairui Song	3518b931df	mm, swap: check page poison flag after locking it Instead of checking the poison flag only in the fast swap cache lookup path, always check the poison flags after locking a swap cache folio. There are two reasons to do so. The folio is unstable and could be removed from the swap cache anytime, so it's totally possible that the folio is no longer the backing folio of a swap entry, and could be an irrelevant poisoned folio. We might mistakenly kill a faulting process. And it's totally possible or even common for the slow swap in path (swapin_readahead) to bring in a cached folio. The cache folio could be poisoned, too. Only checking the poison flag in the fast path will miss such folios. The race window is tiny, so it's very unlikely to happen, though. While at it, also add a unlikely prefix. Link: https://lkml.kernel.org/r/20250916160100.31545-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:22 -07:00
Kairui Song	a733d8de7f	mm, swap: fix swap cache index error when retrying reclaim The allocator will reclaim cached slots while scanning. Currently, it will try again if reclaim found a folio that is already removed from the swap cache due to a race. But the following lookup will be using the wrong index. It won't cause any OOB issue since the swap cache index is truncated upon lookup, but it may lead to reclaiming of an irrelevant folio. This should not cause a measurable issue, but we should fix it. Link: https://lkml.kernel.org/r/20250916160100.31545-4-ryncsn@gmail.com Fixes: `fae8595505` ("mm, swap: avoid reclaiming irrelevant swap cache") Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:22 -07:00
Kairui Song	f28124617f	mm, swap: use unified helper for swap cache look up The swap cache lookup helper swap_cache_get_folio currently does readahead updates as well, so callers that are not doing swapin from any VMA or mapping are forced to reuse filemap helpers instead, and have to access the swap cache space directly. So decouple readahead update with swap cache lookup. Move the readahead update part into a standalone helper. Let the caller call the readahead update helper if they do readahead. And convert all swap cache lookups to use swap_cache_get_folio. After this commit, there are only three special cases for accessing swap cache space now: huge memory splitting, migration, and shmem replacing, because they need to lock the XArray. The following commits will wrap their accesses to the swap cache too, with special helpers. And worth noting, currently dropbehind is not supported for anon folio, and we will never see a dropbehind folio in swap cache. The unified helper can be updated later to handle that. While at it, add proper kernedoc for touched helpers. No functional change. Link: https://lkml.kernel.org/r/20250916160100.31545-3-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: kernel test robot <oliver.sang@intel.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:22 -07:00
Alistair Popple	614d850efd	mm/memremap: remove unused get_dev_pagemap() parameter GUP no longer uses get_dev_pagemap(). As it was the only user of the get_dev_pagemap() pgmap caching feature it can be removed. Link: https://lkml.kernel.org/r/20250903225926.34702-2-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:21 -07:00
Alistair Popple	d3f7922b92	mm/gup: remove dead pgmap refcounting code Prior to commit `aed877c2b4` ("device/dax: properly refcount device dax pages when mapping") ZONE_DEVICE pages were not fully reference counted when mapped into user page tables. Instead GUP would take a reference on the associated pgmap to ensure the results of pfn_to_page() remained valid. This is no longer required and most of the code was removed by commit `fd2825b076` ("mm/gup: remove pXX_devmap usage from get_user_pages()"). Finish cleaning this up by removing the dead calls to put_dev_pagemap() and the temporary context struct. Link: https://lkml.kernel.org/r/20250903225926.34702-1-apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:21 -07:00
Wei Yang	4805ef3707	mm/page_alloc: check the correct buddy if it is a starting block find_large_buddy() search buddy based on start_pfn, which maybe different from page's pfn, e.g. when page is not pageblock aligned, because prep_move_freepages_block() always align start_pfn to pageblock. This means when we found a starting block at start_pfn, it may check on the wrong page theoretically. And not split the free page as it is supposed to, causing a freelist migratetype mismatch. The good news is the page passed to __move_freepages_block_isolate() has only two possible cases: * page is pageblock aligned * page is __first_valid_page() of this block So it is safe for the first case, and it won't get a buddy larger than pageblock for the second case. To fix the issue, check the returned pfn of find_large_buddy() to decide whether to split the free page: 1. if it is not a PageBuddy pfn, no split; 2. if it is a PageBuddy pfn but order <= pageblock_order, no split; 3. if it is a PageBuddy pfn with order > pageblock_order, start_pfn is either in the starting block or tail block, split the PageBuddy at pageblock_order level. Link: https://lkml.kernel.org/r/20250905140358.28849-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:21 -07:00
Miaohe Lin	5ce1dbfdd8	mm/hwpoison: decouple hwpoison_filter from mm/memory-failure.c mm/memory-failure.c defines and uses hwpoison_filter_* parameters but the values of those parameters can only be modified via mm/hwpoison-inject.c from userspace. They have a potentially different life time. Decouple those parameters from mm/memory-failure.c to fix this broken layering. Link: https://lkml.kernel.org/r/20250904062258.3336092-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:21 -07:00
Jinjiang Tu	0faa77afe7	filemap: optimize folio refount update in filemap_map_pages There are two meaningless folio refcount update for order0 folio in filemap_map_pages(). First, filemap_map_order0_folio() adds folio refcount after the folio is mapped to pte. And then, filemap_map_pages() drops a refcount grabbed by next_uptodate_folio(). We could remain the refcount unchanged in this case. As Matthew metenioned in [1], it is safe to call folio_unlock() before calling folio_put() here, because the folio is in page cache with refcount held, and truncation will wait for the unlock. Optimize filemap_map_folio_range() with the same method too. With this patch, we can get 8% performance gain for lmbench testcase 'lat_pagefault -P 1 file' in order0 folio case, the size of file is 512M. Link: https://lkml.kernel.org/r/20250904132737.1250368-1-tujinjiang@huawei.com Link: https://lore.kernel.org/all/aKcU-fzxeW3xT5Wv@casper.infradead.org/ [1] Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:20 -07:00
Baolin Wang	69e0a3b490	mm: shmem: fix the strategy for the tmpfs 'huge=' options After commit `acd7ccb284` ("mm: shmem: add large folio support for tmpfs"), we have extended tmpfs to allow any sized large folios, rather than just PMD-sized large folios. The strategy discussed previously was: : Considering that tmpfs already has the 'huge=' option to control the : PMD-sized large folios allocation, we can extend the 'huge=' option to : allow any sized large folios. The semantics of the 'huge=' mount option : are: : : huge=never: no any sized large folios : huge=always: any sized large folios : huge=within_size: like 'always' but respect the i_size : huge=advise: like 'always' if requested with madvise() : : Note: for tmpfs mmap() faults, due to the lack of a write size hint, still : allocate the PMD-sized huge folios if huge=always/within_size/advise is : set. : : Moreover, the 'deny' and 'force' testing options controlled by : '/sys/kernel/mm/transparent_hugepage/shmem_enabled', still retain the same : semantics. The 'deny' can disable any sized large folios for tmpfs, while : the 'force' can enable PMD sized large folios for tmpfs. This means that when tmpfs is mounted with 'huge=always' or 'huge=within_size', tmpfs will allow getting a highest order hint based on the size of write() and fallocate() paths. It will then try each allowable large order, rather than continually attempting to allocate PMD-sized large folios as before. However, this might break some user scenarios for those who want to use PMD-sized large folios, such as the i915 driver which did not supply a write size hint when allocating shmem [1]. Moreover, Hugh also complained that this will cause a regression in userspace with 'huge=always' or 'huge=within_size'. So, let's revisit the strategy for tmpfs large page allocation. A simple fix would be to always try PMD-sized large folios first, and if that fails, fall back to smaller large folios. This approach differs from the strategy for large folio allocation used by other file systems, however, tmpfs is somewhat different from other file systems, as quoted from David's opinion: : There were opinions in the past that tmpfs should just behave like any : other fs, and I think that's what we tried to satisfy here: use the write : size as an indication. : : I assume there will be workloads where either approach will be beneficial. : I also assume that workloads that use ordinary fs'es could benefit from : the same strategy (start with PMD), while others will clearly not. Link: https://lkml.kernel.org/r/10e7ac6cebe6535c137c064d5c5a235643eebb4a.1756888965.git.baolin.wang@linux.alibaba.com Link: https://lore.kernel.org/lkml/0d734549d5ed073c80b11601da3abdd5223e1889.1753689802.git.baolin.wang@linux.alibaba.com/ [1] Fixes: `acd7ccb284` ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:19 -07:00
Yueyang Pan	8147bc15b4	mm/show_mem: add trylock while printing alloc info In production, show_mem() can be called concurrently from two different entities, for example one from oom_kill_process() another from __alloc_pages_slowpath from another kthread. This patch adds a spinlock and invokes trylock before printing out the kernel alloc info in show_mem(). This way two alloc info won't interleave with each other, which then makes parsing easier. Link: https://lkml.kernel.org/r/4ed91296e0c595d945a38458f7a8d9611b0c1e52.1756897825.git.pyyjason@gmail.com Signed-off-by: Yueyang Pan <pyyjason@gmail.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:18 -07:00
Yueyang Pan	9abd8bd4c6	mm/show_mem: dump the status of the mem alloc profiling before printing This patchset fixes two issues we saw in production rollout. The first issue is that we saw all zero output of memory allocation profiling information from show_mem() if CONFIG_MEM_ALLOC_PROFILING is set and sysctl.vm.mem_profiling=0. This cause ambiguity as we don't know what 0B actually means in the output. It can mean either memory allocation profiling is temporary disabled or the allocation at that position is actually 0. Such ambiguity will make further parsing harder as we cannot differentiate between two case. The second issue is that multiple entities can call show_mem() which messed up the allocation info in dmesg. We saw outputs like this: 327 MiB 83635 mm/compaction.c:1880 func:compaction_alloc 48.4 GiB 12684937 mm/memory.c:1061 func:folio_prealloc 7.48 GiB 10899 mm/huge_memory.c:1159 func:vma_alloc_anon_folio_pmd 298 MiB 95216 kernel/fork.c:318 func:alloc_thread_stack_node 250 MiB 63901 mm/zsmalloc.c:987 func:alloc_zspage 1.42 GiB 372527 mm/memory.c:1063 func:folio_prealloc 1.17 GiB 95693 mm/slub.c:2424 func:alloc_slab_page 651 MiB 166732 mm/readahead.c:270 func:page_cache_ra_unbounded 419 MiB 107261 net/core/page_pool.c:572 func:__page_pool_alloc_pages_slow 404 MiB 103425 arch/x86/mm/pgtable.c:25 func:pte_alloc_one The above example is because one kthread invokes show_mem() from __alloc_pages_slowpath while kernel itself calls oom_kill_process() This patch (of 2): This patch prints the status of the memory allocation profiling before __show_mem actually prints the detailed allocation info. This way will let us know the `0B` we saw in allocation info is because the profiling is disabled or the allocation is actually 0B. Link: https://lkml.kernel.org/r/cover.1756897825.git.pyyjason@gmail.com Link: https://lkml.kernel.org/r/d7998ea0ddc2ea1a78bb6e89adf530526f76679a.1756897825.git.pyyjason@gmail.com Signed-off-by: Yueyang Pan <pyyjason@gmail.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:18 -07:00
Vishal Moola (Oracle)	162f6c69ea	mm/page_alloc: add kernel-docs for free_pages() Patch series "Cleanup free_pages() misuse", v3. free_pages() is supposed to be called when we only have a virtual address. __free_pages() is supposed to be called when we have a page. There are a number of callers that use page_address() to get a page's virtual address then call free_pages() on it when they should just call __free_pages() directly. Add kernel-docs for free_pages() to help callers better understand which function they should be calling, and replace the obvious cases of misuse. This patch (of 7): Add kernel-docs to free_pages(). This will help callers understand when to use it instead of __free_pages(). Link: https://lkml.kernel.org/r/20250903185921.1785167-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20250903185921.1785167-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:16 -07:00
Youling Tang	9fd53c8122	mm/filemap: align last_index to folio size On XFS systems with pagesize=4K, blocksize=16K, and CONFIG_TRANSPARENT_HUGEPAGE enabled, We observed the following readahead behaviors: # echo 3 > /proc/sys/vm/drop_caches # dd if=test of=/dev/null bs=64k count=1 # ./tools/mm/page-types -r -L -f /mnt/xfs/test foffset offset flags 0 136d4c __RU_l_________H______t_________________F_1 1 136d4d __RU_l__________T_____t_________________F_1 2 136d4e __RU_l__________T_____t_________________F_1 3 136d4f __RU_l__________T_____t_________________F_1 ... c 136bb8 __RU_l_________H______t_________________F_1 d 136bb9 __RU_l__________T_____t_________________F_1 e 136bba __RU_l__________T_____t_________________F_1 f 136bbb __RU_l__________T_____t_________________F_1 <-- first read 10 13c2cc ___U_l_________H______t______________I__F_1 <-- readahead flag 11 13c2cd ___U_l__________T_____t______________I__F_1 12 13c2ce ___U_l__________T_____t______________I__F_1 13 13c2cf ___U_l__________T_____t______________I__F_1 ... 1c 1405d4 ___U_l_________H______t_________________F_1 1d 1405d5 ___U_l__________T_____t_________________F_1 1e 1405d6 ___U_l__________T_____t_________________F_1 1f 1405d7 ___U_l__________T_____t_________________F_1 [ra_size = 32, req_count = 16, async_size = 16] # echo 3 > /proc/sys/vm/drop_caches # dd if=test of=/dev/null bs=60k count=1 # ./page-types -r -L -f /mnt/xfs/test foffset offset flags 0 136048 __RU_l_________H______t_________________F_1 ... c 110a40 __RU_l_________H______t_________________F_1 d 110a41 __RU_l__________T_____t_________________F_1 e 110a42 __RU_l__________T_____t_________________F_1 <-- first read f 110a43 __RU_l__________T_____t_________________F_1 <-- first readahead flag 10 13e7a8 ___U_l_________H______t_________________F_1 ... 20 137a00 ___U_l_________H______t_______P______I__F_1 <-- second readahead flag (20 - 2f) 21 137a01 ___U_l__________T_____t_______P______I__F_1 ... 3f 10d4af ___U_l__________T_____t_______P_________F_1 [first readahead: ra_size = 32, req_count = 15, async_size = 17] When reading 64k data (same for 61-63k range, where last_index is page-aligned in filemap_get_pages()), 128k readahead is triggered via page_cache_sync_ra() and the PG_readahead flag is set on the next folio (the one containing 0x10 page). When reading 60k data, 128k readahead is also triggered via page_cache_sync_ra(). However, in this case the readahead flag is set on the 0xf page. Although the requested read size (req_count) is 60k, the actual read will be aligned to folio size (64k), which triggers the readahead flag and initiates asynchronous readahead via page_cache_async_ra(). This results in two readahead operations totaling 256k. The root cause is that when the requested size is smaller than the actual read size (due to folio alignment), it triggers asynchronous readahead. By changing last_index alignment from page size to folio size, we ensure the requested size matches the actual read size, preventing the case where a single read operation triggers two readahead operations. After applying the patch: # echo 3 > /proc/sys/vm/drop_caches # dd if=test of=/dev/null bs=60k count=1 # ./page-types -r -L -f /mnt/xfs/test foffset offset flags 0 136d4c __RU_l_________H______t_________________F_1 1 136d4d __RU_l__________T_____t_________________F_1 2 136d4e __RU_l__________T_____t_________________F_1 3 136d4f __RU_l__________T_____t_________________F_1 ... c 136bb8 __RU_l_________H______t_________________F_1 d 136bb9 __RU_l__________T_____t_________________F_1 e 136bba __RU_l__________T_____t_________________F_1 <-- first read f 136bbb __RU_l__________T_____t_________________F_1 10 13c2cc ___U_l_________H______t______________I__F_1 <-- readahead flag 11 13c2cd ___U_l__________T_____t______________I__F_1 12 13c2ce ___U_l__________T_____t______________I__F_1 13 13c2cf ___U_l__________T_____t______________I__F_1 ... 1c 1405d4 ___U_l_________H______t_________________F_1 1d 1405d5 ___U_l__________T_____t_________________F_1 1e 1405d6 ___U_l__________T_____t_________________F_1 1f 1405d7 ___U_l__________T_____t_________________F_1 [ra_size = 32, req_count = 16, async_size = 16] The same phenomenon will occur when reading from 49k to 64k. Set the readahead flag to the next folio. Because the minimum order of folio in address_space equals the block size (at least in xfs and bcachefs that already support bs > ps), having request_count aligned to block size will not cause overread. [klarasmodin@gmail.com: fix overflow on 32-bit] Link: https://lkml.kernel.org/r/yru7qf5gvyzccq5ohhpylvxug5lr5tf54omspbjh4sm6pcdb2r@fpjgj2pxw7va [akpm@linux-foundation.org: update it for Max's constification efforts] Link: https://lkml.kernel.org/r/20250711055509.91587-1-youling.tang@linux.dev Co-developed-by: Chi Zhiling <chizhiling@kylinos.cn> Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Klara Modin <klarasmodin@gmail.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Youling Tang <youling.tang@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:15 -07:00
Max Kellermann	a847b17009	mm: constify highmem related functions for improved const-correctness Lots of functions in mm/highmem.c do not write to the given pointers and do not call functions that take non-const pointers and can therefore be constified. This includes functions like kunmap() which might be implemented in a way that writes to the pointer (e.g. to update reference counters or mapping fields), but currently are not. kmap() on the other hand cannot be made const because it calls set_page_address() which is non-const in some architectures/configurations. [akpm@linux-foundation.org: "fix" folio_page() build failure] Link: https://lkml.kernel.org/r/20250901205021.3573313-13-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:15 -07:00
Max Kellermann	a955cca372	mm: constify arch_pick_mmap_layout() for improved const-correctness This function only reads from the rlimit pointer (but writes to the mm_struct pointer which is kept without `const`). All callees are already const-ified or (internal functions) are being constified by this patch. Link: https://lkml.kernel.org/r/20250901205021.3573313-9-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:14 -07:00
Max Kellermann	0bf25cfc9e	mm, s390: constify mapping related test/getter functions For improved const-correctness. We select certain test functions which either invoke each other, functions that are already const-ified, or no further functions. It is therefore relatively trivial to const-ify them, which provides a basis for further const-ification further up the call stack. (Even though seemingly unrelated, this also constifies the pointer parameter of mmap_is_legacy() in arch/s390/mm/mmap.c because a copy of the function exists in mm/util.c.) Link: https://lkml.kernel.org/r/20250901205021.3573313-7-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:13 -07:00
Max Kellermann	4680092f8c	mm: constify process_shares_mm() for improved const-correctness This function only reads from the pointer arguments. Local (loop) variables are also annotated with `const` to clarify that these will not be written to. Link: https://lkml.kernel.org/r/20250901205021.3573313-6-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:13 -07:00
Max Kellermann	8eccb066f2	mm: constify shmem related test functions for improved const-correctness Patch series "mm: establish const-correctness for pointer parameters", v6. This series is to improved const-correctness in the low-level memory-management subsystem, which provides a basis for further constification further up the call stack (e.g. filesystems). I started this work when I tried to constify the Ceph filesystem code, but found that to be impossible because many "mm" functions accept non-const pointers, even though they modify nothing. This patch (of 12): We select certain test functions which either invoke each other, functions that are already const-ified, or no further functions. It is therefore relatively trivial to const-ify them, which provides a basis for further const-ification further up the call stack. Link: https://lkml.kernel.org/r/20250901205021.3573313-1-max.kellermann@ionos.com Link: https://lkml.kernel.org/r/20250901205021.3573313-2-max.kellermann@ionos.com Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Zankel <chris@zankel.net> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: Jan Kara <jack@suse.cz> Cc: Jocelyn Falempe <jfalempe@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Nysal Jan K.A" <nysal@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russel King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:12 -07:00
Kefeng Wang	4fe2a8107f	mm: hugeltb: check NUMA_NO_NODE in only_alloc_fresh_hugetlb_folio() Move the NUMA_NO_NODE check out of buddy and gigantic folio allocation to cleanup code a bit, also this will avoid NUMA_NO_NODE passed as 'nid' to node_isset() in alloc_buddy_hugetlb_folio(). Link: https://lkml.kernel.org/r/20250910133958.301467-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:12 -07:00
Kefeng Wang	dd4d324bc0	mm: hugetlb: remove struct hstate from init_new_hugetlb_folio() The struct hstate is never used since commit `d67e32f267` ("hugetlb: restructure pool allocations”), remove it. Link: https://lkml.kernel.org/r/20250910133958.301467-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:12 -07:00
Kefeng Wang	4a25f995bd	mm: hugetlb: directly pass order when allocate a hugetlb folio Use order instead of struct hstate to remove huge_page_order() call from all hugetlb folio allocation, also order_is_gigantic() is added to check whether it is a gigantic order. Link: https://lkml.kernel.org/r/20250910133958.301467-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:11 -07:00
Kefeng Wang	4094d3434b	mm: hugetlb: convert to account_new_hugetlb_folio() In order to avoid the wrong nid passed into the account, and we did make such mistake before, so it's better to move folio_nid() into account_new_hugetlb_folio(). Link: https://lkml.kernel.org/r/20250910133958.301467-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:11 -07:00
Kefeng Wang	902020f027	mm: hugetlb: convert to use more alloc_fresh_hugetlb_folio() Patch series "mm: hugetlb: cleanup hugetlb folio allocation", v3. Some cleanups for hugetlb folio allocation. This patch (of 3): Simplify alloc_fresh_hugetlb_folio() and convert more functions to use it, which help us to remove prep_new_hugetlb_folio() and __prep_new_hugetlb_folio(). Link: https://lkml.kernel.org/r/20250910133958.301467-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20250910133958.301467-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:11 -07:00
Thadeu Lima de Souza Cascardo	0c83e7faa8	mm: show_mem: show number of zspages in show_free_areas When OOM is triggered, it will show where the pages might be for each zone. When using zram or zswap, it might look like lots of pages are missing. After this patch, zspages are shown as below. [ 48.792859] Node 0 DMA free:2812kB boost:0kB min:60kB low:72kB high:84kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB zspages:11160kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 48.792962] lowmem_reserve[]: 0 956 956 956 956 [ 48.792988] Node 0 DMA32 free:3512kB boost:0kB min:3912kB low:4888kB high:5864kB reserved_highatomic:0KB free_highatomic:0KB active_anon:0kB inactive_anon:28kB active_file:8kB inactive_file:16kB unevictable:0kB writepending:0kB zspages:916780kB present:1032064kB managed:978944kB mlocked:0kB bounce:0kB free_pcp:500kB local_pcp:248kB free_cma:0kB [ 48.793118] lowmem_reserve[]: 0 0 0 0 0 Link: https://lkml.kernel.org/r/20250902-show_mem_zspages-v2-1-545daaa8b410@igalia.com Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:11 -07:00
Li RongQing	2a8f3f44f5	mm/hugetlb: retry to allocate for early boot hugepage allocation In cloud environments with massive hugepage reservations (95%+ of system RAM), single-attempt allocation during early boot often fails due to memory pressure. Commit `91f386bf07` ("hugetlb: batch freeing of vmemmap pages") intensified this by deferring page frees, increase peak memory usage during allocation. Introduce a retry mechanism that leverages vmemmap optimization reclaim (~1.6% memory) when available. Upon initial allocation failure, the system retries until successful or no further progress is made, ensuring reliable hugepage allocation while preserving batched vmemmap freeing benefits. Testing on a 256G machine allocating 252G of hugepages: Before: 128056/129024 hugepages allocated After: Successfully allocated all 129024 hugepages Link: https://lkml.kernel.org/r/20250901082052.3247-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:10 -07:00
Yeoreum Yun	2b79cb3eac	kasan: apply write-only mode in kasan kunit testcases When KASAN is configured in write-only mode, fetch/load operations do not trigger tag check faults. As a result, the outcome of some test cases may differ compared to when KASAN is configured without write-only mode. Therefore, by modifying pre-exist testcases check the write only makes tag check fault (TCF) where writing is perform in "allocated memory" but tag is invalid (i.e) redzone write in atomic_set() testcases. Otherwise check the invalid fetch/read doesn't generate TCF. Also, skip some testcases affected by initial value (i.e) atomic_cmpxchg() testcase maybe successd if it passes valid atomic_t address and invalid oldaval address. In this case, if invalid atomic_t doesn't have the same oldval, it won't trigger write operation so the test will pass. Link: https://lkml.kernel.org/r/20250916222755.466009-3-yeoreum.yun@arm.com Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Breno Leitao <leitao@debian.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: D Scott Phillips <scott@os.amperecomputing.com> Cc: Hardevsinh Palaniya <hardevsinh.palaniya@siliconsignals.io> Cc: James Morse <james.morse@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:10 -07:00
Yeoreum Yun	31d8edb535	kasan/hw-tags: introduce kasan.write_only option Patch series "introduce kasan.write_only option in hw-tags", v8. Hardware tag based KASAN is implemented using the Memory Tagging Extension (MTE) feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI (Top Byte Ignore) feature and allows software to access a 4-bit allocation tag for each 16-byte granule in the physical address space. A logical tag is derived from bits 59-56 of the virtual address used for the memory access. A CPU with MTE enabled will compare the logical tag against the allocation tag and potentially raise an tag check fault on mismatch, subject to system registers configuration. Since ARMv8.9, FEAT_MTE_STORE_ONLY can be used to restrict raise of tag check fault on store operation only. Using this feature (FEAT_MTE_STORE_ONLY), introduce KASAN write-only mode which restricts KASAN check write (store) operation only. This mode omits KASAN check for read (fetch/load) operation. Therefore, it might be used not only debugging purpose but also in normal environment. This patch (of 2): Since Armv8.9, FEATURE_MTE_STORE_ONLY feature is introduced to restrict raise of tag check fault on store operation only. Introduce KASAN write only mode based on this feature. KASAN write only mode restricts KASAN checks operation for write only and omits the checks for fetch/read operations when accessing memory. So it might be used not only debugging enviroment but also normal enviroment to check memory safty. This features can be controlled with "kasan.write_only" arguments. When "kasan.write_only=on", KASAN checks write operation only otherwise KASAN checks all operations. This changes the MTE_STORE_ONLY feature as BOOT_CPU_FEATURE like ARM64_MTE_ASYMM so that makes it initialise in kasan_init_hw_tags() with other function together. Link: https://lkml.kernel.org/r/20250916222755.466009-1-yeoreum.yun@arm.com Link: https://lkml.kernel.org/r/20250916222755.466009-2-yeoreum.yun@arm.com Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Breno Leitao <leitao@debian.org> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: D Scott Phillips <scott@os.amperecomputing.com> Cc: Hardevsinh Palaniya <hardevsinh.palaniya@siliconsignals.io> Cc: James Morse <james.morse@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: levi.yun <yeoreum.yun@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:10 -07:00
David Hildenbrand	56531761d4	kfence: drop nth_page() usage We want to get rid of nth_page(), and kfence init code is the last user. Unfortunately, we might actually walk a PFN range where the pages are not contiguous, because we might be allocating an area from memblock that could span memory sections in problematic kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP). We could check whether the page range is contiguous using page_range_contiguous() and failing kfence init, or making kfence incompatible these problemtic kernel configs. Let's keep it simple and simply use pfn_to_page() by iterating PFNs. Link: https://lkml.kernel.org/r/20250901150359.867252-36-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:09 -07:00
David Hildenbrand	b5ba761a7f	mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() There is the concern that unpin_user_page_range_dirty_lock() might do some weird merging of PFN ranges -- either now or in the future -- such that PFN range is contiguous but the page range might not be. Let's sanity-check for that and drop the nth_page() usage. Link: https://lkml.kernel.org/r/20250901150359.867252-35-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:09 -07:00
David Hildenbrand	80e7bb74d4	scatterlist: disallow non-contigous page ranges in a single SG entry The expectation is that there is currently no user that would pass in non-contigous page ranges: no allocator, not even VMA, will hand these out. The only problematic part would be if someone would provide a range obtained directly from memblock, or manually merge problematic ranges. If we find such cases, we should fix them to create separate SG entries. Let's check in sg_set_page() that this is really the case. No need to check in sg_set_folio(), as pages in a folio are guaranteed to be contiguous. As sg_set_page() gets inlined into modules, we have to export the page_range_contiguous() helper -- use EXPORT_SYMBOL, there is nothing special about this helper such that we would want to enforce GPL-only modules. We can now drop the nth_page() usage in sg_page_iter_page(). Link: https://lkml.kernel.org/r/20250901150359.867252-25-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:06 -07:00
David Hildenbrand	6972706f95	mm/cma: refuse handing out non-contiguous page ranges Let's disallow handing out PFN ranges with non-contiguous pages, so we can remove the nth-page usage in __cma_alloc(), and so any callers don't have to worry about that either when wanting to blindly iterate pages. This is really only a problem in configs with SPARSEMEM but without SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some cases. Will this cause harm? Probably not, because it's mostly 32bit that does not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could look into allocating the memmap for the memory sections spanned by a single CMA region in one go from memblock. [david@redhat.com: we can have NUMMU configs with SPARSEMEM enabled] Link: https://lkml.kernel.org/r/6ec933b1-b3f7-41c0-95d8-e518bb87375e@redhat.com Link: https://lkml.kernel.org/r/20250901150359.867252-23-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:06 -07:00
David Hildenbrand	e3c05b6e37	mm/gup: remove record_subpages() We can just cleanup the code by calculating the #refs earlier, so we can just inline what remains of record_subpages(). Calculate the number of references/pages ahead of times, and record them only once all our tests passed. [david@redhat.com: fix `pages' adjustment] Link: https://lkml.kernel.org/r/cc7f03f8-da8b-407e-a03a-e8e5a9ec5462@redhat.com Link: https://lkml.kernel.org/r/20250901150359.867252-20-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:05 -07:00
David Hildenbrand	541541dbfe	mm/gup: drop nth_page() usage within folio when recording subpages nth_page() is no longer required when iterating over pages within a single folio, so let's just drop it when recording subpages. Link: https://lkml.kernel.org/r/20250901150359.867252-19-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:05 -07:00

1 2 3 4 5 ...

25161 Commits