Files
linux/Documentation/admin-guide/mm/index.rst
Baoquan He 8e689f8ea4 mm/swap: do not choose swap device according to numa node
Patch series "mm/swapfile.c: select swap devices of default priority round
robin", v5.

Currently, on a system with multiple swap devices, swap allocation
selects one swap device according to priority.  The swap device with the
highest priority is chosen to allocate from first.

A priority from 0 to 32767 can be specified when swapping on a device;
otherwise the system assigns default priorities starting at -2 and
counting downwards.  Meanwhile, on a NUMA system, the swap device
associated with a node_id is considered first on the NUMA node with that
node_id.

In the current code, an array of plists, swap_avail_heads[nid], is used
to organize swap devices per NUMA node: for each NUMA node there is one
plist holding all swap devices.  The 'prio' value in the plist is the
negated value of the device's priority, because a plist is sorted from
low to high.  The swap device owning a given node_id is promoted to the
front position on that NUMA node; the other swap devices follow in order
of their default priority.

E.g. I have a system with 8 NUMA nodes, and I set up 4 zram partitions
as swap devices.

Current behaviour:
Their priorities will be (note that -1 is skipped):
NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  16G   0B   -2
/dev/zram1 partition  16G   0B   -3
/dev/zram2 partition  16G   0B   -4
/dev/zram3 partition  16G   0B   -5

And their positions in the 8 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:1     prio:3     prio:4     prio:5
swap_avail_lists[1]: /* node 1's available swap device list */
zram1   -> zram0   -> zram2   -> zram3
prio:1     prio:2     prio:4     prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
zram2   -> zram0   -> zram1   -> zram3
prio:1     prio:2     prio:3     prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
zram3   -> zram0   -> zram1   -> zram2
prio:1     prio:2     prio:3     prio:4
swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:2     prio:3     prio:4     prio:5

The adjustment for the swap device sharing a node_id was intended to
decrease lock contention on a single swap device by using different swap
devices on different nodes.  It was introduced in commit a2468cc9bf
("swap: choose swap device according to numa node").  However, the
adjustment is a little coarse-grained.  On a node, the swap device
sharing the node's id is always selected first by that node's CPUs until
it is exhausted, then the next one.  And on nodes where no swap device
shares the node id, the swap device with priority '-2' is selected first
until exhausted, then the one with priority '-3'.
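
For illustration only, below is a minimal userspace model of the ordering
and selection described above.  It is not kernel code: the device names,
default priorities and node count are taken from the example listing, and
plist_key()/pick_device() are hypothetical helpers.  It shows that five of
the eight nodes try zram0 first, which is why zram0 fills up first in the
swapon output below.

/*
 * Userspace model (not kernel code) of the old behaviour: per-node
 * ordering with negated priorities plus node-id promotion, and
 * "head of the list first, until exhausted" selection.
 */
#include <stdio.h>

#define NR_NODES 8
#define NR_DEVS  4

struct dev {
	const char *name;
	int node_id;	/* node the zram device "belongs" to */
	int prio;	/* default priority assigned at swapon time */
	long free_mb;	/* remaining space, for the selection model */
};

static struct dev devs[NR_DEVS] = {
	{ "zram0", 0, -2, 16384 }, { "zram1", 1, -3, 16384 },
	{ "zram2", 2, -4, 16384 }, { "zram3", 3, -5, 16384 },
};

/* Effective plist key on a given node: 1 for the device owning the node,
 * otherwise the negated default priority (lower key sorts first). */
static int plist_key(const struct dev *d, int nid)
{
	return d->node_id == nid ? 1 : -d->prio;
}

/* Old selection: first device in the node's ordered list with free space. */
static struct dev *pick_device(int nid)
{
	struct dev *best = NULL;

	for (int i = 0; i < NR_DEVS; i++) {
		if (devs[i].free_mb <= 0)
			continue;
		if (!best || plist_key(&devs[i], nid) < plist_key(best, nid))
			best = &devs[i];
	}
	return best;	/* NULL once every swap device is full */
}

int main(void)
{
	for (int nid = 0; nid < NR_NODES; nid++)
		printf("node %d allocates from %s first\n",
		       nid, pick_device(nid)->name);
	return 0;
}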

This is the swapon output while a high-pressure vm-scalability test is
running.  It clearly shows that zram0 is heavily used until it is
exhausted.

===================================
[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 15.7G   -2
/dev/zram1 partition  16G  3.4G   -3
/dev/zram2 partition  16G  3.4G   -4
/dev/zram3 partition  16G  2.6G   -5

The node-based strategy for selecting a swap device is much better than
the old way of selecting swap devices one by one.  However, it is still
unreasonable, because swap devices are assumed to have similar access
speed when no priority is specified at swapon time.  It is unfair, and
doesn't make sense, that one swap device gets a higher priority than
another merely because it was swapped on earlier.

So in this patchset, swap devices with default priority are selected
round robin.  In code, the plist array swap_avail_heads[nid] is replaced
with a single plist swap_avail_head, which reverts commit a2468cc9bf.
On top of the revert, a further change makes any device without a
specified priority get the same default priority '-1'.  Swap devices
with an explicitly specified priority are still put foremost; this is
not impacted.  If you care about the devices' different access speeds,
use 'swapon -p xx' to assign priorities to your swap devices.

New behaviour:

swap_avail_list: /* one global available swap device list */
zram0   -> zram1   -> zram2   -> zram3
prio:1     prio:1     prio:1     prio:1
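
For illustration only, here is a minimal userspace model of the
round-robin rotation among equal-priority devices.  It is not kernel
code; in the kernel the analogous rotation is presumably achieved by
requeueing a used device behind its equal-priority peers on the single
swap_avail_head plist (e.g. via plist_requeue()).

/*
 * Userspace model of round robin over equal-priority swap devices:
 * after a device hands out space, it is rotated behind the other
 * entries sharing its priority, so the next request goes to the
 * next device.
 */
#include <stdio.h>

#define NR_DEVS 4

int main(void)
{
	const char *devs[NR_DEVS] = { "zram0", "zram1", "zram2", "zram3" };
	int order[NR_DEVS] = { 0, 1, 2, 3 };	/* all at default prio -1 */

	for (int req = 0; req < 8; req++) {
		int picked = order[0];

		printf("request %d -> %s\n", req, devs[picked]);

		/* "Requeue": move the picked device to the tail of its
		 * equal-priority run (here, the whole list). */
		for (int i = 0; i < NR_DEVS - 1; i++)
			order[i] = order[i + 1];
		order[NR_DEVS - 1] = picked;
	}
	return 0;
}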

This is the swapon output while the high-pressure vm-scalability test is
running; all devices are selected round robin:
=======================================
[root@hp-dl385g10-03 linux]# swapon
NAME       TYPE      SIZE  USED PRIO
/dev/zram0 partition  16G 12.6G   -1
/dev/zram1 partition  16G 12.6G   -1
/dev/zram2 partition  16G 12.6G   -1
/dev/zram3 partition  16G 12.6G   -1

With the change, we can see about an 18% efficiency improvement, as
shown below:

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
                           Before:          After:
System time:               637.92 s         526.74 s      (lower is better)
Sum Throughput:            3546.56 MB/s     4207.56 MB/s  (higher is better)
Single process Throughput: 114.40 MB/s      135.72 MB/s   (higher is better)
free latency:              10138455.99 us   6810119.01 us (lower is better)
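
(For reference: the ~18% figure matches the sum throughput gain,
4207.56 / 3546.56 ≈ 1.186, i.e. about 18.6% higher; system time drops by
a similar ~17.4%.)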


This patch (of 2):

This reverts commit a2468cc9bf ("swap: choose swap device according to
numa node").

After this patch, the behaviour changes back to the
pre-commit-a2468cc9bf state.  That means default priorities are assigned
from -1 downwards, and when swapping, swap devices are exhausted one by
one according to priority from high to low.  This is preparation work
for the later change.

[root@hp-dl385g10-03 ~]# swapon
NAME       TYPE      SIZE   USED PRIO
/dev/zram0 partition  16G    16G   -1
/dev/zram1 partition  16G 966.2M   -2
/dev/zram2 partition  16G     0B   -3
/dev/zram3 partition  16G     0B   -4

Link: https://lkml.kernel.org/r/20251028034308.929550-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20251028034308.929550-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16 17:28:27 -08:00


=================
Memory Management
=================

Linux memory management subsystem is responsible, as the name implies,
for managing the memory in the system. This includes implementation of
virtual memory and demand paging, memory allocation both for kernel
internal structures and user space programs, mapping of files into
processes address space and many other cool things.

Linux memory management is a complex system with many configurable
settings. Most of these settings are available via ``/proc``
filesystem and can be queried and adjusted using ``sysctl``. These APIs
are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_.

.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html

Linux memory management has its own jargon and if you are not yet
familiar with it, consider reading Documentation/admin-guide/mm/concepts.rst.

Here we document in detail how to interact with various mechanisms in
the Linux memory management.

.. toctree::
   :maxdepth: 1

   concepts
   cma_debugfs
   damon/index
   hugetlbpage
   idle_page_tracking
   ksm
   memory-hotplug
   multigen_lru
   nommu-mmap
   numa_memory_policy
   numaperf
   pagemap
   shrinker_debugfs
   slab
   soft-dirty
   transhuge
   userfaultfd
   zswap
   kho