mirror of
https://github.com/torvalds/linux.git
synced 2025-12-07 20:06:24 +00:00
NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
This document details the NFSD IO modes that are configurable using
NFSD's experimental debugfs interfaces:

  /sys/kernel/debug/nfsd/io_cache_read
  /sys/kernel/debug/nfsd/io_cache_write

This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's
debugfs interfaces are replaced with per-export controls).

Future updates will provide more specific guidance and howto
information to help others use and evaluate NFSD's IO modes:
BUFFERED, DONTCACHE and DIRECT.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
committed by
Chuck Lever
parent
06c5c97293
commit
fa8d4e6784
144
Documentation/filesystems/nfs/nfsd-io-modes.rst
Normal file
@@ -0,0 +1,144 @@
.. SPDX-License-Identifier: GPL-2.0

=============
NFSD IO MODES
=============

Overview
========

NFSD has always used buffered IO when servicing READ and WRITE
operations. BUFFERED is NFSD's default IO mode, but it is possible
to override that default to use either DONTCACHE or DIRECT IO modes.

Experimental NFSD debugfs interfaces are available to allow the NFSD IO
mode used for READ and WRITE to be configured independently. See both:

- /sys/kernel/debug/nfsd/io_cache_read
- /sys/kernel/debug/nfsd/io_cache_write

The default value for both io_cache_read and io_cache_write reflects
NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).

Based on the configured settings, NFSD's IO will either be:

- cached using page cache (NFSD_IO_BUFFERED=0)
- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
- not cached, stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)

To set an NFSD IO mode, write a supported value (0 - 2) to the
corresponding IO operation's debugfs interface, e.g.::

  echo 2 > /sys/kernel/debug/nfsd/io_cache_read
  echo 2 > /sys/kernel/debug/nfsd/io_cache_write

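When scripting experiments it can be less error-prone to select a mode by
name rather than by number. The following is a minimal sketch, not part of
NFSD: the function names and the NFSD_DEBUGFS_DIR override are hypothetical
and exist only so the helper can be pointed at a scratch directory. The
name-to-value mapping follows the values listed above.

```shell
# Hypothetical helper: map an IO mode name to the numeric value
# documented above (BUFFERED=0, DONTCACHE=1, DIRECT=2).
nfsd_io_mode_value() {
    case "$1" in
        BUFFERED|buffered)   echo 0 ;;
        DONTCACHE|dontcache) echo 1 ;;
        DIRECT|direct)       echo 2 ;;
        *) echo "unknown NFSD IO mode: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: nfsd_set_io_mode read|write MODE_NAME
# NFSD_DEBUGFS_DIR is an assumption of this sketch; it defaults to the
# debugfs directory named in this document.
nfsd_set_io_mode() {
    op=$1
    value=$(nfsd_io_mode_value "$2") || return 1
    dir=${NFSD_DEBUGFS_DIR:-/sys/kernel/debug/nfsd}
    case "$op" in
        read|write) ;;
        *) echo "operation must be 'read' or 'write'" >&2; return 1 ;;
    esac
    file="$dir/io_cache_$op"
    if [ ! -e "$file" ]; then
        echo "$file not found (is debugfs mounted and nfsd loaded?)" >&2
        return 1
    fi
    echo "$value" > "$file"
}
```

For example, ``nfsd_set_io_mode write DIRECT`` (run as root) would be
equivalent to the second echo command above.
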
To check which IO mode NFSD is using for READ or WRITE, simply read the
corresponding IO operation's debugfs interface, e.g.::

  cat /sys/kernel/debug/nfsd/io_cache_read
  cat /sys/kernel/debug/nfsd/io_cache_write

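The raw numbers can be decoded back into mode names when logging test
configurations. This is a small sketch under the assumption that each
debugfs file reads back as a plain decimal number; the function names and
the NFSD_DEBUGFS_DIR override are hypothetical. Values outside 0 - 2 are
reported as unknown rather than guessed.

```shell
# Hypothetical helper: decode a numeric IO mode into its name.
nfsd_io_mode_name() {
    case "$1" in
        0) echo BUFFERED ;;
        1) echo DONTCACHE ;;
        2) echo DIRECT ;;
        *) echo "UNKNOWN($1)" ;;
    esac
}

# Hypothetical helper: print the current mode for READ and WRITE.
# NFSD_DEBUGFS_DIR is an assumption of this sketch, used for testing.
nfsd_show_io_modes() {
    dir=${NFSD_DEBUGFS_DIR:-/sys/kernel/debug/nfsd}
    for op in read write; do
        file="$dir/io_cache_$op"
        if [ -r "$file" ]; then
            echo "$op: $(nfsd_io_mode_name "$(cat "$file")")"
        else
            echo "$op: unavailable"
        fi
    done
}
```

Recording the output of ``nfsd_show_io_modes`` alongside each benchmark run
makes it easier to match results to the mode that produced them.
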
If you experiment with NFSD's IO modes on a recent kernel and have
interesting results, please report them to linux-nfs@vger.kernel.org.

NFSD DONTCACHE
==============

DONTCACHE offers a hybrid approach to servicing IO that aims to offer
the benefits of using DIRECT IO without any of the strict alignment
requirements that DIRECT IO imposes. To achieve this, buffered IO is used
but the IO is flagged to "drop behind" (meaning associated pages are
dropped from the page cache) when IO completes.

DONTCACHE aims to avoid what has proven to be a fairly significant
limitation of Linux's memory management subsystem if/when large amounts
of data are infrequently accessed (e.g. read once _or_ written once but
not read until much later). Such use-cases are particularly problematic
because the page cache will eventually become a bottleneck to servicing
new IO requests.

For more context on DONTCACHE, please see these Linux commit headers:

- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
  to take a struct kiocb")
- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
  RWF_DONTCACHE")
- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")

NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
filesystem doesn't indicate support by setting FOP_DONTCACHE.

NFSD DIRECT
===========

DIRECT IO doesn't make use of the page cache; as such, it is able to
avoid Linux memory management's page reclaim scalability problems
without resorting to the hybrid use of page cache that DONTCACHE does.

Some workloads benefit from NFSD avoiding the page cache, particularly
those with a working set that is significantly larger than available
system memory. The pathological worst-case workload that NFSD DIRECT has
proven to help most is: NFS client issuing large sequential IO to a file
that is 2-3 times larger than the NFS server's available system memory.
The reason for such improvement is that NFSD DIRECT eliminates a lot of
work that the memory management subsystem would otherwise be required to
perform (e.g. page allocation, dirty writeback, page reclaim). When
using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
time trying to find adequate free pages so that forward IO progress can
be made.

The performance win associated with using NFSD DIRECT was previously
discussed on linux-nfs, see:
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/

But in summary:

- NFSD DIRECT can significantly reduce memory requirements
- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
- NFSD DIRECT can offer more deterministic IO performance

As always, your mileage may vary and so it is important to carefully
consider if/when it is beneficial to make use of NFSD DIRECT. When
assessing comparative performance of your workload, please be sure to log
relevant performance metrics during testing (e.g. memory usage, CPU
usage, IO performance). Using perf to collect perf data that may be used
to generate a "flamegraph" for work Linux must perform on behalf of your
test is a particularly meaningful way to compare the relative health of
the system and how switching NFSD's IO mode changes what is observed.

If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
NFSD's debugfs interfaces, ideally the IO will be aligned relative to
the underlying block device's logical_block_size. Also, the memory buffer
used to store the READ or WRITE payload must be aligned relative to the
underlying block device's dma_alignment.

But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
it can:

Misaligned READ:
    If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
    DIO-aligned block (on either end of the READ). The expanded READ is
    verified to have proper offset/len (logical_block_size) and
    dma_alignment checking.

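The READ expansion described above amounts to rounding the start of the
request down and the end of the request up to a block boundary. The sketch
below only illustrates that arithmetic; it is not NFSD's code. The function
name is hypothetical, the block size is passed in by the caller as a
stand-in for logical_block_size, and dma_alignment of the memory buffer is
not modeled.

```shell
# Hypothetical helper: expand_read OFFSET LEN BLOCK_SIZE
# Prints "START LENGTH" for the smallest block-aligned byte range that
# covers the original range [OFFSET, OFFSET + LEN).
expand_read() {
    offset=$1
    len=$2
    bs=$3
    # Round the start down to a block boundary.
    start=$(( offset / bs * bs ))
    # Round the end up to a block boundary.
    end=$(( (offset + len + bs - 1) / bs * bs ))
    echo "$start $(( end - start ))"
}
```

For instance, ``expand_read 1000 3000 512`` prints ``512 3584``: the READ of
bytes 1000-3999 grows to cover bytes 512-4095, and the extra bytes at either
end are not part of the data the client asked for.
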
Misaligned WRITE:
    If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
    middle and end as needed. The large middle segment is DIO-aligned
    and the start and/or end are misaligned. Buffered IO is used for the
    misaligned segments and O_DIRECT is used for the middle DIO-aligned
    segment. DONTCACHE buffered IO is _not_ used for the misaligned
    segments because using normal buffered IO offers a significant
    read-modify-write (RMW) performance benefit when handling streaming
    misaligned WRITEs.

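The WRITE split can be illustrated the same way: round the start of the
request up and the end down to find the DIO-aligned middle, and treat
whatever remains on either side as buffered segments. This is only a sketch
of the arithmetic under stated assumptions, not NFSD's implementation. The
function name is hypothetical, the block size stands in for
logical_block_size, dma_alignment is not modeled, and handling a WRITE with
no aligned middle as a single buffered segment is an assumption of this
example.

```shell
# Hypothetical helper: split_write OFFSET LEN BLOCK_SIZE
# Prints one line per segment: "<buffered|direct> START LENGTH".
split_write() {
    offset=$1
    len=$2
    bs=$3
    end=$(( offset + len ))
    # Round the start up and the end down to find the aligned middle.
    mid_start=$(( (offset + bs - 1) / bs * bs ))
    mid_end=$(( end / bs * bs ))
    if [ "$mid_start" -ge "$mid_end" ]; then
        # Assumption: with no whole aligned block in range, the entire
        # WRITE is treated as one buffered segment.
        echo "buffered $offset $len"
        return 0
    fi
    if [ "$offset" -lt "$mid_start" ]; then
        echo "buffered $offset $(( mid_start - offset ))"
    fi
    echo "direct $mid_start $(( mid_end - mid_start ))"
    if [ "$mid_end" -lt "$end" ]; then
        echo "buffered $mid_end $(( end - mid_end ))"
    fi
    return 0
}
```

With a 4096-byte block size, ``split_write 1000 10000 4096`` yields a
buffered start of 3096 bytes at offset 1000, a direct middle of 4096 bytes
at offset 4096, and a buffered end of 2808 bytes at offset 8192.
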
Tracing:
    The nfsd_read_direct trace event shows how NFSD expands any
    misaligned READ to the next DIO-aligned block (on either end of the
    original READ, as needed).

    This combination of trace events is useful for READs::

      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
      echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable

    The nfsd_write_direct trace event shows how NFSD splits a given
    misaligned WRITE into a DIO-aligned middle segment.

    This combination of trace events is useful for WRITEs::

      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
      echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
      echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable

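Toggling these events by hand before and after each test run gets tedious,
so the echo commands above can be wrapped in a small function. This is a
sketch only: the function name and the TRACING_DIR override are
hypothetical, and the event lists are copied from the commands above.
Events that are not present (for example the xfs events when the export is
not on XFS) are skipped with a message rather than treated as an error.

```shell
# Hypothetical helper: nfsd_trace_events enable|disable read|write
# TRACING_DIR is an assumption of this sketch; it defaults to the
# tracefs mount point used in the commands above.
nfsd_trace_events() {
    case "$1" in
        enable)  val=1 ;;
        disable) val=0 ;;
        *) echo "usage: nfsd_trace_events enable|disable read|write" >&2
           return 1 ;;
    esac
    case "$2" in
        read)
            events="nfsd/nfsd_read_vector nfsd/nfsd_read_direct"
            events="$events nfsd/nfsd_read_io_done xfs/xfs_file_direct_read"
            ;;
        write)
            events="nfsd/nfsd_write_opened nfsd/nfsd_write_direct"
            events="$events nfsd/nfsd_write_io_done xfs/xfs_file_direct_write"
            ;;
        *) echo "usage: nfsd_trace_events enable|disable read|write" >&2
           return 1 ;;
    esac
    dir=${TRACING_DIR:-/sys/kernel/tracing}
    for ev in $events; do
        file="$dir/events/$ev/enable"
        if [ -w "$file" ]; then
            echo "$val" > "$file"
        else
            echo "skipping $ev (not available)" >&2
        fi
    done
    return 0
}
```

After ``nfsd_trace_events enable read``, the events can be watched live with
``cat /sys/kernel/tracing/trace_pipe`` while the NFS client issues READs.
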