mirror of
https://github.com/torvalds/linux.git
synced 2025-12-07 20:06:24 +00:00
NFSD: add Documentation/filesystems/nfs/nfsd-io-modes.rst
This document details the NFSD IO modes that are configurable using NFSD's experimental debugfs interfaces: /sys/kernel/debug/nfsd/io_cache_read /sys/kernel/debug/nfsd/io_cache_write This document will evolve as NFSD's interfaces do (e.g. if/when NFSD's debugfs interfaces are replaced with per-export controls). Future updates will provide more specific guidance and howto information to help others use and evaluate NFSD's IO modes: BUFFERED, DONTCACHE and DIRECT. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This commit is contained in:
committed by
Chuck Lever
parent
06c5c97293
commit
fa8d4e6784
144
Documentation/filesystems/nfs/nfsd-io-modes.rst
Normal file
144
Documentation/filesystems/nfs/nfsd-io-modes.rst
Normal file
@@ -0,0 +1,144 @@
|
|||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
=============
|
||||||
|
NFSD IO MODES
|
||||||
|
=============
|
||||||
|
|
||||||
|
Overview
|
||||||
|
========
|
||||||
|
|
||||||
|
NFSD has historically always used buffered IO when servicing READ and
|
||||||
|
WRITE operations. BUFFERED is NFSD's default IO mode, but it is possible
|
||||||
|
to override that default to use either DONTCACHE or DIRECT IO modes.
|
||||||
|
|
||||||
|
Experimental NFSD debugfs interfaces are available to allow the NFSD IO
|
||||||
|
mode used for READ and WRITE to be configured independently. See both:
|
||||||
|
- /sys/kernel/debug/nfsd/io_cache_read
|
||||||
|
- /sys/kernel/debug/nfsd/io_cache_write
|
||||||
|
|
||||||
|
The default value for both io_cache_read and io_cache_write reflects
|
||||||
|
NFSD's default IO mode (which is NFSD_IO_BUFFERED=0).
|
||||||
|
|
||||||
|
Based on the configured settings, NFSD's IO will either be:
|
||||||
|
- cached using page cache (NFSD_IO_BUFFERED=0)
|
||||||
|
- cached but removed from page cache on completion (NFSD_IO_DONTCACHE=1)
|
||||||
|
- not cached stable_how=NFS_UNSTABLE (NFSD_IO_DIRECT=2)
|
||||||
|
|
||||||
|
To set an NFSD IO mode, write a supported value (0 - 2) to the
|
||||||
|
corresponding IO operation's debugfs interface, e.g.:
|
||||||
|
echo 2 > /sys/kernel/debug/nfsd/io_cache_read
|
||||||
|
echo 2 > /sys/kernel/debug/nfsd/io_cache_write
|
||||||
|
|
||||||
|
To check which IO mode NFSD is using for READ or WRITE, simply read the
|
||||||
|
corresponding IO operation's debugfs interface, e.g.:
|
||||||
|
cat /sys/kernel/debug/nfsd/io_cache_read
|
||||||
|
cat /sys/kernel/debug/nfsd/io_cache_write
|
||||||
|
|
||||||
|
If you experiment with NFSD's IO modes on a recent kernel and have
|
||||||
|
interesting results, please report them to linux-nfs@vger.kernel.org
|
||||||
|
|
||||||
|
NFSD DONTCACHE
|
||||||
|
==============
|
||||||
|
|
||||||
|
DONTCACHE offers a hybrid approach to servicing IO that aims to offer
|
||||||
|
the benefits of using DIRECT IO without any of the strict alignment
|
||||||
|
requirements that DIRECT IO imposes. To achieve this buffered IO is used
|
||||||
|
but the IO is flagged to "drop behind" (meaning associated pages are
|
||||||
|
dropped from the page cache) when IO completes.
|
||||||
|
|
||||||
|
DONTCACHE aims to avoid what has proven to be a fairly significant
|
||||||
|
limition of Linux's memory management subsystem if/when large amounts of
|
||||||
|
data is infrequently accessed (e.g. read once _or_ written once but not
|
||||||
|
read until much later). Such use-cases are particularly problematic
|
||||||
|
because the page cache will eventually become a bottleneck to servicing
|
||||||
|
new IO requests.
|
||||||
|
|
||||||
|
For more context on DONTCACHE, please see these Linux commit headers:
|
||||||
|
- Overview: 9ad6344568cc3 ("mm/filemap: change filemap_create_folio()
|
||||||
|
to take a struct kiocb")
|
||||||
|
- for READ: 8026e49bff9b1 ("mm/filemap: add read support for
|
||||||
|
RWF_DONTCACHE")
|
||||||
|
- for WRITE: 974c5e6139db3 ("xfs: flag as supporting FOP_DONTCACHE")
|
||||||
|
|
||||||
|
NFSD_IO_DONTCACHE will fall back to NFSD_IO_BUFFERED if the underlying
|
||||||
|
filesystem doesn't indicate support by setting FOP_DONTCACHE.
|
||||||
|
|
||||||
|
NFSD DIRECT
|
||||||
|
===========
|
||||||
|
|
||||||
|
DIRECT IO doesn't make use of the page cache, as such it is able to
|
||||||
|
avoid the Linux memory management's page reclaim scalability problems
|
||||||
|
without resorting to the hybrid use of page cache that DONTCACHE does.
|
||||||
|
|
||||||
|
Some workloads benefit from NFSD avoiding the page cache, particularly
|
||||||
|
those with a working set that is significantly larger than available
|
||||||
|
system memory. The pathological worst-case workload that NFSD DIRECT has
|
||||||
|
proven to help most is: NFS client issuing large sequential IO to a file
|
||||||
|
that is 2-3 times larger than the NFS server's available system memory.
|
||||||
|
The reason for such improvement is NFSD DIRECT eliminates a lot of work
|
||||||
|
that the memory management subsystem would otherwise be required to
|
||||||
|
perform (e.g. page allocation, dirty writeback, page reclaim). When
|
||||||
|
using NFSD DIRECT, kswapd and kcompactd are no longer commanding CPU
|
||||||
|
time trying to find adequate free pages so that forward IO progress can
|
||||||
|
be made.
|
||||||
|
|
||||||
|
The performance win associated with using NFSD DIRECT was previously
|
||||||
|
discussed on linux-nfs, see:
|
||||||
|
https://lore.kernel.org/linux-nfs/aEslwqa9iMeZjjlV@kernel.org/
|
||||||
|
But in summary:
|
||||||
|
- NFSD DIRECT can significantly reduce memory requirements
|
||||||
|
- NFSD DIRECT can reduce CPU load by avoiding costly page reclaim work
|
||||||
|
- NFSD DIRECT can offer more deterministic IO performance
|
||||||
|
|
||||||
|
As always, your mileage may vary and so it is important to carefully
|
||||||
|
consider if/when it is beneficial to make use of NFSD DIRECT. When
|
||||||
|
assessing comparative performance of your workload please be sure to log
|
||||||
|
relevant performance metrics during testing (e.g. memory usage, cpu
|
||||||
|
usage, IO performance). Using perf to collect perf data that may be used
|
||||||
|
to generate a "flamegraph" for work Linux must perform on behalf of your
|
||||||
|
test is a really meaningful way to compare the relative health of the
|
||||||
|
system and how switching NFSD's IO mode changes what is observed.
|
||||||
|
|
||||||
|
If NFSD_IO_DIRECT is specified by writing 2 (or 3 and 4 for WRITE) to
|
||||||
|
NFSD's debugfs interfaces, ideally the IO will be aligned relative to
|
||||||
|
the underlying block device's logical_block_size. Also the memory buffer
|
||||||
|
used to store the READ or WRITE payload must be aligned relative to the
|
||||||
|
underlying block device's dma_alignment.
|
||||||
|
|
||||||
|
But NFSD DIRECT does handle misaligned IO in terms of O_DIRECT as best
|
||||||
|
it can:
|
||||||
|
|
||||||
|
Misaligned READ:
|
||||||
|
If NFSD_IO_DIRECT is used, expand any misaligned READ to the next
|
||||||
|
DIO-aligned block (on either end of the READ). The expanded READ is
|
||||||
|
verified to have proper offset/len (logical_block_size) and
|
||||||
|
dma_alignment checking.
|
||||||
|
|
||||||
|
Misaligned WRITE:
|
||||||
|
If NFSD_IO_DIRECT is used, split any misaligned WRITE into a start,
|
||||||
|
middle and end as needed. The large middle segment is DIO-aligned
|
||||||
|
and the start and/or end are misaligned. Buffered IO is used for the
|
||||||
|
misaligned segments and O_DIRECT is used for the middle DIO-aligned
|
||||||
|
segment. DONTCACHE buffered IO is _not_ used for the misaligned
|
||||||
|
segments because using normal buffered IO offers significant RMW
|
||||||
|
performance benefit when handling streaming misaligned WRITEs.
|
||||||
|
|
||||||
|
Tracing:
|
||||||
|
The nfsd_read_direct trace event shows how NFSD expands any
|
||||||
|
misaligned READ to the next DIO-aligned block (on either end of the
|
||||||
|
original READ, as needed).
|
||||||
|
|
||||||
|
This combination of trace events is useful for READs:
|
||||||
|
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_vector/enable
|
||||||
|
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_direct/enable
|
||||||
|
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_read_io_done/enable
|
||||||
|
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_read/enable
|
||||||
|
|
||||||
|
The nfsd_write_direct trace event shows how NFSD splits a given
|
||||||
|
misaligned WRITE into a DIO-aligned middle segment.
|
||||||
|
|
||||||
|
This combination of trace events is useful for WRITEs:
|
||||||
|
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_opened/enable
|
||||||
|
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_direct/enable
|
||||||
|
echo 1 > /sys/kernel/tracing/events/nfsd/nfsd_write_io_done/enable
|
||||||
|
echo 1 > /sys/kernel/tracing/events/xfs/xfs_file_direct_write/enable
|
||||||
Reference in New Issue
Block a user