mirror of
https://github.com/torvalds/linux.git
synced 2025-12-07 20:06:24 +00:00
94dea151bf3651c01acb12a38ca75ba9d26ea4da
948892 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5e105c88ab |
KVM: nVMX: check for invalid hdr.vmx.flags
hdr.vmx.flags is meant for future extensions to the ABI, rejecting invalid flags is necessary to avoid broken half-loads of the nVMX state. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
0f02bd0ade |
KVM: nVMX: check for required but missing VMCS12 in KVM_SET_NESTED_STATE
A missing VMCS12 was not causing -EINVAL (it was just read with copy_from_user, so it is not a security issue, but it is still wrong). Test for VMCS12 validity and reject the nested state if a VMCS12 is required but not present. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
9319676595 |
selftests: kvm: do not set guest mode flag
Setting KVM_STATE_NESTED_GUEST_MODE enables various consistency checks on VMCS12 and therefore causes KVM_SET_NESTED_STATE to fail spuriously with -EINVAL. Do not set the flag so that we're sure to cover the conditions included by the test, and cover the case where VMCS12 is set and KVM_SET_NESTED_STATE is called with invalid VMCS12 contents. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
||
|
|
2ab9a40966 |
ASoC: intel: use asoc_substream_to_rtd()
Now we can use asoc_substream_to_rtd() macro, let's use it. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/87tuxtydcz.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown <broonie@kernel.org> |
||
|
|
2207b93bc7 |
ASoC: intel/boards: use asoc_substream_to_rtd()
Now we can use asoc_substream_to_rtd() macro, let's use it. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com> Link: https://lore.kernel.org/r/87v9i9yddc.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown <broonie@kernel.org> |
||
|
|
2861ba7a0c |
ACPICA: Update version to 20200717
ACPICA commit c1adb9a2a775df7a85df0103342ebf090e1b2016 Version 20200717. Link: https://github.com/acpica/acpica/commit/c1adb9a2 Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Erik Kaneda <erik.kaneda@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
6a54ebae6d |
ACPICA: Do not increment operation_region reference counts for field units
ACPICA commit e17b28cfcc31918d0db9547b6b274b09c413eb70 Object reference counts are used as a part of ACPICA's garbage collection mechanism. This mechanism keeps track of references to heap-allocated structures such as the ACPI operand objects. Recent server firmware has revealed that this reference count can overflow on large servers that declare many field units under the same operation_region. This occurs because each field unit declaration will add a reference count to the source operation_region. This change solves the reference count overflow for operation_regions objects by preventing fieldunits from incrementing their operation_region's reference count. Each operation_region's reference count will not be changed by named objects declared under the Field operator. During namespace deletion, the operation_region namespace node will be deleted and each fieldunit will be deleted without touching the deleted operation_region object. Link: https://github.com/acpica/acpica/commit/e17b28cf Signed-off-by: Erik Kaneda <erik.kaneda@intel.com> Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
10cfde5dc6 |
ACPICA: Replace one-element array with flexible-array
ACPICA commit 7ba2f3d91a32f104765961fda0ed78b884ae193d
The current codebase makes use of one-element arrays in the following
form:
struct something {
int length;
u8 data[1];
};
struct something *instance;
instance = kmalloc(sizeof(*instance) + size, GFP_KERNEL);
instance->length = size;
memcpy(instance->data, source, size);
but the preferred mechanism to declare variable-length types such as
these ones is a flexible array member[1][2], introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure,
which will help us prevent some kind of undefined behavior bugs from
being inadvertently introduced[3] to the linux codebase from now on.
This issue was found with the help of Coccinelle and audited _manually_.
[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit
|
||
|
|
4ce7796632 |
ACPI: Replace HTTP links with HTTPS ones
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
||
|
|
66d3037898 |
Merge tag 'at91-defconfig-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux into arm/defconfig
AT91 defconfig for 5.9 - Add ClassD, KSZ ethernet switches, brdige, vlan to sama5_defconfig - Reenable CAN support in sama5_defconfig * tag 'at91-defconfig-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/at91/linux: ARM: configs: at91: sama5: enable CAN PLATFORM driver ARM: configs: at91: sama5: enable bridge and VLAN filtering ARM: configs: at91: sama5: add support for KSZ ethernet switches ARM: configs: at91: sama5: Enable CLASSD Link: https://lore.kernel.org/r/20200726192810.GA181818@piout.net Signed-off-by: Arnd Bergmann <arnd@arndb.de> |
||
|
|
9c52a2647a |
Merge tag 'drivers_soc_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone into arm/drivers
SOC: TI Keystone driver update for v5.9 - TI K3 Ring Accelerator updates - Few non critical warining fixes * tag 'drivers_soc_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone: soc: TI knav_qmss: make symbol 'knav_acc_range_ops' static firmware: ti_sci: Replace HTTP links with HTTPS ones soc: ti/ti_sci_protocol.h: drop a duplicated word + clarify soc: ti: k3: fix semicolon.cocci warnings soc: ti: k3-ringacc: fix: warn: variable dereferenced before check 'ring' dmaengine: ti: k3-udma: Switch to k3_ringacc_request_rings_pair soc: ti: k3-ringacc: separate soc specific initialization soc: ti: k3-ringacc: add request pair of rings api. soc: ti: k3-ringacc: add ring's flags to dump soc: ti: k3-ringacc: Move state tracking variables under a struct dt-bindings: soc: ti: k3-ringacc: convert bindings to json-schema Link: https://lore.kernel.org/r/1595711814-7015-1-git-send-email-santosh.shilimkar@oracle.com Signed-off-by: Arnd Bergmann <arnd@arndb.de> |
||
|
|
8365a898fe |
powercap: Add Power Limit4 support
Modern Intel Mobile platforms support power limit4 (PL4), which is the SoC package level maximum power limit (in Watts). It can be used to preemptively limits potential SoC power to prevent power spikes from tripping the power adapter and battery over-current protection. This patch enables this feature by exposing package level peak power capping control to userspace via RAPL sysfs interface. With this, application like DTPF can modify PL4 power limit, the similar way of other package power limit (PL1). As this feature is not tested on previous generations, here it is enabled only for the platform that has been verified to work, for safety concerns. Signed-off-by: Sumeet Pawnikar <sumeet.r.pawnikar@intel.com> Co-developed-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
0585c1c06a |
ACPI: Use valid link to the ACPI specification
Currently, acpi.info is an invalid link to access ACPI specification, the new valid link is https://uefi.org/specifications. Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
6fa38ef153 |
ALSA: hda/realtek: Fix add a "ultra_low_power" function for intel reference board (alc256)
Intel requires to enable power saving mode for intel reference board (alc256) Signed-off-by: PeiSen Hou <pshou@realtek.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20200727115647.10967-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de> |
||
|
|
e86d1aa8b6 |
iommu/arm-smmu: Move Arm SMMU drivers into their own subdirectory
The Arm SMMU drivers are getting fat on vendor value-add, so move them to their own subdirectory out of the way of the other IOMMU drivers. Suggested-by: Joerg Roedel <joro@8bytes.org> Signed-off-by: Will Deacon <will@kernel.org> |
||
|
|
02fd86b661 |
mmc: jz4740: Use pm_ptr() macro
Use the newly introduced pm_ptr() macro to simplify the code. Signed-off-by: Paul Cercueil <paul@crapouillou.net> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
756a64ce34 |
PM: Make *_DEV_PM_OPS macros use __maybe_unused
This way, when the dev_pm_ops instance is not referenced anywhere, it will simply be dropped by the compiler without a warning. Signed-off-by: Paul Cercueil <paul@crapouillou.net> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
7a82e97a11 |
PM: core: introduce pm_ptr() macro
This macro is analogous to the infamous of_match_ptr(). If CONFIG_PM is enabled, this macro will resolve to its argument, otherwise to NULL. Signed-off-by: Paul Cercueil <paul@crapouillou.net> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
17a8271658 |
USB: iowarrior: fix up report size handling for some devices
In previous patches that added support for new iowarrior devices, the handling of the report size was not done correct. Fix that up and update the copyright date for the driver Reworked from an original patch written by Christoph Jung. Fixes: |
||
|
|
e98ba8cc3f |
Merge tag 'usb-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-next
Felipe writes: USB: changes for v5.9 merge window CDNS3 got several improvements, most of which are non-critical fixes. DWC3 has a reset fix for the meson platform, while dwc2 has improvements for role switch on STM32MP15 SoCs. Apart from these, we have the usual set of non-critical fixes all over the place and support for new Ingenic SoC to their PHY driver. * tag 'usb-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb: (38 commits) usb: dwc3: gadget: when the started list is empty stop the active xfer usb: dwc3: gadget: make starting isoc transfers more robust usb: dwc3: gadget: add frame number mask usb: gadget: function: printer: Interface is disabled and returns error usb: gadget: f_uac2: fix AC Interface Header Descriptor wTotalLength dt-bindings: usb: ti,keystone-dwc3.yaml: Improve schema usb: bdc: Use devm_clk_get_optional() usb: bdc: Halt controller on suspend usb: bdc: driver runs out of buffer descriptors on large ADB transfers usb: bdc: Adb shows offline after resuming from S2 bdc: Fix bug causing crash after multiple disconnects usb: bdc: Add compatible string for new style USB DT nodes dt-bindings: usb: bdc: Update compatible strings USB: PHY: JZ4770: Reformat the code to align it. USB: PHY: JZ4770: Add support for new Ingenic SoCs. USB: PHY: JZ4770: Unify code style and simplify code. dt-bindings: USB: Add bindings for new Ingenic SoCs. usb: gadget: net2280: fix memory leak on probe error handling paths usb: cdns3: drd: simplify *switch_gadet and *switch_host usb: cdns3: core: removed overwriting some error code ... |
||
|
|
d6af2ed29c |
PCI: hv: Fix a timing issue which causes kdump to fail occasionally
Kdump could fail sometime on Hyper-V guest because the retry in
hv_pci_enter_d0() releases child device structures in hv_pci_bus_exit().
Although there is a second asynchronous device relations message sending
from the host, if this message arrives to the guest after
hv_send_resource_allocated() is called, the retry would fail.
Fix the problem by moving retry to hv_pci_probe() and start the retry
from hv_pci_query_relations() call. This will cause a device relations
message to arrive to the guest synchronously; the guest would then be
able to rebuild the child device structures before calling
hv_send_resource_allocated().
Link: https://lore.kernel.org/r/20200727071731.18516-1-weh@microsoft.com
Fixes:
|
||
|
|
68f9b22842 |
pm-graph v5.7 - important s2idle fixes
Important fixes: - in s2idle, use timekeeping_freeze trace mark instead of machine_suspend to denote entry into s2idle mode. - in s2idle, use machine_suspend trace mark to create a new virtual device called "s2idle_enter_<n>x". It denotes an s2idle_enter call loop of <n> iterations where s2idle was never actually achieved. It isn't counted as "freeze time" in the header. - in s2idle, only show multiple freeze times if s2idle went in and out of resume_noirq. Otherwise multiple freezes are shown with "waking" time subtracted (waking time is time spent outside s2idle dealing with wakeups). - in s2idle summaries, include "FREEZEWAKE" as an issue when at least 1ms is spent waking from s2idle. A clean run should only wake for the rtc timer. - add support for device callbacks with matching names in the same phase. In rare cases some devices register multiple callbacks from separate drivers using the same name. Without this fix only one is shown. - add kparamsfmt string back to fix bootgraph General updates: - when suspend_machine is missing, error says "failed in suspend_machine" - extract target count/time and add to summary title if -multi used - include any instances of "timeout" in dmesg as issues to be logged. - fix ftrace parse to handle any number of flags (instead of just 4). - remove sync/async_device string from device detail, remains in hover. - when using callgraph (-f) add driver name to callgraph titles. Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
||
|
|
5e548b3201 |
btrfs: do not set the full sync flag on the inode during page release
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we always set the full
sync flag on the inode, which forces the next fsync to use a slower code
path.
This hurts performance for workloads that dirty an amount of data that
exceeds or is very close to the system's RAM memory and do frequent fsync
operations (like database servers can for example). In particular if there
are concurrent fsyncs against different files, by falling back to a full
fsync we do a lot more checksum lookups in the checksums btree, as we do
it for all the extents created in the current transaction, instead of only
the new ones since the last fsync. These checksums lookups not only take
some time but, more importantly, they also cause contention on the
checksums btree locks due to the concurrency with checksum insertions in
the btree by ordered extents from other inodes.
We actually don't need to set the full sync flag on the inode, because we
only remove extent maps that are in the list of modified extents if they
were created in a past transaction, in which case an fsync skips them as
it's pointless to log them. So stop setting the full fsync flag on the
inode whenever we remove an extent map.
This patch is part of a patchset that consists of 3 patches, which have
the following subjects:
1/3 btrfs: fix race between page release and a fast fsync
2/3 btrfs: release old extent maps during page release
3/3 btrfs: do not set the full sync flag on the inode during page release
Performance tests were ran against a branch (misc-next) containing the
whole patchset. The test exercises a workload where there are multiple
processes writing to files and fsyncing them (each writing and fsyncing
its own file), and in total the amount of data dirtied ranges from 2x to
4x the system's RAM memory (16GiB), so that the page release callback is
invoked frequently.
The following script, using fio, was used to perform the tests:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.
The obtained results were the following, and the last line printed by
fio is pasted (includes aggregated throughput and test run time).
*****************************************************
**** 1 job, 32GiB file, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
After patchset:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
(+0.7% throughput, -0.8% run time)
*****************************************************
**** 2 jobs, 16GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
After patchset:
WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
(+19.1% throughput, -16.1% runtime)
*****************************************************
**** 4 jobs, 8GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
After patchset:
WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
(+37.8% throughput, -27.5% runtime)
*****************************************************
**** 8 jobs, 4GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
After patchset:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
(+74.7% throughput, -43.0% run time)
*****************************************************
**** 16 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
After patchset:
WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
(+146.3% throughput, -59.5% run time)
*****************************************************
**** 32 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
After patchset:
WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
(+212.7% throughput, -68.0% run time)
*****************************************************
**** 64 jobs, 1GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
After patchset:
WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
(+197.6% throughput, -66.4% run time)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
fbc2bd7e7a |
btrfs: release old extent maps during page release
When removing an extent map at try_release_extent_mapping(), called through the page release callback (btrfs_releasepage()), we never release an extent map that is in the list of modified extents. This is to prevent races with a concurrent fsync using the fast path, which could lead to not logging an extent created in the current transaction. However we can safely remove an extent map created in a past transaction that is still in the list of modified extents (because no one fsynced yet the inode after that transaction got commited), because such extents are skipped during an fsync as it is pointless to log them. This change does that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
3d6448e631 |
btrfs: fix race between page release and a fast fsync
When releasing an extent map, done through the page release callback, we
can race with an ongoing fast fsync and cause the fsync to miss a new
extent and not log it. The steps for this to happen are the following:
1) A page is dirtied for some inode I;
2) Writeback for that page is triggered by a path other than fsync, for
example by the system due to memory pressure;
3) When the ordered extent for the extent (a single 4K page) finishes,
we unpin the corresponding extent map and set its generation to N,
the current transaction's generation;
4) The btrfs_releasepage() callback is invoked by the system due to
memory pressure for that no longer dirty page of inode I;
5) At the same time, some task calls fsync on inode I, joins transaction
N, and at btrfs_log_inode() it sees that the inode does not have the
full sync flag set, so we proceed with a fast fsync. But before we get
into btrfs_log_changed_extents() and lock the inode's extent map tree:
6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
and we remove the extent map for the new 4Kb extent, because it is
neither pinned anymore nor locked. By calling remove_extent_mapping(),
we remove the extent map from the list of modified extents, since the
extent map does not have the logging flag set. We unlock the inode's
extent map tree;
7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
locks the inode's extent map tree and iterates its list of modified
extents, which no longer has the 4Kb extent in it, so it does not log
the extent;
8) The fsync finishes;
9) Before transaction N is committed, a power failure happens. After
replaying the log, the 4K extent of inode I will be missing, since
it was not logged due to the race with try_release_extent_mapping().
So fix this by teaching try_release_extent_mapping() to not remove an
extent map if it's still in the list of modified extents.
Fixes:
|
||
|
|
88c4703f00 |
btrfs: open-code remount flag setting in btrfs_remount
When we're (re)mounting a btrfs filesystem we set the BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async reclaim or defrags. This flag is set in btrfs_remount_prepare() called by btrfs_remount(). As btrfs_remount_prepare() does nothing but setting this flag and doesn't have a second caller, we can just open-code the flag setting in btrfs_remount(). Similarly do for so clearing of the flag by moving it out of btrfs_remount_cleanup() into btrfs_remount() to be symmetrical. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
162e0a16b7 |
btrfs: if we're restriping, use the target restripe profile
Previously we depended on some weird behavior in our chunk allocator to force the allocation of new stripes, so by the time we got to doing the reduce we would usually already have a chunk with the proper target. However that behavior causes other problems and needs to be removed. First however we need to remove this check to only restripe if we already have those available profiles, because if we're allocating our first chunk it obviously will not be available. Simply use the target as specified, and if that fails it'll be because we're out of space. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
349e120ece |
btrfs: don't adjust bg flags and use default allocation profiles
btrfs/061 has been failing consistently for me recently with a
transaction abort. We run out of space in the system chunk array, which
means we've allocated way too many system chunks than we need.
Chris added this a long time ago for balance as a poor mans restriping.
If you had a single disk and then added another disk and then did a
balance, update_block_group_flags would then figure out which RAID level
you needed.
Fast forward to today and we have restriping behavior, so we can
explicitly tell the fs that we're trying to change the raid level. This
is accomplished through the normal get_alloc_profile path.
Furthermore this code actually causes btrfs/061 to fail, because we do
things like mkfs -m dup -d single with multiple devices. This trips
this check
alloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but
not actually restriping, we keep forcing chunk allocation of RAID1
chunks. This eventually causes us to run out of system space and the
file system aborts and flips read only.
We don't need this poor mans restriping any more, simply use the normal
get_alloc_profile helper, which will get the correct alloc_flags and
thus make the right decision for chunk allocation. This keeps us from
allocating a billion system chunks and falling over.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
ab0db043c3 |
btrfs: fix lockdep splat from btrfs_dump_space_info
When running with -o enospc_debug you can get the following splat if one of the dump_space_info's trip ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc5+ #20 Tainted: G OE ------------------------------------------------------ dd/563090 is trying to acquire lock: ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs] but task is already holding lock: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&cache->lock){+.+.}-{2:2}: _raw_spin_lock+0x25/0x30 btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs] find_free_extent+0x7ef/0x13b0 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x340 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x122/0x530 [btrfs] btrfs_cow_block+0x106/0x210 [btrfs] commit_cowonly_roots+0x55/0x300 [btrfs] btrfs_commit_transaction+0x4ed/0xac0 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x36/0x70 cleanup_mnt+0x104/0x160 task_work_run+0x5f/0x90 __prepare_exit_to_usermode+0x1bd/0x1c0 do_syscall_64+0x5e/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (&space_info->lock){+.+.}-{2:2}: _raw_spin_lock+0x25/0x30 btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs] btrfs_inode_rsv_release+0x4f/0x170 [btrfs] btrfs_clear_delalloc_extent+0x155/0x480 [btrfs] clear_state_bit+0x81/0x1a0 [btrfs] __clear_extent_bit+0x25c/0x5d0 [btrfs] clear_extent_bit+0x15/0x20 [btrfs] btrfs_invalidatepage+0x2b7/0x3c0 [btrfs] truncate_cleanup_page+0x47/0xe0 truncate_inode_pages_range+0x238/0x840 truncate_pagecache+0x44/0x60 btrfs_setattr+0x202/0x5e0 [btrfs] notify_change+0x33b/0x490 do_truncate+0x76/0xd0 path_openat+0x687/0xa10 do_filp_open+0x91/0x100 do_sys_openat2+0x215/0x2d0 do_sys_open+0x44/0x80 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&tree->lock#2){+.+.}-{2:2}: _raw_spin_lock+0x25/0x30 find_first_extent_bit+0x32/0x150 [btrfs] write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs] __btrfs_write_out_cache+0x172/0x480 [btrfs] btrfs_write_out_cache+0x7a/0xf0 [btrfs] btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs] commit_cowonly_roots+0x245/0x300 [btrfs] btrfs_commit_transaction+0x4ed/0xac0 [btrfs] close_ctree+0xf9/0x2f5 [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x36/0x70 cleanup_mnt+0x104/0x160 task_work_run+0x5f/0x90 __prepare_exit_to_usermode+0x1bd/0x1c0 do_syscall_64+0x5e/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&ctl->tree_lock){+.+.}-{2:2}: __lock_acquire+0x1240/0x2460 lock_acquire+0xab/0x360 _raw_spin_lock+0x25/0x30 btrfs_dump_free_space+0x2b/0xa0 [btrfs] btrfs_dump_space_info+0xf4/0x120 [btrfs] btrfs_reserve_extent+0x176/0x180 [btrfs] __btrfs_prealloc_file_range+0x145/0x550 [btrfs] cache_save_setup+0x28d/0x3b0 [btrfs] btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs] btrfs_commit_transaction+0xcc/0xac0 [btrfs] btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs] btrfs_check_data_free_space+0x4c/0xa0 [btrfs] btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs] btrfs_file_write_iter+0x3cf/0x610 [btrfs] new_sync_write+0x11e/0x1b0 vfs_write+0x1c9/0x200 ksys_write+0x68/0xe0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &ctl->tree_lock --> &space_info->lock --> &cache->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&cache->lock); lock(&space_info->lock); lock(&cache->lock); lock(&ctl->tree_lock); *** DEADLOCK *** 6 locks held by dd/563090: #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200 #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs] #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs] #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs] #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs] #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs] stack backtrace: CPU: 3 PID: 563090 Comm: dd Tainted: G OE 5.8.0-rc5+ #20 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011 Call Trace: dump_stack+0x96/0xd0 check_noncircular+0x162/0x180 __lock_acquire+0x1240/0x2460 ? wake_up_klogd.part.0+0x30/0x40 lock_acquire+0xab/0x360 ? btrfs_dump_free_space+0x2b/0xa0 [btrfs] _raw_spin_lock+0x25/0x30 ? btrfs_dump_free_space+0x2b/0xa0 [btrfs] btrfs_dump_free_space+0x2b/0xa0 [btrfs] btrfs_dump_space_info+0xf4/0x120 [btrfs] btrfs_reserve_extent+0x176/0x180 [btrfs] __btrfs_prealloc_file_range+0x145/0x550 [btrfs] ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs] cache_save_setup+0x28d/0x3b0 [btrfs] btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs] btrfs_commit_transaction+0xcc/0xac0 [btrfs] ? start_transaction+0xe0/0x5b0 [btrfs] btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs] btrfs_check_data_free_space+0x4c/0xa0 [btrfs] btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs] ? ktime_get_coarse_real_ts64+0xa8/0xd0 ? trace_hardirqs_on+0x1c/0xe0 btrfs_file_write_iter+0x3cf/0x610 [btrfs] new_sync_write+0x11e/0x1b0 vfs_write+0x1c9/0x200 ksys_write+0x68/0xe0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is because we're holding the block_group->lock while trying to dump the free space cache. However we don't need this lock, we just need it to read the values for the printk, so move the free space cache dumping outside of the block group lock. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
01d01caf19 |
btrfs: move the chunk_mutex in btrfs_read_chunk_tree
We are currently getting this lockdep splat in btrfs/161: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc5+ #20 Tainted: G E ------------------------------------------------------ mount/678048 is trying to acquire lock: ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs] but task is already holding lock: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x8b/0x8f0 btrfs_init_new_device+0x2d2/0x1240 [btrfs] btrfs_ioctl+0x1de/0x2d20 [btrfs] ksys_ioctl+0x87/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __lock_acquire+0x1240/0x2460 lock_acquire+0xab/0x360 __mutex_lock+0x8b/0x8f0 clone_fs_devices+0x4d/0x170 [btrfs] btrfs_read_chunk_tree+0x330/0x800 [btrfs] open_ctree+0xb7c/0x18ce [btrfs] btrfs_mount_root.cold+0x13/0xfa [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x7de/0xb30 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); *** DEADLOCK *** 3 locks held by mount/678048: #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380 #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs] #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs] stack backtrace: CPU: 2 PID: 678048 Comm: mount Tainted: G E 5.8.0-rc5+ #20 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011 Call Trace: dump_stack+0x96/0xd0 check_noncircular+0x162/0x180 __lock_acquire+0x1240/0x2460 ? asm_sysvec_apic_timer_interrupt+0x12/0x20 lock_acquire+0xab/0x360 ? clone_fs_devices+0x4d/0x170 [btrfs] __mutex_lock+0x8b/0x8f0 ? clone_fs_devices+0x4d/0x170 [btrfs] ? rcu_read_lock_sched_held+0x52/0x60 ? cpumask_next+0x16/0x20 ? module_assert_mutex_or_preempt+0x14/0x40 ? __module_address+0x28/0xf0 ? clone_fs_devices+0x4d/0x170 [btrfs] ? static_obj+0x4f/0x60 ? lockdep_init_map_waits+0x43/0x200 ? clone_fs_devices+0x4d/0x170 [btrfs] clone_fs_devices+0x4d/0x170 [btrfs] btrfs_read_chunk_tree+0x330/0x800 [btrfs] open_ctree+0xb7c/0x18ce [btrfs] ? super_setup_bdi_name+0x79/0xd0 btrfs_mount_root.cold+0x13/0xfa [btrfs] ? vfs_parse_fs_string+0x84/0xb0 ? rcu_read_lock_sched_held+0x52/0x60 ? kfree+0x2b5/0x310 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] ? cred_has_capability+0x7c/0x120 ? rcu_read_lock_sched_held+0x52/0x60 ? legacy_get_tree+0x30/0x50 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x7de/0xb30 ? memdup_user+0x4e/0x90 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and then read the device, which takes the device_list_mutex. The device_list_mutex needs to be taken before the chunk_mutex, so this is a problem. We only really need the chunk mutex around adding the chunk, so move the mutex around read_one_chunk. An argument could be made that we don't even need the chunk_mutex here as it's during mount, and we are protected by various other locks. However we already have special rules for ->device_list_mutex, and I'd rather not have another special case for ->chunk_mutex. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
18c850fdc5 |
btrfs: open device without device_list_mutex
There's long existed a lockdep splat because we open our bdev's under the ->device_list_mutex at mount time, which acquires the bd_mutex. Usually this goes unnoticed, but if you do loopback devices at all suddenly the bd_mutex comes with a whole host of other dependencies, which results in the splat when you mount a btrfs file system. ====================================================== WARNING: possible circular locking dependency detected 5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted ------------------------------------------------------ systemd-journal/509 is trying to acquire lock: ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs] but task is already holding lock: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #6 (sb_pagefaults){.+.+}-{0:0}: __sb_start_write+0x13e/0x220 btrfs_page_mkwrite+0x59/0x560 [btrfs] do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 asm_exc_page_fault+0x1e/0x30 -> #5 (&mm->mmap_lock#2){++++}-{3:3}: __might_fault+0x60/0x80 _copy_from_user+0x20/0xb0 get_sg_io_hdr+0x9a/0xb0 scsi_cmd_ioctl+0x1ea/0x2f0 cdrom_ioctl+0x3c/0x12b4 sr_block_ioctl+0xa4/0xd0 block_ioctl+0x3f/0x50 ksys_ioctl+0x82/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #4 (&cd->lock){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 sr_block_open+0xa2/0x180 __blkdev_get+0xdd/0x550 blkdev_get+0x38/0x150 do_dentry_open+0x16b/0x3e0 path_openat+0x3c9/0xa00 do_filp_open+0x75/0x100 do_sys_openat2+0x8a/0x140 __x64_sys_openat+0x46/0x70 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #3 (&bdev->bd_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 __blkdev_get+0x6a/0x550 blkdev_get+0x85/0x150 blkdev_get_by_path+0x2c/0x70 btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] open_fs_devices+0x88/0x240 [btrfs] btrfs_open_devices+0x92/0xa0 [btrfs] btrfs_mount_root+0x250/0x490 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x119/0x380 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x8c6/0xca0 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 btrfs_run_dev_stats+0x36/0x420 [btrfs] commit_cowonly_roots+0x91/0x2d0 [btrfs] btrfs_commit_transaction+0x4e6/0x9f0 [btrfs] btrfs_sync_file+0x38a/0x480 [btrfs] __x64_sys_fdatasync+0x47/0x80 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 btrfs_commit_transaction+0x48e/0x9f0 [btrfs] btrfs_sync_file+0x38a/0x480 [btrfs] __x64_sys_fdatasync+0x47/0x80 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}: __lock_acquire+0x1241/0x20c0 lock_acquire+0xb0/0x400 __mutex_lock+0x7b/0x820 btrfs_record_root_in_trans+0x44/0x70 [btrfs] start_transaction+0xd2/0x500 [btrfs] btrfs_dirty_inode+0x44/0xd0 [btrfs] file_update_time+0xc6/0x120 btrfs_page_mkwrite+0xda/0x560 [btrfs] do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 asm_exc_page_fault+0x1e/0x30 other info that might help us debug this: Chain exists of: &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_pagefaults); lock(&mm->mmap_lock#2); lock(sb_pagefaults); lock(&fs_info->reloc_mutex); *** DEADLOCK *** 3 locks held by systemd-journal/509: #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0 #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs] #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs] stack backtrace: CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack+0x92/0xc8 check_noncircular+0x134/0x150 __lock_acquire+0x1241/0x20c0 lock_acquire+0xb0/0x400 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] ? lock_acquire+0xb0/0x400 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] __mutex_lock+0x7b/0x820 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] ? kvm_sched_clock_read+0x14/0x30 ? sched_clock+0x5/0x10 ? sched_clock_cpu+0xc/0xb0 btrfs_record_root_in_trans+0x44/0x70 [btrfs] start_transaction+0xd2/0x500 [btrfs] btrfs_dirty_inode+0x44/0xd0 [btrfs] file_update_time+0xc6/0x120 btrfs_page_mkwrite+0xda/0x560 [btrfs] ? sched_clock+0x5/0x10 do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x7fa3972fdbfe Code: Bad RIP value. Fix this by not holding the ->device_list_mutex at this point. The device_list_mutex exists to protect us from modifying the device list while the file system is running. However it can also be modified by doing a scan on a device. But this action is specifically protected by the uuid_mutex, which we are holding here. We cannot race with opening at this point because we have the ->s_mount lock held during the mount. Not having the ->device_list_mutex here is perfectly safe as we're not going to change the devices at this point. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add some comments ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
a47bd78d0c |
btrfs: sysfs: use NOFS for device creation
Dave hit this splat during testing btrfs/078: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc6-default+ #1191 Not tainted ------------------------------------------------------ kswapd0/75 is trying to acquire lock: ffffa040e9d04ff8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] but task is already holding lock: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (fs_reclaim){+.+.}-{0:0}: __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 fs_reclaim_acquire.part.0+0x25/0x30 __kmalloc_track_caller+0x49/0x330 kstrdup+0x2e/0x60 __kernfs_new_node.constprop.0+0x44/0x250 kernfs_new_node+0x25/0x50 kernfs_create_link+0x34/0xa0 sysfs_do_create_link_sd+0x5e/0xd0 btrfs_sysfs_add_devices_dir+0x65/0x100 [btrfs] btrfs_init_new_device+0x44c/0x12b0 [btrfs] btrfs_ioctl+0xc3c/0x25c0 [btrfs] ksys_ioctl+0x68/0xa0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 __mutex_lock+0xa0/0xaf0 btrfs_chunk_alloc+0x137/0x3e0 [btrfs] find_free_extent+0xb44/0xfb0 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x143/0x7a0 [btrfs] btrfs_cow_block+0x15f/0x310 [btrfs] push_leaf_right+0x150/0x240 [btrfs] split_leaf+0x3cd/0x6d0 [btrfs] btrfs_search_slot+0xd14/0xf70 [btrfs] btrfs_insert_empty_items+0x64/0xc0 [btrfs] __btrfs_commit_inode_delayed_items+0xb2/0x840 [btrfs] btrfs_async_run_delayed_root+0x10e/0x1d0 [btrfs] btrfs_work_helper+0x2f9/0x650 [btrfs] process_one_work+0x22c/0x600 worker_thread+0x50/0x3b0 kthread+0x137/0x150 ret_from_fork+0x1f/0x30 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add+0x98/0xa20 validate_chain+0xa8c/0x2a00 __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 __mutex_lock+0xa0/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] btrfs_evict_inode+0x3bf/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x175/0x420 shrink_slab+0xb1/0x2e0 shrink_node+0x192/0x600 balance_pgdat+0x31f/0x750 kswapd+0x206/0x510 kthread+0x137/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&fs_info->chunk_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/75: #0: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8b0b50b8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0 #2: ffffa040e057c0e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50 stack backtrace: CPU: 2 PID: 75 Comm: kswapd0 Not tainted 5.8.0-rc6-default+ #1191 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x16f/0x190 check_prev_add+0x98/0xa20 validate_chain+0xa8c/0x2a00 __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] __mutex_lock+0xa0/0xaf0 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] ? __lock_acquire+0x56f/0xaa0 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] ? lock_acquire+0xa3/0x440 ? btrfs_evict_inode+0x138/0x560 [btrfs] ? btrfs_evict_inode+0x2fe/0x560 [btrfs] ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] btrfs_evict_inode+0x3bf/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x175/0x420 shrink_slab+0xb1/0x2e0 shrink_node+0x192/0x600 balance_pgdat+0x31f/0x750 kswapd+0x206/0x510 ? _raw_spin_unlock_irqrestore+0x3e/0x50 ? finish_wait+0x90/0x90 ? balance_pgdat+0x750/0x750 kthread+0x137/0x150 ? kthread_stop+0x2a0/0x2a0 ret_from_fork+0x1f/0x30 This is because we're holding the chunk_mutex while adding this device and adding its sysfs entries. We actually hold different locks in different places when calling this function, the dev_replace semaphore for instance in dev replace, so instead of moving this call around simply wrap it's operations in NOFS. CC: stable@vger.kernel.org # 4.14+ Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
fbabd4a36f |
btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases
Eric reported seeing this message while running generic/475 BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted Full stack trace: BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction) BTRFS info (device dm-0): forced readonly BTRFS warning (device dm-0): Skipping commit of aborted transaction. ------------[ cut here ]------------ BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure BTRFS: Transaction aborted (error -117) BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10 WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs] BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10 CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs] RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001 RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7 RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000 R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70 FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0 Call Trace: ? finish_wait+0x90/0x90 ? __mutex_unlock_slowpath+0x45/0x2a0 ? lock_acquire+0xa3/0x440 ? lockref_put_or_lock+0x9/0x30 ? dput+0x20/0x4a0 ? dput+0x20/0x4a0 ? do_raw_spin_unlock+0x4b/0xc0 ? _raw_spin_unlock+0x1f/0x30 btrfs_sync_file+0x335/0x490 [btrfs] do_fsync+0x38/0x70 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x50/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f6a6ef1b6e3 Code: Bad RIP value. RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3 RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00 softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace af146e0e38433456 ]--- BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted This ret came from btrfs_write_marked_extents(). If we get an aborted transaction via EIO before, we'll see it in btree_write_cache_pages() and return EUCLEAN, which gets printed as "Filesystem corrupted". Except we shouldn't be returning EUCLEAN here, we need to be returning EROFS because EUCLEAN is reserved for actual corruption, not IO errors. We are inconsistent about our handling of BTRFS_FS_STATE_ERROR elsewhere, but we want to use EROFS for this particular case. The original transaction abort has the real error code for why we ended up with an aborted transaction, all subsequent actions just need to return EROFS because they may not have a trans handle and have no idea about the original cause of the abort. After patch "btrfs: don't WARN if we abort a transaction with EROFS" the stacktrace will not be dumped either. Reported-by: Eric Sandeen <esandeen@redhat.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add full test stacktrace ] Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
5913139343 |
btrfs: document special case error codes for fs errors
We've had some discussions about what to do in certain scenarios for error codes, specifically EUCLEAN and EROFS. Document these near the error handling code so its clear what their intentions are. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
f95ebdbed4 |
btrfs: don't WARN if we abort a transaction with EROFS
If we got some sort of corruption via a read and call btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and complain. If a subsequent trans handle trips over this it'll get EROFS and then abort. However at that point we're not aborting for the original reason, we're aborting because we've been flipped read only. We do not need to WARN_ON() here. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
3ebac17ce5 |
btrfs: reduce contention on log trees when logging checksums
The possibility of extents being shared (through clone and deduplication operations) requires special care when logging data checksums, to avoid having a log tree with different checksum items that cover ranges which overlap (which resulted in missing checksums after replaying a log tree). Such problems were fixed in the past by the following commits: commit |
||
|
|
b69d1ee923 |
btrfs: remove done label in writepage_delalloc
Since there is not common cleanup run after the label it makes it somewhat redundant. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
fd7fb634d6 |
btrfs: add comments for btrfs_reserve_flush_enum
This enum is the interface exposed to developers. Although we have a detailed comment explaining the whole idea of space flushing at the beginning of space-info.c, the exposed enum interface doesn't have any comment. Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are not explained at all. So add some simple comments for these enums as a quick reference. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
44d354abf3 |
btrfs: relocation: review the call sites which can be interrupted by signal
Since most metadata reservation calls can return -EINTR when get interrupted by fatal signal, we need to review the all the metadata reservation call sites. In relocation code, the metadata reservation happens in the following sites: - btrfs_block_rsv_refill() in merge_reloc_root() merge_reloc_root() is a pretty critical section, we don't want to be interrupted by signal, so change the flush status to BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal. Since such change can be ENPSPC-prone, also shrink the amount of metadata to reserve least amount avoid deadly ENOSPC there. - btrfs_block_rsv_refill() in reserve_metadata_space() It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted by signal. - btrfs_block_rsv_refill() in prepare_to_relocate() - btrfs_block_rsv_add() in prepare_to_relocate() - btrfs_block_rsv_refill() in relocate_block_group() - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster() - btrfs_start_transaction() in relocate_block_group() - btrfs_start_transaction() in create_reloc_inode() Can be interrupted by fatal signal and we can handle it easily. For these call sites, just catch the -EINTR value in btrfs_balance() and count them as canceled. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
f3e3d9cc35 |
btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree
[BUG]
There is a bug report about bad signal timing could lead to read-only
fs during balance:
BTRFS info (device xvdb): balance: start -d -m -s
BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
BTRFS info (device xvdb): found 12236 extents, stage: move data extents
BTRFS info (device xvdb): relocating block group 71928119296 flags data
BTRFS info (device xvdb): found 3 extents, stage: move data extents
BTRFS info (device xvdb): found 3 extents, stage: update data pointers
BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
BTRFS info (device xvdb): forced readonly
BTRFS info (device xvdb): balance: ended with status: -4
[CAUSE]
The direct cause is the -EINTR from the following call chain when a
fatal signal is pending:
relocate_block_group()
|- clean_dirty_subvols()
|- btrfs_drop_snapshot()
|- btrfs_start_transaction()
|- btrfs_delayed_refs_rsv_refill()
|- btrfs_reserve_metadata_bytes()
|- __reserve_metadata_bytes()
|- wait_reserve_ticket()
|- prepare_to_wait_event();
|- ticket->error = -EINTR;
Normally this behavior is fine for most btrfs_start_transaction()
callers, as they need to catch any other error, same for the signal, and
exit ASAP.
However for balance, especially for the clean_dirty_subvols() case, we're
already doing cleanup works, getting -EINTR from btrfs_drop_snapshot()
could cause a lot of unexpected problems.
From the mentioned forced read-only report, to later balance error due
to half dropped reloc trees.
[FIX]
Fix this problem by using btrfs_join_transaction() if
btrfs_drop_snapshot() is called from relocation context.
Since btrfs_join_transaction() won't get interrupted by signal, we can
continue the cleanup.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>3
Signed-off-by: David Sterba <dsterba@suse.com>
|
||
|
|
5cb502f4ab |
btrfs: relocation: allow signal to cancel balance
Although btrfs balance can be canceled with "btrfs balance cancel" command, it's still almost muscle memory to press Ctrl-C to cancel a long running btrfs balance. So allow btrfs balance to check signal to determine if it should exit. The cancellation points are in known location and we're only adding one more reason, so this should be safe. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
813f8a0e26 |
btrfs: raid56: remove out label in __raid56_parity_recover
There's no cleanup that occurs so we can simply return 0 directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
f37c563bab |
btrfs: add missing check for nocow and compression inode flags
User Forza reported on IRC that some invalid combinations of file attributes are accepted by chattr. The NODATACOW and compression file flags/attributes are mutually exclusive, but they could be set by 'chattr +c +C' on an empty file. The nodatacow will be in effect because it's checked first in btrfs_run_delalloc_range. Extend the flag validation to catch the following cases: - input flags are conflicting - old and new flags are conflicting - initialize the local variable with inode flags after inode ls locked Inode attributes take precedence over mount options and are an independent setting. Nocompress would be a no-op with nodatacow, but we don't want to mix any compression-related options with nodatacow. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
4faf55b038 |
btrfs: don't traverse into the seed devices in show_devname
->show_devname currently shows the lowest devid in the list. As the seed
devices have the lowest devid in the sprouted filesystem, the userland
tool such as findmnt end up seeing seed device instead of the device from
the read-writable sprouted filesystem. As shown below.
mount /dev/sda /btrfs
mount: /btrfs: WARNING: device write-protected, mounted read-only.
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
btrfs dev add -f /dev/sdb /btrfs
umount /btrfs
mount /dev/sdb /btrfs
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
All sprouts from a single seed will show the same seed device and the
same fsid. That's confusing.
This is causing problems in our prototype as there isn't any reference
to the sprout file-system(s) which is being used for actual read and
write.
This was added in the patch which implemented the show_devname in btrfs
commit
|
||
|
|
a3cf0e4342 |
btrfs: qgroup: free per-trans reserved space when a subvolume gets dropped
[BUG] Sometime fsstress could lead to qgroup warning for case like generic/013: BTRFS warning (device dm-3): qgroup 0/259 has unreleased space, type 1 rsv 81920 ------------[ cut here ]------------ WARNING: CPU: 9 PID: 24535 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 6c341cdf9b6cc3c1 ]--- BTRFS error (device dm-3): qgroup reserved space leaked While that subvolume 259 is no longer in that filesystem. [CAUSE] Normally per-trans qgroup reserved space is freed when a transaction is committed, in commit_fs_roots(). However for completely dropped subvolume, that subvolume is completely gone, thus is no longer in the fs_roots_radix, and its per-trans reserved qgroup will never be freed. Since the subvolume is already gone, leaked per-trans space won't cause any trouble for end users. [FIX] Just call btrfs_qgroup_free_meta_all_pertrans() before a subvolume is completely dropped. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
d60ba8de11 |
btrfs: ref-verify: fix memory leak in add_block_entry
clang static analysis flags this error
fs/btrfs/ref-verify.c:290:3: warning: Potential leak of memory pointed to by 're' [unix.Malloc]
kfree(be);
^~~~~
The problem is in this block of code:
if (root_objectid) {
struct root_entry *exist_re;
exist_re = insert_root_entry(&exist->roots, re);
if (exist_re)
kfree(re);
}
There is no 'else' block freeing when root_objectid is 0. Add the
missing kfree to the else branch.
Fixes:
|
||
|
|
d85327b1d8 |
btrfs: prefetch chunk tree leaves at mount
The whole chunk tree is read at mount time so we can utilize readahead to get the tree blocks to memory before we read the items. The idea is from Robbie, but instead of updating search slot readahead, this patch implements the chunk tree readahead manually from nodes on level 1. We've decided to do specific readahead optimizations and then unify them under a common API so we don't break everything by changing the search slot readahead logic. Higher chunk trees grow on large filesystems (many terabytes), and prefetching just level 1 seems to be sufficient. Provided example was from a 200TiB filesystem with chunk tree level 2. CC: Robbie Ko <robbieko@synology.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
49bac89768 |
btrfs: add metadata_uuid to FS_INFO ioctl
Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl. This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in btrfs_ioctl_fs_info_args::flags. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
0fb408a558 |
btrfs: add filesystem generation to FS_INFO ioctl
Add retrieval of the filesystem's generation to the fsinfo ioctl. This is driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in btrfs_ioctl_fs_info_args::flags. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
|
|
137c541821 |
btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl
With the recent addition of filesystem checksum types other than CRC32c,
it is not anymore hard-coded which checksum type a btrfs filesystem uses.
Up to now there is no good way to read the filesystem checksum, apart from
reading the filesystem UUID and then query sysfs for the checksum type.
Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
command which usually is used to query filesystem features. Also add a
flags member indicating that the kernel responded with a set csum_type and
csum_size field.
For compatibility reasons, only return the csum_type and csum_size if
the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
clear any unknown flags so we don't pass false positives to user-space
newer than the kernel.
To simplify further additions to the ioctl, also switch the padding to a
u8 array. Pahole was used to verify the result of this switch:
The csum members are added before flags, which might look odd, but this
is to keep the alignment requirements and not to introduce holes in the
structure.
$ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
struct btrfs_ioctl_fs_info_args {
__u64 max_id; /* 0 8 */
__u64 num_devices; /* 8 8 */
__u8 fsid[16]; /* 16 16 */
__u32 nodesize; /* 32 4 */
__u32 sectorsize; /* 36 4 */
__u32 clone_alignment; /* 40 4 */
__u16 csum_type; /* 44 2 */
__u16 csum_size; /* 46 2 */
__u64 flags; /* 48 8 */
__u8 reserved[968]; /* 56 968 */
/* size: 1024, cachelines: 16, members: 10 */
};
Fixes:
|