linux

mirror of https://github.com/torvalds/linux.git synced 2025-12-07 20:06:24 +00:00

Author	SHA1	Message	Date
Linus Torvalds	c9cfc122f0	Merge tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux	2025-11-04 14:25:38 +09:00

    Pull btrfs fixes from David Sterba:

     - fix memory leak in qgroup relation ioctl when qgroup levels are invalid
     - don't write back dirty metadata on filesystem with errors
     - properly log renamed links
     - properly mark prealloc extent range beyond inode size as dirty (when no-holes is not enabled)

    * tag 'for-6.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
      btrfs: mark dirty extent range for out of bound prealloc extents
      btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name
      btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation
      btrfs: ensure no dirty metadata is written back for an fs with errors

austinchang	3b1a4a59a2	btrfs: mark dirty extent range for out of bound prealloc extents	2025-10-30 19:18:18 +01:00

    In btrfs_fallocate(), when the allocated range overlaps with a prealloc extent and the extent starts after i_size, the range doesn't get marked dirty in file_extent_tree. This results in persisting an incorrect disk_i_size for the inode when not using the no-holes feature.

    This is reproducible since commit `41a2ee75aa` ("btrfs: introduce per-inode file extent tree"), then became hidden since commit `3d7db6e8bd` ("btrfs: don't allocate file extent tree for non regular files") and then visible again after commit `8679d2687c` ("btrfs: initialize inode::file_extent_tree after i_mode has been set"), which fixes the previous commit.

    The following reproducer triggers the problem:

      $ cat test.sh
      MNT=/mnt/test
      DEV=/dev/vdb

      mkdir -p $MNT
      mkfs.btrfs -f -O ^no-holes $DEV
      mount $DEV $MNT

      touch $MNT/file1
      fallocate -n -o 1M -l 2M $MNT/file1

      umount $MNT
      mount $DEV $MNT

      len=$((1 * 1024 * 1024))
      fallocate -o 1M -l $len $MNT/file1
      du --bytes $MNT/file1

      umount $MNT
      mount $DEV $MNT

      du --bytes $MNT/file1
      umount $MNT

    Running the reproducer gives the following result:

      $ ./test.sh
      (...)
      2097152 /mnt/test/file1
      1048576 /mnt/test/file1

    The difference is exactly 1048576 as we assigned.

    Fix by adding a call to btrfs_inode_set_file_extent_range() in btrfs_fallocate_update_isize().

    Fixes: `41a2ee75aa` ("btrfs: introduce per-inode file extent tree")
    Signed-off-by: austinchang <austinchang@synology.com>
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana	953902e4fb	btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name	2025-10-30 19:17:33 +01:00

    If we are logging a new name make sure our inode has the runtime flag BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find new inode refs/extrefs in the subvolume tree and copy them into the log tree.

    We are currently doing it when adding a new link but we are missing it when renaming.

    An example where this makes a new name not persisted:

      1) create symlink with name foo in directory A
      2) fsync directory A, which persists the symlink
      3) rename the symlink from foo to bar
      4) fsync directory A to persist the new symlink name

    Step 4 isn't working correctly as it's not logging the new name and also leaving the old inode ref in the log tree, so after a power failure the symlink still has the old name of "foo". This is because when we first fsync directory A we log the symlink's inode (as it's a new entry) and at btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because we are using that mode and the inode has the runtime flag BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode, during the rename through the call to btrfs_log_new_name() (calling btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search the subvolume tree for new refs/extrefs and jump directly to the 'log_extents' label.

    Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode when we are about to log a new name. A test case for fstests will follow soon.

    Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
    Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/
    Reviewed-by: Boris Burkov <boris@bur.io>
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>

Shardul Bankar	f260c6aff0	btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation	2025-10-30 19:16:06 +01:00

    When btrfs_add_qgroup_relation() is called with invalid qgroup levels (src >= dst), the function returns -EINVAL directly without freeing the preallocated qgroup_list structure passed by the caller. This causes a memory leak because the caller unconditionally sets the pointer to NULL after the call, preventing any cleanup.

    The issue occurs because the level validation check happens before the mutex is acquired and before any error handling path that would free the prealloc pointer. On this early return, the cleanup code at the 'out' label (which includes kfree(prealloc)) is never reached.

    In btrfs_ioctl_qgroup_assign(), the code pattern is:

      prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL);
      ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc);
      prealloc = NULL;  // Always set to NULL regardless of return value
      ...
      kfree(prealloc);  // This becomes kfree(NULL), does nothing

    When the level check fails, 'prealloc' is never freed by either the callee or the caller, resulting in a 64-byte memory leak per failed operation. This can be triggered repeatedly by an unprivileged user with access to a writable btrfs mount, potentially exhausting kernel memory.

    Fix this by freeing prealloc before the early return, ensuring prealloc is always freed on all error paths.

    Fixes: `4addc1ffd6` ("btrfs: qgroup: preallocate memory before adding a relation")
    Reviewed-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
    Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo	2618849f31	btrfs: ensure no dirty metadata is written back for an fs with errors	2025-10-30 19:16:01 +01:00

    [BUG]
    During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers().

    It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free.

    [CAUSE]
    If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state.

    But there are some metadata modifications before that error, and they are still in the btree inode page cache.

    Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by the invalidate_inode_pages2() call inside close_ctree(), because they are dirty.

    And finally after btrfs_stop_all_workers(), we call iput() on the btree inode, which triggers writeback of those dirty metadata.

    And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing a warning from queue_work() and a use-after-free.

    [FIX]
    Add special handling to write_one_eb(): if the fs is already in an error state, immediately mark the bbio as failed instead of really submitting it.

    Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues.

    The extra discard in write_one_eb() also acts as an extra safety net. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since the extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupt the fs.

    CC: stable@vger.kernel.org # 6.6+
    Reviewed-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: Qu Wenruo <wqu@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>

5 changed files with 24 additions and 2 deletions

									
fs/btrfs/extent_io.c (8 lines changed)

@@ -2228,6 +2228,14 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
 		wbc_account_cgroup_owner(wbc, folio, range_len);
 		folio_unlock(folio);
 	}
+	/*
+	 * If the fs is already in error status, do not submit any writeback
+	 * but immediately finish it.
+	 */
+	if (unlikely(BTRFS_FS_ERROR(fs_info))) {
+		btrfs_bio_end_io(bbio, errno_to_blk_status(BTRFS_FS_ERROR(fs_info)));
+		return;
+	}
 	btrfs_submit_bbio(bbio, 0);
 }
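
The hunk above completes the I/O with an error instead of submitting it once the filesystem is flagged as failed. A minimal userspace sketch of that fail-fast pattern follows; `struct fs_sketch`, `struct bio_sketch` and `write_one_block()` are hypothetical stand-ins for illustration, not kernel types or APIs.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the filesystem state; not struct btrfs_fs_info. */
struct fs_sketch {
	int fs_error;        /* 0 when healthy, negative errno once the fs has failed */
	int workers_running; /* 0 after the worker queues have been stopped */
	int submitted;       /* number of writes actually handed to the workers */
};

/* Hypothetical stand-in for a bio; records how the write finished. */
struct bio_sketch {
	int status;
	int completed;
};

static void bio_end_io_sketch(struct bio_sketch *bio, int status)
{
	bio->status = status;
	bio->completed = 1;
}

/*
 * Fail fast: on an errored fs, finish the write with the stored error and
 * never reach the submission path, so stopped workers are never touched.
 */
static void write_one_block(struct fs_sketch *fs, struct bio_sketch *bio)
{
	if (fs->fs_error) {
		bio_end_io_sketch(bio, fs->fs_error);
		return;
	}
	/* Submitting after the workers are gone is the bug being avoided. */
	assert(fs->workers_running);
	fs->submitted++;
	bio_end_io_sketch(bio, 0);
}
```

In this sketch a write issued after shutdown on a failed filesystem still completes, so the caller can release the dirty block, but nothing is queued.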

									
fs/btrfs/file.c (10 lines changed)

@@ -2854,12 +2854,22 @@ static int btrfs_fallocate_update_isize(struct inode *inode,
 {
 	struct btrfs_trans_handle *trans;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	u64 range_start;
+	u64 range_end;
 	int ret;
 	int ret2;
 	if (mode & FALLOC_FL_KEEP_SIZE || end <= i_size_read(inode))
 		return 0;
+	range_start = round_down(i_size_read(inode), root->fs_info->sectorsize);
+	range_end = round_up(end, root->fs_info->sectorsize);
+	ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), range_start,
+						range_end - range_start);
+	if (ret)
+		return ret;
 	trans = btrfs_start_transaction(root, 1);
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
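
The range passed to btrfs_inode_set_file_extent_range() in the hunk above is the span from the old i_size to the new end, widened to sector boundaries. A small sketch of that arithmetic, assuming a power-of-two sector size; the helper names and `struct dirty_range` are hypothetical and only mirror the round_down()/round_up() calls in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the byte range to mark in the file extent tree. */
struct dirty_range {
	uint64_t start;
	uint64_t len;
};

/* Round x down to a multiple of align (align must be a power of two). */
static uint64_t round_down_pow2(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Compute the sector-aligned range covering [i_size, end). */
static struct dirty_range fallocate_dirty_range(uint64_t i_size, uint64_t end,
						uint64_t sectorsize)
{
	struct dirty_range r;

	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	assert(end > i_size); /* the kernel function returns early otherwise */

	r.start = round_down_pow2(i_size, sectorsize);
	r.len = round_up_pow2(end, sectorsize) - r.start;
	return r;
}
```

For the second fallocate call in the reproducer, assuming the file size is still 0 at that point (the first call used -n) and a 4096-byte sector size, `fallocate_dirty_range(0, 2097152, 4096)` yields start 0 and length 2097152.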

									
fs/btrfs/inode.c (1 line changed)

@@ -6873,7 +6873,6 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
 	BTRFS_I(inode)->dir_index = 0ULL;
 	inode_inc_iversion(inode);
 	inode_set_ctime_current(inode);
-	set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);
 	ret = btrfs_add_link(trans, BTRFS_I(dir), BTRFS_I(inode),
 			     &fname.disk_name, 1, index);

									
fs/btrfs/qgroup.c (4 lines changed)

@@ -1539,8 +1539,10 @@ int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst
 	ASSERT(prealloc);
 	/* Check the level of src and dst first */
-	if (btrfs_qgroup_level(src) >= btrfs_qgroup_level(dst))
+	if (btrfs_qgroup_level(src) >= btrfs_qgroup_level(dst)) {
+		kfree(prealloc);
 		return -EINVAL;
+	}
 	mutex_lock(&fs_info->qgroup_ioctl_lock);
 	if (!fs_info->quota_root) {
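
The fix relies on an ownership rule: once the caller hands over prealloc it forgets the pointer, so the callee has to release it on every path, including the early -EINVAL return. A userspace sketch of that rule with an allocation counter to make a leak visible; all names here are hypothetical, and the level extraction (top 16 bits of the 64-bit qgroup id) is stated as an assumption about the id layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the preallocated list entry. */
struct qgroup_list_sketch {
	uint64_t src;
	uint64_t dst;
};

static int live_allocs; /* outstanding allocations; nonzero at the end means a leak */

static struct qgroup_list_sketch *prealloc_new(void)
{
	struct qgroup_list_sketch *p = calloc(1, sizeof(*p));

	if (p)
		live_allocs++;
	return p;
}

static void prealloc_free(struct qgroup_list_sketch *p)
{
	if (!p)
		return; /* like kfree(NULL): does nothing */
	free(p);
	live_allocs--;
}

/* Assumption: the level is stored in the top 16 bits of the qgroup id. */
static unsigned int qgroup_level_sketch(uint64_t qgroupid)
{
	return (unsigned int)(qgroupid >> 48);
}

/* Callee owns prealloc from here on and must release it on all paths. */
static int add_relation_sketch(uint64_t src, uint64_t dst,
			       struct qgroup_list_sketch *prealloc)
{
	assert(prealloc);
	if (qgroup_level_sketch(src) >= qgroup_level_sketch(dst)) {
		prealloc_free(prealloc); /* the fix: free before the early return */
		return -EINVAL;
	}
	prealloc->src = src;
	prealloc->dst = dst;
	/* Real code would link the entry into the qgroup lists; the sketch drops it. */
	prealloc_free(prealloc);
	return 0;
}

/* Caller mirrors the ioctl pattern: allocate, hand over, forget the pointer. */
static int assign_sketch(uint64_t src, uint64_t dst)
{
	struct qgroup_list_sketch *prealloc = prealloc_new();
	int ret;

	if (!prealloc)
		return -ENOMEM;
	ret = add_relation_sketch(src, dst, prealloc);
	prealloc = NULL;         /* ownership moved to the callee */
	prealloc_free(prealloc); /* no-op on NULL */
	return ret;
}
```

Removing the `prealloc_free()` call in the level-check branch leaves `live_allocs` at 1 after a failed call, which is the leak described in commit f260c6aff0.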

									
fs/btrfs/tree-log.c (3 lines changed)

@@ -7910,6 +7910,9 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 	bool log_pinned = false;
 	int ret;
+	/* The inode has a new name (ref/extref), so make sure we log it. */
+	set_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags);
 	btrfs_init_log_ctx(&ctx, inode);
 	ctx.logging_new_name = true;
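
The effect of setting the flag before logging can be followed with a toy model of the flag transitions described in commit 953902e4fb. This is a deliberately simplified sketch, not the kernel's btrfs_log_inode() logic: the flag names, `struct inode_sketch` and both functions are hypothetical, and the only behaviour modelled is whether a log pass scans for new refs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical runtime flags, modelled as a plain bitmask. */
enum {
	SKETCH_NEEDS_FULL_SYNC = 1u << 0,
	SKETCH_COPY_EVERYTHING = 1u << 1,
};

struct inode_sketch {
	unsigned int runtime_flags;
};

/*
 * Toy log pass. Returns true when it would scan the subvolume tree for
 * new refs/extrefs, false when it would skip straight to the extents.
 */
static bool log_inode_sketch(struct inode_sketch *inode)
{
	assert(inode);
	if (inode->runtime_flags & SKETCH_NEEDS_FULL_SYNC) {
		/* A full log clears both flags once it is done. */
		inode->runtime_flags &= ~(SKETCH_NEEDS_FULL_SYNC |
					  SKETCH_COPY_EVERYTHING);
		return true;
	}
	if (inode->runtime_flags & SKETCH_COPY_EVERYTHING) {
		inode->runtime_flags &= ~SKETCH_COPY_EVERYTHING;
		return true;
	}
	return false;
}

/* Toy rename path with the fix applied: set the flag, then log. */
static bool log_new_name_sketch(struct inode_sketch *inode)
{
	assert(inode);
	inode->runtime_flags |= SKETCH_COPY_EVERYTHING;
	return log_inode_sketch(inode);
}
```

In this model the first log of a new inode scans and clears both flags, a second plain log pass skips the scan, and the rename path scans again because it sets the flag first.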

Compare commits

5 Commits

8bb886cb8f ... c9cfc122f0
