linux

mirror of https://github.com/torvalds/linux.git synced 2025-12-07 20:06:24 +00:00

Author	SHA1	Message	Date
Al Viro	90006f21b7	do_lock_mount(): don't modify path. Currently do_lock_mount() has the target path switched to whatever might be overmounting it. We _do_ want to have the parent mount/mountpoint chosen on top of the overmounting pile; however, the way it's done has unpleasant races - if umount propagation removes the overmount while we'd been trying to set the environment up, we might end up failing if our target path strays into that overmount just before the overmount gets kicked out. Users of do_lock_mount() do not need the target path changed - they have all information in res->{parent,mp}; only one place (in do_move_mount()) currently uses the resulting path->mnt, and that value is trivial to reconstruct by the original value of path->mnt + chosen parent mount. Let's keep the target path unchanged; it avoids a bunch of subtle races and it's not hard to do: do as mount_locked_reader find the prospective parent mount/mountpoint dentry grab references if it's not the original target lock the prospective mountpoint dentry take namespace_sem exclusive if prospective parent/mountpoint would be different now err = -EAGAIN else if location has been unmounted err = -ENOENT else if mountpoint dentry is not allowed to be mounted on err = -ENOENT else if beneath and the top of the pile was the absolute root err = -EINVAL else try to get struct mountpoint (by dentry), set err to 0 on success and -ENO{MEM,ENT} on failure if err != 0 res->parent = ERR_PTR(err) drop locks else res->parent = prospective parent drop temporary references while err == -EAGAIN A somewhat subtle part is that dropping temporary references is allowed. Neither mounts nor dentries should be evicted by a thread that holds namespace_sem. On success we are dropping those references under namespace_sem, so we need to be sure that these are not the last references remaining. However, on success we'd already verified (under namespace_sem) that original target is still mounted and that mount and dentry we are about to drop are still reachable from it via the mount tree. That guarantees that we are not about to drop the last remaining references. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:42 -04:00
Al Viro	25423edc78	new helper: topmost_overmount() Returns the final (topmost) mount in the chain of overmounts starting at given mount. Same locking rules as for any mount tree traversal - either the spinlock side of mount_lock, or rcu + sample the seqcount side of mount_lock before the call and recheck afterwards. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:05 -04:00
Al Viro	ed8ba4aad7	don't bother passing new_path->dentry to can_move_mount_beneath() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:05 -04:00
Al Viro	a2bdb7d8dc	pivot_root(2): use old_mp.mp->m_dentry instead of old.dentry That kills the last place where callers of lock_mount(path, &mp) used path->dentry. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:05 -04:00
Al Viro	6bfb6938e2	graft_tree(), attach_recursive_mnt() - pass pinned_mountpoint parent and mountpoint always come from the same struct pinned_mountpoint now. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:05 -04:00
Al Viro	ef307f89bf	do_add_mount(): switch to passing pinned_mountpoint instead of mountpoint + path Both callers pass it a mountpoint reference picked from pinned_mountpoint and path it corresponds to. First of all, path->dentry is equal to mp.mp->m_dentry. Furthermore, path->mnt is &mp.parent->mnt, making struct path contents redundant. Pass it the address of that pinned_mountpoint instead; what's more, if we teach it to treat ERR_PTR(error) in ->parent as "bail out with that error" we can simplify the callers even more - do_add_mount() will do the right thing even when called after lock_mount() failure. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:05 -04:00
Al Viro	842e12352c	do_move_mount(): use the parent mount returned by do_lock_mount() After successful do_lock_mount() call, mp.parent is set to either real_mount(path->mnt) (for !beneath case) or to ->mnt_parent of that (for beneath). p is set to real_mount(path->mnt) and after several uses it's made equal to mp.parent. All uses prior to that care only about p->mnt_ns and since p->mnt_ns == parent->mnt_ns, we might as well use mp.parent all along. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:26:05 -04:00
Al Viro	2010464cfa	change calling conventions for lock_mount() et.al. 1) pinned_mountpoint gets a new member - struct mount parent. Set only if we locked the sucker; ERR_PTR() - on failed attempt. 2) do_lock_mount() et.al. return void and set ->parent to on success with !beneath - mount corresponding to path->mnt * on success with beneath - the parent of mount corresponding to path->mnt * in case of error - ERR_PTR(-E...). IOW, we get the mount we will be actually mounting upon or ERR_PTR(). 3) we can't use CLASS, since the pinned_mountpoint is placed on hlist during initialization, so we define local macros: LOCK_MOUNT(mp, path) LOCK_MOUNT_MAYBE_BENEATH(mp, path, beneath) LOCK_MOUNT_EXACT(mp, path) All of them declare and initialize struct pinned_mountpoint mp, with unlock_mount done via __cleanup(). Users converted. [ lock_mount() is unused now; removed. Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> ] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-15 21:23:59 -04:00
Al Viro	b28f9eba12	change the calling conventions for vfs_parse_fs_string() Absolute majority of callers are passing the 4th argument equal to strlen() of the 3rd one. Drop the v_size argument, add vfs_parse_fs_qstr() for the cases that want independent length. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-04 15:20:51 -04:00
Al Viro	f1f486b841	finish_automount(): use __free() to deal with dropping mnt on failure same story as with do_new_mount_fc(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:59 -04:00
Al Viro	308a022f41	do_new_mount_fc(): use __free() to deal with dropping mnt on failure do_add_mount() consumes vfsmount on success; just follow it with conditional retain_and_null_ptr() on success and we can switch to __free() for mnt and be done with that - unlock_mount() is in the very end. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:59 -04:00
Al Viro	9bf5d48852	finish_automount(): take the lock_mount() analogue into a helper finish_automount() can't use lock_mount() - it treats finding something already mounted as "quitely drop our mount and return 0", not as "mount on top of whatever mounted there". It's been open-coded; let's take it into a helper similar to lock_mount(). "something's already mounted" => -EBUSY, finish_automount() needs to distinguish it from the normal case and it can't happen in other failure cases. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:59 -04:00
Al Viro	6bbbc4a04a	pivot_root(2): use __free() to deal with struct path in it preparations for making unlock_mount() a __cleanup(); can't have path_put() inside mount_lock scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:58 -04:00
Al Viro	76dfde13d6	do_loopback(): use __free(path_put) to deal with old_path preparations for making unlock_mount() a __cleanup(); can't have path_put() inside mount_lock scope. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:58 -04:00
Al Viro	11941610b0	finish_automount(): simplify the ELOOP check It's enough to check that dentries match; if path->dentry is equal to m->mnt_root, superblocks will match as well. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:58 -04:00
Al Viro	d29da1a8f1	move_mount(2): take sanity checks in 'beneath' case into do_lock_mount() We want to mount beneath the given location. For that operation to make sense, location must be the root of some mount that has something under it. Currently we let it proceed if those requirements are not met, with rather meaningless results, and have that bogosity caught further down the road; let's fail early instead - do_lock_mount() doesn't make sense unless those conditions hold, and checking them there makes things simpler. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:58 -04:00
Al Viro	c1ab70be88	do_move_mount(): deal with the checks on old_path early 1) checking that location we want to move does point to root of some mount can be done before anything else; that property is not going to change and having it already verified simplifies the analysis. 2) checking the type agreement between what we are trying to move and what we are trying to move it onto also belongs in the very beginning - do_lock_mount() might end up switching new_path to something that overmounts the original location, but... the same type agreement applies to overmounts, so we could just as well check against the original location. 3) since we know that old_path->dentry is the root of old_path->mnt, there's no point bothering with path_is_overmounted() in can_move_mount_beneath(); it's simply a check for the mount we are trying to move having non-NULL ->overmount. And with that, we can switch can_move_mount_beneath() to taking old instead of old_path, leaving no uses of old_path past the original checks. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:58 -04:00
Al Viro	a666bbcf7e	do_move_mount(): trim local variables Both 'parent' and 'ns' are used at most once, no point precalculating those... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:58 -04:00
Al Viro	5423426a79	switch do_new_mount_fc() to fc_mount() Prior to the call of do_new_mount_fc() the caller has just done successful vfs_get_tree(). Then do_new_mount_fc() does several checks on resulting superblock, and either does fc_drop_locked() and returns an error or proceeds to unlock the superblock and call vfs_create_mount(). The thing is, there's no reason to delay that unlock + vfs_create_mount() - the tests do not rely upon the state of ->s_umount and fc_drop_locked() put_fs_context() is equivalent to unlock ->s_umount put_fs_context() Doing vfs_create_mount() before the checks allows us to move vfs_get_tree() from caller to do_new_mount_fc() and collapse it with vfs_create_mount() into an fc_mount() call. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:58 -04:00
Al Viro	8281f98a68	current_chrooted(): use guards here a use of __free(path_put) for dropping fs_root is enough to make guard(mount_locked_reader) fit... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:58 -04:00
Al Viro	6b6516c56b	current_chrooted(): don't bother with follow_down_one() All we need here is to follow ->overmount on root mount of namespace... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:57 -04:00
Al Viro	2aec880c1c	path_is_under(): use guards ... and document that locking requirements for is_path_reachable(). There is one questionable caller in do_listmount() where we are not holding mount_lock and might not have the first argument mounted. However, in that case it will immediately return true without having to look at the ancestors. Might be cleaner to move the check into non-LSTM_ROOT case which it really belongs in - there the check is not always true and is_mounted() is guaranteed. Document the locking environments for is_path_reachable() callers: get_peer_under_root() get_dominating_id() do_statmount() do_listmount() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:57 -04:00
Al Viro	2605d86843	mnt_set_expiry(): use guards The reason why it needs only mount_locked_reader is that there's no lockless accesses of expiry lists. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:57 -04:00
Al Viro	f80b84358f	has_locked_children(): use guards ... and document the locking requirements of __has_locked_children() Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:57 -04:00
Al Viro	6b448d7a7c	check_for_nsfs_mounts(): no need to take locks Currently we are taking mount_writer; what that function needs is either mount_locked_reader (we are not changing anything, we just want to iterate through the subtree) or namespace_shared and a reference held by caller on the root of subtree - that's also enough to stabilize the topology. The thing is, all callers are already holding at least namespace_shared as well as a reference to the root of subtree. Let's make the callers provide locking warranties - don't mess with mount_lock in check_for_nsfs_mounts() itself and document the locking requirements. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:57 -04:00
Al Viro	747e91e5b7	mnt_already_visible(): use guards clean fit; namespace_shared due to iterating through ns->mounts. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:57 -04:00
Al Viro	61e68af33a	put_mnt_ns(): use guards clean fit; guards can't be weaker due to umount_tree() call. Setting emptied_ns requires namespace_excl, but not anything mount_lock-related. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:56 -04:00
Al Viro	550dda45df	mark_mounts_for_expiry(): use guards Clean fit; guards can't be weaker due to umount_tree() calls. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:56 -04:00
Al Viro	7b99ee2c5c	do_set_group(): use guards clean fit; namespace_excl to modify propagation graph Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:56 -04:00
Al Viro	12cdd1af7a	do_change_type(): use guards clean fit; namespace_excl to modify propagation graph Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:56 -04:00
Al Viro	4151c3cc58	__is_local_mountpoint(): use guards clean fit; namespace_shared due to iterating through ns->mounts. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:56 -04:00
Al Viro	902e990467	__detach_mounts(): use guards Clean fit for guards use; guards can't be weaker due to umount_tree() calls. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:56 -04:00
Al Viro	547af12dcd	fs/namespace.c: allow to drop vfsmount references via __free(mntput) Note that just as path_put, it should never be done in scope of namespace_sem, be it shared or exclusive. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:56 -04:00
Al Viro	360600f8ec	fs/namespace.c: fix the namespace_sem guard mess If anything, namespace_lock should be DEFINE_LOCK_GUARD_0, not DEFINE_GUARD. That way we * do not need to feed it a bogus argument * do not get gcc trying to compare an address of static in file variable with -4097 - and, if we are unlucky, trying to keep it in a register, with spills and all such. The same problems apply to grabbing namespace_sem shared. Rename it to namespace_excl, add namespace_shared, convert the existing users: guard(namespace_lock, &namespace_sem) => guard(namespace_excl)() guard(rwsem_read, &namespace_sem) => guard(namespace_shared)() scoped_guard(namespace_lock, &namespace_sem) => scoped_guard(namespace_excl) scoped_guard(rwsem_read, &namespace_sem) => scoped_guard(namespace_shared) Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-09-02 19:35:56 -04:00
Simon Schuster	edd3cb05c0	copy_process: pass clone_flags as u64 across calltree With the introduction of clone3 in commit `7f192e3cd3` ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit, with a new type of u64 for the flags. However, for most consumers of clone_flags the interface was not changed from the previous type of unsigned long. While this works fine as long as none of the new 64-bit flag bits (CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still undesirable in terms of the principle of least surprise. Thus, this commit fixes all relevant interfaces of callees to sys_clone3/copy_process (excluding the architecture-specific copy_thread) to consistently pass clone_flags as u64, so that no truncation to 32-bit integers occurs on 32-bit architectures. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-09-01 15:31:34 +02:00
Guopeng Zhang	41a86f6242	fs: fix indentation style Replace 8 leading spaces with a tab to follow kernel coding style. Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Link: https://lore.kernel.org/20250820133424.1667467-1-zhangguopeng@kylinos.cn Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-21 10:27:05 +02:00
Lichen Liu	278033a225	fs: Add 'initramfs_options' to set initramfs mount options When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs. By default, a tmpfs mount is limited to using 50% of the available RAM for its content. This can be problematic in memory-constrained environments, particularly during a kdump capture. In a kdump scenario, the capture kernel boots with a limited amount of memory specified by the 'crashkernel' parameter. If the initramfs is large, it may fail to unpack into the tmpfs rootfs due to insufficient space. This is because to get X MB of usable space in tmpfs, 2*X MB of memory must be available for the mount. This leads to an OOM failure during the early boot process, preventing a successful crash dump. This patch introduces a new kernel command-line parameter, initramfs_options, which allows passing specific mount options directly to the rootfs when it is first mounted. This gives users control over the rootfs behavior. For example, a user can now specify initramfs_options=size=75% to allow the tmpfs to use up to 75% of the available memory. This can significantly reduce the memory pressure for kdump. Consider a practical example: To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With the default 50% limit, this requires a memory pool of 96MB to be available for the tmpfs mount. The total memory requirement is therefore approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB. By using initramfs_options=size=75%, the memory pool required for the 48MB tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a smaller crashkernel size, such as 192MB. An alternative approach of reusing the existing rootflags parameter was considered. However, a new, dedicated initramfs_options parameter was chosen to avoid altering the current behavior of rootflags (which applies to the final root filesystem) and to prevent any potential regressions. Also add documentation for the new kernel parameter "initramfs_options" This approach is inspired by prior discussions and patches on the topic. Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128 Ref: https://landley.net/notes-2015.html#01-01-2015 Ref: https://lkml.org/lkml/2021/6/29/783 Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs Signed-off-by: Lichen Liu <lichliu@redhat.com> Link: https://lore.kernel.org/20250815121459.3391223-1-lichliu@redhat.com Tested-by: Rob Landley <rob@landley.net> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-21 10:23:48 +02:00
Linus Torvalds	b19a97d57c	Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount fixes from Al Viro: "Fixes for several recent mount-related regressions" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: change_mnt_propagation(): calculate propagation source only if we'll need it use uniform permission checks for all mount propagation changes propagate_umount(): only surviving overmounts should be reparented fix the softlockups in attach_recursive_mnt()	2025-08-19 10:12:10 -07:00
Al Viro	cffd044187	use uniform permission checks for all mount propagation changes do_change_type() and do_set_group() are operating on different aspects of the same thing - propagation graph. The latter asks for mounts involved to be mounted in namespace(s) the caller has CAP_SYS_ADMIN for. The former is a mess - originally it didn't even check that mount is mounted. That got fixed, but the resulting check turns out to be too strict for userland - in effect, we check that mount is in our namespace, having already checked that we have CAP_SYS_ADMIN there. What we really need (in both cases) is * only touch mounts that are mounted. That's a must-have constraint - data corruption happens if it get violated. * don't allow to mess with a namespace unless you already have enough permissions to do so (i.e. CAP_SYS_ADMIN in its userns). That's an equivalent of what do_set_group() does; let's extract that into a helper (may_change_propagation()) and use it in both do_set_group() and do_change_type(). Fixes: `12f147ddd6` "do_change_type(): refuse to operate on unmounted/not ours mounts" Acked-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Tested-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-08-19 12:03:23 -04:00
Al Viro	0ddfb62f5d	fix the softlockups in attach_recursive_mnt() In case when we mounting something on top of a large stack of overmounts, all of them being peers of each other, we get quadratic time by the depth of overmount stack. Easily fixed by doing commit_tree() before reparenting the overmount; simplifies commit_tree() as well - it doesn't need to skip the already mounted stuff that had been reparented on top of the new mounts. Since we are holding mount_lock through both reparenting and call of commit_tree(), the order does not matter from the mount hash point of view. Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: `663206854f` "copy_tree(): don't link the mounts via mnt_list" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-08-19 11:58:18 -04:00
Askar Safin	1e5f0fb41f	vfs: fs/namespace.c: remove ms_flags argument from do_remount ...because it is not used Signed-off-by: Askar Safin <safinaskar@zohomail.com> Link: https://lore.kernel.org/20250811045444.1813009-1-safinaskar@zohomail.com Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 16:08:31 +02:00
Yuntao Wang	593d9e4c3d	fs: fix incorrect lflags value in the move_mount syscall The lflags value used to look up from_path was overwritten by the one used to look up to_path. In other words, from_path was looked up with the wrong lflags value. Fix it. Fixes: `f9fde814de` ("fs: support getname_maybe_null() in move_mount()") Signed-off-by: Yuntao Wang <yuntao.wang@linux.dev> Link: https://lore.kernel.org/20250811052426.129188-1-yuntao.wang@linux.dev [Christian Brauner <brauner@kernel.org>: massage patch] Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 16:05:53 +02:00
Aleksa Sarai	807602d8cf	vfs: output mount_too_revealing() errors to fscontext It makes little sense for fsmount() to output the warning message when mount_too_revealing() is violated to kmsg. Instead, the warning should be output (with a "VFS" prefix) to the fscontext log. In addition, include the same log message for mount_too_revealing() when doing a regular mount for consistency. With the newest fsopen()-based mount(8) from util-linux, the error messages now look like # mount -t proc proc /tmp mount: /tmp: fsmount() failed: VFS: Mount too revealing. dmesg(1) may have more information after failed mount system call. which could finally result in mount_too_revealing() errors being easier for users to detect and understand. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250806-errorfc-mount-too-revealing-v2-2-534b9b4d45bb@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:52:40 +02:00
Aleksa Sarai	9308366f06	open_tree_attr: do not allow id-mapping changes without OPEN_TREE_CLONE As described in commit `7a54947e72` ('Merge patch series "fs: allow changing idmappings"'), open_tree_attr(2) was necessary in order to allow for a detached mount to be created and have its idmappings changed without the risk of any racing threads operating on it. For this reason, mount_setattr(2) still does not allow for id-mappings to be changed. However, there was a bug in commit `2462651ffa` ("fs: allow changing idmappings") which allowed users to bypass this restriction by calling open_tree_attr(2) without OPEN_TREE_CLONE. can_idmap_mount() prevented this bug from allowing an attached mountpoint's id-mapping from being modified (thanks to an is_anon_ns() check), but this still allows for detached (but visible) mounts to have their be id-mapping changed. This risks the same UAF and locking issues as described in the merge commit, and was likely unintentional. Fixes: `2462651ffa` ("fs: allow changing idmappings") Cc: stable@vger.kernel.org # v6.15+ Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250808-open_tree_attr-bugfix-idmap-v1-1-0ec7bc05646c@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-08-11 14:51:49 +02:00
Linus Torvalds	f70d24c230	Merge tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull namespace updates from Christian Brauner: "This contains namespace updates. This time specifically for nsfs: - Userspace heavily relies on the root inode numbers for namespaces to identify the initial namespaces. That's already a hard dependency. So we cannot change that anymore. Move the initial inode numbers to a public header and align the only two namespaces that currently don't do that with all the other namespaces. - The root inode of /proc having a fixed inode number has been part of the core kernel ABI since its inception, and recently some userspace programs (mainly container runtimes) have started to explicitly depend on this behaviour. The main reason this is useful to userspace is that by checking that a suspect /proc handle has fstype PROC_SUPER_MAGIC and is PROCFS_ROOT_INO, they can then use openat2() together with RESOLVE_{NO_{XDEV,MAGICLINK},BENEATH} to ensure that there isn't a bind-mount that replaces some procfs file with a different one. This kind of attack has lead to security issues in container runtimes in the past (such as CVE-2019-19921) and libraries like libpathrs[1] use this feature of procfs to provide safe procfs handling functions" * tag 'vfs-6.17-rc1.nsfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: uapi: export PROCFS_ROOT_INO mntns: use stable inode number for initial mount ns netns: use stable inode number for initial mount ns nsfs: move root inode number to uapi	2025-07-28 12:50:56 -07:00
Linus Torvalds	7879d7aff0	Merge tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc VFS updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add ext4 IOCB_DONTCACHE support This refactors the address_space_operations write_begin() and write_end() callbacks to take const struct kiocb * as their first argument, allowing IOCB flags such as IOCB_DONTCACHE to propagate to the filesystem's buffered I/O path. Ext4 is updated to implement handling of the IOCB_DONTCACHE flag and advertises support via the FOP_DONTCACHE file operation flag. Additionally, the i915 driver's shmem write paths are updated to bypass the legacy write_begin/write_end interface in favor of directly calling write_iter() with a constructed synchronous kiocb. Another i915 change replaces a manual write loop with kernel_write() during GEM shmem object creation. Cleanups: - don't duplicate vfs_open() in kernel_file_open() - proc_fd_getattr(): don't bother with S_ISDIR() check - fs/ecryptfs: replace snprintf with sysfs_emit in show function - vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() - filelock: add new locks_wake_up_waiter() helper - fs: Remove three arguments from block_write_end() - VFS: change old_dir and new_dir in struct renamedata to dentrys - netfs: Remove unused declaration netfs_queue_write_request() Fixes: - eventpoll: Fix semi-unbounded recursion - eventpoll: fix sphinx documentation build warning - fs/read_write: Fix spelling typo - fs: annotate data race between poll_schedule_timeout() and pollwake() - fs/pipe: set FMODE_NOWAIT in create_pipe_files() - docs/vfs: update references to i_mutex to i_rwsem - fs/buffer: remove comment about hard sectorsize - fs/buffer: remove the min and max limit checks in __getblk_slow() - fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable - fs_context: fix parameter name in infofc() macro - fs: Prevent file descriptor table allocations exceeding INT_MAX" * tag 'vfs-6.17-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (24 commits) netfs: Remove unused declaration netfs_queue_write_request() eventpoll: fix sphinx documentation build warning ext4: support uncached buffered I/O mm/pagemap: add write_begin_get_folio() helper function fs: change write_begin/write_end interface to take struct kiocb * drm/i915: Refactor shmem_pwrite() to use kiocb and write_iter drm/i915: Use kernel_write() in shmem object create eventpoll: Fix semi-unbounded recursion vfs: Remove unnecessary list_for_each_entry_safe() from evict_inodes() fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressable fs/buffer: remove the min and max limit checks in __getblk_slow() fs: Prevent file descriptor table allocations exceeding INT_MAX fs: Remove three arguments from block_write_end() fs/ecryptfs: replace snprintf with sysfs_emit in show function fs: annotate suspected data race between poll_schedule_timeout() and pollwake() docs/vfs: update references to i_mutex to i_rwsem fs/buffer: remove comment about hard sectorsize fs_context: fix parameter name in infofc() macro VFS: change old_dir and new_dir in struct renamedata to dentrys proc_fd_getattr(): don't bother with S_ISDIR() check ...	2025-07-28 11:22:56 -07:00
Al Viro	a7cce09945	statmount_mnt_basic(): simplify the logics for group id We are holding namespace_sem shared and we have not done any group id allocations since we grabbed it. Therefore IS_MNT_SHARED(m) is equivalent to non-zero m->mnt_group_id. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:46 -04:00
Al Viro	f6cc2f4e3d	invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED() All places where we call set_mnt_shared() are guaranteed to have non-zero ->mnt_group_id - either by explicit test, or by having done successful invent_group_ids() covering the same mount since we'd grabbed namespace_sem. The opposite combination (non-zero ->mnt_group_id and !IS_MNT_SHARED()) is possible - it means that we have allocated group id, but didn't get around to set_mnt_shared() yet; such state is transient - by the time we do namespace_unlock(), we must either do set_mnt_shared() or unroll the group id allocations by cleanup_group_ids(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:46 -04:00
Al Viro	725ab435ff	get rid of CL_SHARE_TO_SLAVE the only difference between it and CL_SLAVE is in this predicate in clone_mnt(): if ((flag & CL_SLAVE) \|\| ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) { However, in case of CL_SHARED_TO_SLAVE we have not allocated any mount group ids since the time we'd grabbed namespace_sem, so IS_MNT_SHARED() is equivalent to non-zero ->mnt_group_id. And in case of CL_SLAVE old has come either from the original tree, which had ->mnt_group_id allocated for all nodes or from result of sequence of CL_MAKE_SHARED or CL_MAKE_SHARED\|CL_SLAVE copies, ultimately going back to the original tree. In both cases we are guaranteed that old->mnt_group_id will be non-zero. In other words, the predicate is always equal to (flags & (CL_SLAVE \| CL_SHARED_TO_SLAVE)) && old->mnt_group_id and with that replacement CL_SLAVE and CL_SHARED_TO_SLAVE have exact same behaviour. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:46 -04:00
Al Viro	aab771f34e	take freeing of emptied mnt_namespace to namespace_unlock() Freeing of a namespace must be delayed until after we'd dealt with mount notifications (in namespace_unlock()). The reasons are not immediately obvious (they are buried in ->prev_ns handling in mnt_notify()), and having that free_mnt_ns() explicitly called after namespace_unlock() is asking for trouble - it does feel like they should be OK to free as soon as they've been emptied. Make the things more explicit by setting 'emptied_ns' under namespace_sem and having namespace_unlock() free the sucker as soon as it's safe to free. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2025-06-29 19:03:46 -04:00

1 2 3 4 5 ...

924 Commits