linux

mirror of https://github.com/torvalds/linux.git synced 2025-12-07 20:06:24 +00:00

Author	SHA1	Message	Date
Linus Torvalds	bc69ed9752	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: "Just a bunch of fixes and cleanups, mostly very simple. Several features were merged through net-next this time around" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio_pci: drop kernel.h vhost: switch to arrays of feature bits vhost/test: add test specific macro for features virtio: clean up features qword/dword terms vduse: add WQ_PERCPU to alloc_workqueue users virtio_balloon: add WQ_PERCPU to alloc_workqueue users vdpa/pds: use %pe for ERR_PTR() in event handler registration vhost: Fix kthread worker cgroup failure handling virtio: vdpa: Fix reference count leak in octep_sriov_enable() vdpa/mlx5: Fix incorrect error code reporting in query_virtqueues virtio: fix map ops comment virtio: fix virtqueue_set_affinity() docs virtio: standardize Returns documentation style virtio: fix grammar in virtio_map_ops docs virtio: fix grammar in virtio_queue_info docs virtio: fix whitespace in virtio_config_ops virtio: fix typo in virtio_device_ready() comment virtio: fix kernel-doc for mapping/free_coherent functions virtio_vdpa: fix misleading return in void function	2025-12-04 18:59:21 -08:00
Michael S. Tsirkin	dfe44d177a	vhost: switch to arrays of feature bits The current interface where caller has to know in which 64 bit chunk each bit is, is inelegant and fragile. Let's simply use arrays of bits. By using unroll macros text size grows only slightly. Message-ID: <637e182e139980e5930d50b928ba5ac072d628a9.1764225384.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-11-30 18:02:43 -05:00
Michael S. Tsirkin	350a840110	vhost/test: add test specific macro for features test just uses vhost features with no change, but people tend to copy/paste code, so let's add our own define. Message-ID: <23ca04512a800ee8b3594482492e536020931340.1764225384.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-11-27 02:03:07 -05:00
Michael S. Tsirkin	9513f25056	virtio: clean up features qword/dword terms virtio pci uses word to mean "16 bits". mmio uses it to mean "32 bits". To avoid confusion, let's avoid the term in core virtio altogether. Just say U64 to mean "64 bit". Fixes: `e7d4c1c5a5` ("virtio: introduce extended features") Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Message-ID: <ad53b7b6be87fc524f45abaeca0bb05fb3633397.1764225384.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-11-27 02:03:07 -05:00
Mike Christie	f3f64c2eaf	vhost: Fix kthread worker cgroup failure handling If we fail to attach to a cgroup we are leaking the id. This adds a new goto to free the id. Fixes: `7d9896e9f6` ("vhost: Reintroduce kthread API and add mode selection") Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <20251101194358.13605-1-michael.christie@oracle.com>	2025-11-27 02:03:06 -05:00
Jason Wang	779bcdd4b9	vhost: rewind next_avail_head while discarding descriptors When discarding descriptors with IN_ORDER, we should rewind next_avail_head otherwise it would run out of sync with last_avail_idx. This would cause driver to report "id X is not a head". Fixing this by returning the number of descriptors that is used for each buffer via vhost_get_vq_desc_n() so caller can use the value while discarding descriptors. Fixes: `67a873df0c` ("vhost: basic in order support") Cc: stable@vger.kernel.org Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20251120022950.10117-1-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-11-26 14:44:58 -08:00
Jason Wang	58aca3dbc7	vdpa: support virtio_map Virtio core switches from DMA device to virtio_map, let's do that as well for vDPA. Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250821064641.5025-8-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Eugenio Pérez <eperezma@redhat.com>	2025-10-01 07:24:43 -04:00
Michael S. Tsirkin	c0e1116189	vhost: vringh: Fix copy_to_iter return value check The return value of copy_to_iter can't be negative, check whether the copied length is equal to the requested length instead of checking for negative values. Fixes: `309bba39c9` ("vringh: iterate on iotlb_translate to handle large translations") Cc: "Stefano Garzarella" <sgarzare@redhat.com> Cc: zhang jiao <zhangjiao2@cmss.chinamobile.com> Link: https://lore.kernel.org/all/20250910091739.2999-1-zhangjiao2@cmss.chinamobile.com Message-ID: <cd637504a6e3967954a9e80fc1b75e8c0978087b.1758723310.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-10-01 07:24:31 -04:00
zhang jiao	82a8d0fda5	vhost: vringh: Modify the return value check The return value of copy_from_iter and copy_to_iter can't be negative, check whether the copied lengths are equal. Fixes: `309bba39c9` ("vringh: iterate on iotlb_translate to handle large translations") Cc: "Stefano Garzarella" <sgarzare@redhat.com> Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com> Message-Id: <20250910091739.2999-1-zhangjiao2@cmss.chinamobile.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-10-01 07:24:20 -04:00
Jason Wang	e430451613	vhost-net: flush batched before enabling notifications Commit `8c2e6b26ff` ("vhost/net: Defer TX queue re-enable until after sendmsg") tries to defer the notification enabling by moving the logic out of the loop after the vhost_tx_batch() when nothing new is spotted. This caused unexpected side effects as the new logic is reused for several other error conditions. A previous patch reverted `8c2e6b26ff`. Now, bring the performance back up by flushing batched buffers before enabling notifications. Reported-by: Jon Kohler <jon@nutanix.com> Cc: stable@vger.kernel.org Fixes: `8c2e6b26ff` ("vhost/net: Defer TX queue re-enable until after sendmsg") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <20250917063045.2042-3-jasowang@redhat.com>	2025-09-19 04:15:26 -04:00
Michael S. Tsirkin	4174152771	Revert "vhost/net: Defer TX queue re-enable until after sendmsg" This reverts commit `8c2e6b26ff`. It tries to defer the notification enabling by moving the logic out of the loop after the vhost_tx_batch() when nothing new is spotted. This will bring side effects as the new logic would be reused for several other error conditions. One example is the IOTLB: when there's an IOTLB miss, get_tx_bufs() might return -EAGAIN and exit the loop and see there's still available buffers, so it will queue the tx work again until userspace feed the IOTLB entry correctly. This will slowdown the tx processing and trigger the TX watchdog in the guest as reported in https://lkml.org/lkml/2025/9/10/1596. To fix, revert the change. A follow up patch will bring the performance back in a safe way. Reported-by: Jon Kohler <jon@nutanix.com> Cc: stable@vger.kernel.org Fixes: `8c2e6b26ff` ("vhost/net: Defer TX queue re-enable until after sendmsg") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <20250917063045.2042-2-jasowang@redhat.com>	2025-09-19 04:15:26 -04:00
Jason Wang	90beccb3e1	vhost-net: unbreak busy polling Commit `67a873df0c` ("vhost: basic in order support") pass the number of used elem to vhost_net_rx_peek_head_len() to make sure it can signal the used correctly before trying to do busy polling. But it forgets to clear the count, this would cause the count run out of sync with handle_rx() and break the busy polling. Fixing this by passing the pointer of the count and clearing it after the signaling the used. Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: stable@vger.kernel.org Fixes: `67a873df0c` ("vhost: basic in order support") Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250917063045.2042-1-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-09-19 04:15:26 -04:00
Alok Tiwari	a318eb8078	vhost-scsi: fix argument order in tport allocation error message The error log in vhost_scsi_make_tport() prints the arguments in the wrong order, producing confusing output. For example, when creating a target with a name in WWNN format such as "fc.port1234", the log looks like: Emulated fc.port1234 Address: FCP, exceeds max: 64 Instead, the message should report the emulated protocol type first, followed by the configfs name as: Emulated FCP Address: fc.port1234, exceeds max: 64 Fix the argument order so the error log is consistent and clear. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Message-Id: <20250913154106.3995856-1-alok.a.tiwari@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>	2025-09-15 18:25:58 -04:00
Nikolay Kuratov	dd54bcf86c	vhost/net: Protect ubufs with rcu read lock in vhost_net_ubuf_put() When operating on struct vhost_net_ubuf_ref, the following execution sequence is theoretically possible: CPU0 is finalizing DMA operation CPU1 is doing VHOST_NET_SET_BACKEND // ubufs->refcount == 2 vhost_net_ubuf_put() vhost_net_ubuf_put_wait_and_free(oldubufs) vhost_net_ubuf_put_and_wait() vhost_net_ubuf_put() int r = atomic_sub_return(1, &ubufs->refcount); // r = 1 int r = atomic_sub_return(1, &ubufs->refcount); // r = 0 wait_event(ubufs->wait, !atomic_read(&ubufs->refcount)); // no wait occurs here because condition is already true kfree(ubufs); if (unlikely(!r)) wake_up(&ubufs->wait); // use-after-free This leads to use-after-free on ubufs access. This happens because CPU1 skips waiting for wake_up() when refcount is already zero. To prevent that use a read-side RCU critical section in vhost_net_ubuf_put(), as suggested by Hillf Danton. For this lock to take effect, free ubufs with kfree_rcu(). Cc: stable@vger.kernel.org Fixes: `0ad8b480d6` ("vhost: fix ref cnt checking deadlock") Reported-by: Andrey Ryabinin <arbn@yandex-team.com> Suggested-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru> Message-Id: <20250805130917.727332-1-kniv@yandex-team.ru> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-26 03:38:10 -04:00
Jason Wang	6a20f9fca3	vhost: initialize vq->nheads properly Commit 7918bb2d19c9 ("vhost: basic in order support") introduces vq->nheads to store the number of batched used buffers per used elem but it forgets to initialize the vq->nheads to NULL in vhost_dev_init() this will cause kfree() that would try to free it without be allocated if SET_OWNER is not called. Reported-by: JAEHOON KIM <jhkim@linux.ibm.com> Reported-by: Breno Leitao <leitao@debian.org> Fixes: `45347e79b5` ("vhost: basic in order support") Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250729073916.80647-1-jasowang@redhat.com> Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Tested-by: Breno Leitao <leitao@debian.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Tested-by: Jaehoon Kim <jhkim@linux.ibm.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-05 05:57:40 -04:00
Linus Torvalds	821c9e515d	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - vhost can now support legacy threading if enabled in Kconfig - vsock memory allocation strategies for large buffers have been improved, reducing pressure on kmalloc - vhost now supports the in-order feature. guest bits missed the merge window. - fixes, cleanups all over the place * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (30 commits) vsock/virtio: Allocate nonlinear SKBs for handling large transmit buffers vsock/virtio: Rename virtio_vsock_skb_rx_put() vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers vsock/virtio: Move SKB allocation lower-bound check to callers vsock/virtio: Rename virtio_vsock_alloc_skb() vsock/virtio: Resize receive buffers so that each SKB fits in a 4K page vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put() vsock/virtio: Validate length in packet header before skb_put() vhost/vsock: Avoid allocating arbitrarily-sized SKBs vhost_net: basic in_order support vhost: basic in order support vhost: fail early when __vhost_add_used() fails vhost: Reintroduce kthread API and add mode selection vdpa: Fix IDR memory leak in VDUSE module exit vdpa/mlx5: Fix release of uninitialized resources on error path vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit virtio: virtio_dma_buf: fix missing parameter documentation vhost: Fix typos vhost: vringh: Remove unused functions vhost: vringh: Remove unused iotlb functions ...	2025-08-01 14:17:48 -07:00
Will Deacon	8ca76151d2	vsock/virtio: Rename virtio_vsock_skb_rx_put() In preparation for using virtio_vsock_skb_rx_put() when populating SKBs on the vsock TX path, rename virtio_vsock_skb_rx_put() to virtio_vsock_skb_put(). No functional change. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-9-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	ab9aa2f3af	vhost/vsock: Allocate nonlinear SKBs for handling large receive buffers When receiving a packet from a guest, vhost_vsock_handle_tx_kick() calls vhost_vsock_alloc_linear_skb() to allocate and fill an SKB with the receive data. Unfortunately, these are always linear allocations and can therefore result in significant pressure on kmalloc() considering that the maximum packet size (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + VIRTIO_VSOCK_SKB_HEADROOM) is a little over 64KiB, resulting in a 128KiB allocation for each packet. Rework the vsock SKB allocation so that, for sizes with page order greater than PAGE_ALLOC_COSTLY_ORDER, a nonlinear SKB is allocated instead with the packet header in the SKB and the receive data in the fragments. Finally, add a debug warning if virtio_vsock_skb_rx_put() is ever called on an SKB with a non-zero length, as this would be destructive for the nonlinear case. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-8-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	fac6b82e0f	vsock/virtio: Move SKB allocation lower-bound check to callers virtio_vsock_alloc_linear_skb() checks that the requested size is at least big enough for the packet header (VIRTIO_VSOCK_SKB_HEADROOM). Of the three callers of virtio_vsock_alloc_linear_skb(), only vhost_vsock_alloc_skb() can potentially pass a packet smaller than the header size and, as it already has a check against the maximum packet size, extend its bounds checking to consider the minimum packet size and remove the check from virtio_vsock_alloc_linear_skb(). Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-7-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	2304c64a28	vsock/virtio: Rename virtio_vsock_alloc_skb() In preparation for nonlinear allocations for large SKBs, rename virtio_vsock_alloc_skb() to virtio_vsock_alloc_linear_skb() to indicate that it returns linear SKBs unconditionally and switch all callers over to this new interface for now. No functional change. Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-6-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	87dbae5e36	vsock/virtio: Move length check to callers of virtio_vsock_skb_rx_put() virtio_vsock_skb_rx_put() only calls skb_put() if the length in the packet header is not zero even though skb_put() handles this case gracefully. Remove the functionally redundant check from virtio_vsock_skb_rx_put() and, on the assumption that this is a worthwhile optimisation for handling credit messages, augment the existing length checks in virtio_transport_rx_work() to elide the call for zero-length payloads. Since the callers all have the length, extend virtio_vsock_skb_rx_put() to take it as an additional parameter rather than fish it back out of the packet header. Note that the vhost code already has similar logic in vhost_vsock_alloc_skb(). Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-4-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Will Deacon	10a886aaed	vhost/vsock: Avoid allocating arbitrarily-sized SKBs vhost_vsock_alloc_skb() returns NULL for packets advertising a length larger than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE in the packet header. However, this is only checked once the SKB has been allocated and, if the length in the packet header is zero, the SKB may not be freed immediately. Hoist the size check before the SKB allocation so that an iovec larger than VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + the header size is rejected outright. The subsequent check on the length field in the header can then simply check that the allocated SKB is indeed large enough to hold the packet. Cc: <stable@vger.kernel.org> Fixes: `71dc9ec9ac` ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Will Deacon <will@kernel.org> Message-Id: <20250717090116.11987-2-will@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:09 -04:00
Jason Wang	45347e79b5	vhost_net: basic in_order support This patch introduces basic in-order support for vhost-net. By recording the number of batched buffers in an array when calling `vhost_add_used_and_signal_n()`, we can reduce the number of userspace accesses. Note that the vhost-net batching logic is kept as we still count the number of buffers there. Testing Results: With testpmd: - TX: txonly mode + vhost_net with XDP_DROP on TAP shows a 17.5% improvement, from 4.75 Mpps to 5.35 Mpps. - RX: No obvious improvements were observed. With virtio-ring in-order experimental code in the guest: - TX: pktgen in the guest + XDP_DROP on TAP shows a 19% improvement, from 5.2 Mpps to 6.2 Mpps. - RX: pktgen on TAP with vhost_net + XDP_DROP in the guest achieves a 6.1% improvement, from 3.47 Mpps to 3.61 Mpps. Acked-by: Jonah Palmer <jonah.palmer@oracle.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250714084755.11921-4-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:09 -04:00
Jason Wang	67a873df0c	vhost: basic in order support This patch adds basic in order support for vhost. Two optimizations are implemented in this patch: 1) Since driver uses descriptor in order, vhost can deduce the next avail ring head by counting the number of descriptors that has been used in next_avail_head. This eliminate the need to access the available ring in vhost. 2) vhost_add_used_and_singal_n() is extended to accept the number of batched buffers per used elem. While this increases the times of userspace memory access but it helps to reduce the chance of used ring access of both the driver and vhost. Vhost-net will be the first user for this. Acked-by: Jonah Palmer <jonah.palmer@oracle.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250714084755.11921-3-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:09 -04:00
Jason Wang	b4ba1207d4	vhost: fail early when __vhost_add_used() fails This patch fails vhost_add_used_n() early when __vhost_add_used() fails to make sure used idx is not updated with stale used ring information. Reported-by: Eugenio Pérez <eperezma@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Message-Id: <20250714084755.11921-2-jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:09 -04:00
Cindy Lu	7d9896e9f6	vhost: Reintroduce kthread API and add mode selection Since commit `6e890c5d50` ("vhost: use vhost_tasks for worker threads"), the vhost uses vhost_task and operates as a child of the owner thread. This is required for correct CPU usage accounting, especially when using containers. However, this change has caused confusion for some legacy userspace applications, and we didn't notice until it's too late. Unfortunately, it's too late to revert - we now have userspace depending both on old and new behaviour :( To address the issue, reintroduce kthread mode for vhost workers and provide a configuration to select between kthread and task worker. - Add 'fork_owner' parameter to vhost_dev to let users select kthread or task mode. Default mode is task mode(VHOST_FORK_OWNER_TASK). - Reintroduce kthread mode support: * Bring back the original vhost_worker() implementation, and renamed to vhost_run_work_kthread_list(). * Add cgroup support for the kthread * Introduce struct vhost_worker_ops: - Encapsulates create / stop / wake‑up callbacks. - vhost_worker_create() selects the proper ops according to inherit_owner. - Userspace configuration interface: * New IOCTLs: - VHOST_SET_FORK_FROM_OWNER lets userspace select task mode (VHOST_FORK_OWNER_TASK) or kthread mode (VHOST_FORK_OWNER_KTHREAD) - VHOST_GET_FORK_FROM_OWNER reads the current worker mode * Expose module parameter 'fork_from_owner_default' to allow system administrators to configure the default mode for vhost workers * Kconfig option CONFIG_VHOST_ENABLE_FORK_OWNER_CONTROL controls whether these IOCTLs and the parameter are available - The VHOST_NEW_WORKER functionality requires fork_owner to be set to true, with validation added to ensure proper configuration This partially reverts or improves upon: commit `6e890c5d50` ("vhost: use vhost_tasks for worker threads") commit `1cdaafa1b8` ("vhost: replace single worker pointer with xarray") Fixes: `6e890c5d50` ("vhost: use vhost_tasks for worker threads"), Signed-off-by: Cindy Lu <lulu@redhat.com> Message-Id: <20250714071333.59794-2-lulu@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:09 -04:00
Alok Tiwari	400cad513c	vhost-scsi: Fix check for inline_sg_cnt exceeding preallocated limit The condition comparing ret to VHOST_SCSI_PREALLOC_SGLS was incorrect, as ret holds the result of kstrtouint() (typically 0 on success), not the parsed value. Update the check to use cnt, which contains the actual user-provided value. prevents silently accepting values exceeding the maximum inline_sg_cnt. Fixes: `bca939d5bc` ("vhost-scsi: Dynamically allocate scatterlists") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20250628183405.3979538-1-alok.a.tiwari@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>	2025-08-01 09:11:08 -04:00
Alok Tiwari	8a0d18a934	vhost: Fix typos Fix multiple typos and improve comment clarity across vhost.c. Spelling errors: "thead" -> "thread", "RUNNUNG" -> "RUNNING" and "available". Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Message-Id: <20250615173933.1610324-1-alok.a.tiwari@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org>	2025-08-01 09:11:08 -04:00
Dr. David Alan Gilbert	6e9ef6937c	vhost: vringh: Remove unused functions The functions: vringh_abandon_kern() vringh_abandon_user() vringh_iov_pull_kern() and vringh_iov_push_kern() were all added in 2013 by commit `f87d0fbb57` ("vringh: host-side implementation of virtio rings.") but have remained unused. Remove them and the two helper functions they used. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Message-Id: <20250617001838.114457-3-linux@treblig.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org>	2025-08-01 09:11:08 -04:00
Dr. David Alan Gilbert	569c392e19	vhost: vringh: Remove unused iotlb functions The functions: vringh_abandon_iotlb() vringh_notify_disable_iotlb() and vringh_notify_enable_iotlb() were added in 2020 by commit `9ad9c49cfe` ("vringh: IOTLB support") but have remained unused. Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Message-Id: <20250617001838.114457-2-linux@treblig.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Eugenio Pérez <eperezma@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com>	2025-08-01 09:11:08 -04:00
Mike Christie	69cd720a8a	vhost-scsi: Fix log flooding with target does not exist errors As part of the normal initiator side scanning the guest's scsi layer will loop over all possible targets and send an inquiry. Since the max number of targets for virtio-scsi is 256, this can result in 255 error messages about targets not existing if you only have a single target. When there's more than 1 vhost-scsi device each with a single target, then you get N * 255 log messages. It looks like the log message was added by accident in: commit `3f8ca2e115` ("vhost/scsi: Extract common handling code from control queue handler") when we added common helpers. Then in: commit `09d7583294` ("vhost/scsi: Use common handling code in request queue handler") we converted the scsi command processing path to use the new helpers so we started to see the extra log messages during scanning. The patches were just making some code common but added the vq_err call and I'm guessing the patch author forgot to enable the vq_err call (vq_err is implemented by pr_debug which defaults to off). So this patch removes the call since it's expected to hit this path during device discovery. Fixes: `09d7583294` ("vhost/scsi: Use common handling code in request queue handler") Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20250611210113.10912-1-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-08-01 09:11:08 -04:00
Alok Tiwari	652abad085	vhost-scsi: Fix typos and formatting in comments and logs This patch corrects several minor typos and formatting issues. Changes include: Fixing misspellings like in comments - "explict" -> "explicit" - "infight" -> "inflight", - "with generate" -> "will generate" formatting in logs - Correcting log formatting specifier from "%dd" to "%d" - Adding a missing space in the sysfs emit string to prevent misinterpreted output like "X86_64on ". changing to "X86_64 on " - Cleaning up stray semicolons in struct definition endings These changes improve code readability and consistency. no functionality changes. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Message-Id: <20250611143932.2443796-1-alok.a.tiwari@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com>	2025-08-01 09:11:08 -04:00
Linus Torvalds	63eb28bb14	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM: - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor - Convert more system register sanitization to the config-driven implementation - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls - Various cleanups and minor fixes LoongArch: - Add stat information for in-kernel irqchip - Add tracepoints for CPUCFG and CSR emulation exits - Enhance in-kernel irqchip emulation - Various cleanups RISC-V: - Enable ring-based dirty memory tracking - Improve perf kvm stat to report interrupt events - Delegate illegal instruction trap to VS-mode - MMU improvements related to upcoming nested virtualization s390x - Fixes x86: - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time - Share device posted IRQ code between SVM and VMX and harden it against bugs and runtime errors - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n) - For MMIO stale data mitigation, track whether or not a vCPU has access to (host) MMIO based on whether the page tables have MMIO pfns mapped; using VFIO is prone to false negatives - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical - Recalculate all MSR intercepts from scratch on MSR filter changes, instead of maintaining shadow bitmaps - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently - Fix a user-triggerable WARN that syzkaller found by setting the vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU into VMX Root Mode (post-VMXON). Trying to detect every possible path leading to architecturally forbidden states is hard and even risks breaking userspace (if it goes from valid to valid state but passes through invalid states), so just wait until KVM_RUN to detect that the vCPU state isn't allowed - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can access APERF/MPERF. This has many caveats (APERF/MPERF cannot be zeroed on vCPU creation or saved/restored on suspend and resume, or preserved over thread migration let alone VM migration) but can be useful whenever you're interested in letting Linux guests see the effective physical CPU frequency in /proc/cpuinfo - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been created, as there's no known use case for changing the default frequency for other VM types and it goes counter to the very reason why the ioctl was added to the vm file descriptor. And also, there would be no way to make it work for confidential VMs with a "secure" TSC, so kill two birds with one stone - Dynamically allocation the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list) - Extract many of KVM's helpers for accessing architectural local APIC state to common x86 so that they can be shared by guest-side code for Secure AVIC - Various cleanups and fixes x86 (Intel): - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest. Failure to honor FREEZE_IN_SMM can leak host state into guests - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF x86 (AMD): - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty much a static condition and therefore should never happen, but still) - Fix a variety of flaws and bugs in the AVIC device posted IRQ code - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs - Request GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model - Accept any SNP policy that is accepted by the firmware with respect to SMT and single-socket restrictions. An incompatible policy doesn't put the kernel at risk in any way, so there's no reason for KVM to care - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and use WBNOINVD instead of WBINVD when possible for SEV cache maintenance - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs that can't possibly have cache lines with dirty, encrypted data Generic: - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI - Track irqbypass's "token" as "struct eventfd_ctx " instead of a "void ", to avoid making a simple concept unnecessarily difficult to understand - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs - Clean up and document/comment the irqfd assignment code - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related to private <=> shared memory conversions - Drop guest_memfd's .getattr() implementation as the VFS layer will call generic_fillattr() if inode_operations.getattr is NULL - Fix issues with dirty ring harvesting where KVM doesn't bound the processing of entries in any way, which allows userspace to keep KVM in a tight loop indefinitely - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking, now that KVM no longer uses assigned_device_count as a heuristic for either irqbypass usage or MDS mitigation Selftests: - Fix a comment typo - Verify KVM is loaded when getting any KVM module param so that attempting to run a selftest without kvm.ko loaded results in a SKIP message about KVM not being loaded/enabled (versus some random parameter not existing) - Skip tests that hit EACCES when attempting to access a file, and print a "Root required?" help message. In most cases, the test just needs to be run with elevated permissions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits) Documentation: KVM: Use unordered list for pre-init VGIC registers RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map() RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs RISC-V: perf/kvm: Add reporting of interrupt events RISC-V: KVM: Enable ring-based dirty memory tracking RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap RISC-V: KVM: Delegate illegal instruction fault to VS mode RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs RISC-V: KVM: Factor-out g-stage page table management RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence RISC-V: KVM: Introduce struct kvm_gstage_mapping RISC-V: KVM: Factor-out MMU related declarations into separate headers RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect() RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range() RISC-V: KVM: Don't flush TLB when PTE is unchanged RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize() RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init() RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list ...	2025-07-30 17:14:01 -07:00
Jakub Kicinski	b430f6c38d	Merge branch 'virtio_udp_tunnel_08_07_2025' of https://github.com/pabeni/linux-devel Paolo Abeni says: ==================== virtio: introduce GSO over UDP tunnel Some virtualized deployments use UDP tunnel pervasively and are impacted negatively by the lack of GSO support for such kind of traffic in the virtual NIC driver. The virtio_net specification recently introduced support for GSO over UDP tunnel, this series updates the virtio implementation to support such a feature. Currently the kernel virtio support limits the feature space to 64, while the virtio specification allows for a larger number of features. Specifically the GSO-over-UDP-tunnel-related virtio features use bits 65-69. The first four patches in this series rework the virtio and vhost feature support to cope with up to 128 bits. The limit is set by a define and could be easily raised in future, as needed. This implementation choice is aimed at keeping the code churn as limited as possible. For the same reason, only the virtio_net driver is reworked to leverage the extended feature space; all other virtio/vhost drivers are unaffected, but could be upgraded to support the extended features space in a later time. The last four patches bring in the actual GSO over UDP tunnel support. As per specification, some additional fields are introduced into the virtio net header to support the new offload. The presence of such fields depends on the negotiated features. New helpers are introduced to convert the UDP-tunneled skb metadata to an extended virtio net header and vice versa. Such helpers are used by the tun and virtio_net driver to cope with the newly supported offloads. Tested with basic stream transfer with all the possible permutations of host kernel/qemu/guest kernel with/without GSO over UDP tunnel support. ==================== Link: https://patch.msgid.link/cover.1751874094.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-10 13:32:35 -07:00
Paolo Abeni	bbca931fce	vhost/net: enable gso over UDP tunnel support. Vhost net need to know the exact virtio net hdr size to be able to copy such header correctly. Teach it about the newly defined UDP tunnel-related option and update the hdr size computation accordingly. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-08 18:07:28 +02:00
Paolo Abeni	333c515d18	vhost-net: allow configuring extended features Use the extended feature type for 'acked_features' and implement two new ioctls operation allowing the user-space to set/query an unbounded amount of features. The actual number of processed features is limited by VIRTIO_FEATURES_MAX and attempts to set features above such limit fail with EOPNOTSUPP. Note that: the legacy ioctls implicitly truncate the negotiated features to the lower 64 bits range and the 'acked_backend_features' field don't need conversion, as the only negotiated feature there is in the low 64 bit range. Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-07-08 18:05:23 +02:00
Jason Wang	97b2409f28	vhost-net: reduce one userspace copy when building XDP buff We used to do twice copy_from_iter() to copy virtio-net and packet separately. This introduce overheads for userspace access hardening as well as SMAP (for x86 it's stac/clac). So this patch tries to use one copy_from_iter() to copy them once and move the virtio-net header afterwards to reduce overheads. Testpmd + vhost_net shows 10% improvement from 5.45Mpps to 6.0Mpps. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-2-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:29:46 -07:00
Jason Wang	4d313f2bd2	tun: remove unnecessary tun_xdp_hdr structure With f95f0f95cfb7("net, xdp: Introduce xdp_init_buff utility routine"), buffer length could be stored as frame size so there's no need to have a dedicated tun_xdp_hdr structure. We can simply store virtio net header instead. Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-1-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-07-02 15:29:46 -07:00
Sean Christopherson	23b54381ce	irqbypass: Require producers to pass in Linux IRQ number during registration Pass in the Linux IRQ associated with an IRQ bypass producer instead of relying on the caller to set the field prior to registration, as there's no benefit to relying on callers to do the right thing. Take care to set producer->irq before __connect(), as KVM expects the IRQ to be valid as soon as a connection is possible. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:52:41 -07:00
Sean Christopherson	2b521d86ee	irqbypass: Take ownership of producer/consumer token tracking Move ownership of IRQ bypass token tracking into irqbypass.ko, and explicitly require callers to pass an eventfd_ctx structure instead of a completely opaque token. Relying on producers and consumers to set the token appropriately is error prone, and hiding the fact that the token must be an eventfd_ctx pointer (for all intents and purposes) unnecessarily obfuscates the code and makes it more brittle. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>	2025-06-20 13:52:38 -07:00
Linus Torvalds	8ca154e491	Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - A new virtio RTC driver - vhost scsi now logs write descriptors so migration works - Some hardening work in virtio core - An old spec compliance issue fixed in vhost net - A couple of cleanups, fixes in vringh, virtio-pci, vdpa * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: virtio: reject shm region if length is zero virtio_rtc: Add RTC class driver virtio_rtc: Add Arm Generic Timer cross-timestamping virtio_rtc: Add PTP clocks virtio_rtc: Add module and driver core vringh: use bvec_kmap_local vhost: vringh: Use matching allocation type in resize_iovec() virtio-pci: Fix result size returned for the admin command completion vdpa/octeon_ep: Control PCI dev enabling manually vhost-scsi: log event queue write descriptors vhost-scsi: log control queue write descriptors vhost-scsi: log I/O queue write descriptors vhost-scsi: adjust vhost_scsi_get_desc() to log vring descriptors vhost: modify vhost_log_write() for broader users	2025-05-29 08:15:35 -07:00
Christoph Hellwig	169294a14b	vringh: use bvec_kmap_local Use the bvec_kmap_local helper rather than digging into the bvec internals. Signed-off-by: Christoph Hellwig <hch@lst.de> Message-Id: <20250501142244.2888227-1-hch@lst.de> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>	2025-05-27 10:27:53 -04:00
Kees Cook	8b3f9967b1	vhost: vringh: Use matching allocation type in resize_iovec() In preparation for making the kmalloc family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void ", which can be implicitly cast to any pointer type.) The assigned type is "struct kvec ", but the returned type will be "struct iovec ". These have the same allocation size, so there is no bug: struct kvec { void iov_base; /* and that should never hold a userland pointer / size_t iov_len; }; struct iovec { void __user iov_base; /* BSD uses caddr_t (1003.1g requires void ) / __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; Adjust the allocation type to match the assignment. Signed-off-by: Kees Cook <kees@kernel.org> Message-Id: <20250426062214.work.334-kees@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-05-27 10:27:53 -04:00
Dongli Zhang	ac9dcca236	vhost-scsi: log event queue write descriptors Log write descriptors for the event queue, leveraging vhost_get_vq_desc() to retrieve the array of write descriptors to obtain the log buffer. There is only one path for event queue. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20250403063028.16045-9-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-05-18 17:25:24 -04:00
Dongli Zhang	a94c96a352	vhost-scsi: log control queue write descriptors Log write descriptors for the control queue, leveraging vhost_scsi_get_desc() and vhost_get_vq_desc() to retrieve the array of write descriptors to obtain the log buffer. For Task Management Requests, similar to the I/O queue, store the log buffer during the submission path and log it in the completion or error handling path. For Asynchronous Notifications, only the submission path is involved. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Message-Id: <20250403063028.16045-8-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com>	2025-05-18 17:25:24 -04:00
Dongli Zhang	c2c5c259aa	vhost-scsi: log I/O queue write descriptors Log write descriptors for the I/O queue, leveraging vhost_scsi_get_desc() and vhost_get_vq_desc() to retrieve the array of write descriptors to obtain the log buffer. In addition, introduce a vhost-scsi specific function to log vring descriptors. In this function, the 'partial' argument is set to false, and the 'len' argument is set to 0, because vhost-scsi always logs all pages shared by a vring descriptor. Add WARN_ON_ONCE() since vhost-scsi doesn't support VIRTIO_F_ACCESS_PLATFORM. The per-cmd log buffer is allocated on demand in the submission path after VHOST_F_LOG_ALL is set. Return -ENOMEM on allocation failure, in order to send SAM_STAT_TASK_SET_FULL to the guest. It isn't reclaimed in the completion path. Instead, it is reclaimed when VHOST_F_LOG_ALL is removed, or during VHOST_SCSI_SET_ENDPOINT when all commands are destroyed. Store the log buffer during the submission path and log it in the completion path. Logging is also required in the error handling path of the submission process. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Message-Id: <20250403063028.16045-7-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com>	2025-05-18 17:25:24 -04:00
Dongli Zhang	e5e6b15b0d	vhost-scsi: adjust vhost_scsi_get_desc() to log vring descriptors Adjust vhost_scsi_get_desc() to facilitate logging of vring descriptors. Add new arguments to allow passing the log buffer and length to vhost_get_vq_desc(). In addition, reset 'log_num' since vhost_get_vq_desc() may reset it only after certain condition checks. Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20250403063028.16045-6-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-05-18 17:25:24 -04:00
Dongli Zhang	41cff026cf	vhost: modify vhost_log_write() for broader users Currently, the only user of vhost_log_write() is vhost-net. The 'len' argument prevents logging of pages that are not tainted by the RX path. Adjustments are needed since more drivers (i.e. vhost-scsi) begin using vhost_log_write(). So far vhost-net RX path may only partially use pages shared via the last vring descriptor. Unlike vhost-net, vhost-scsi always logs all pages shared via vring descriptors. To accommodate this, use (len == U64_MAX) to indicate whether the driver would log all pages of vring descriptors, or only pages that are tainted by the driver. In addition, removes BUG(). Suggested-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Message-Id: <20250403063028.16045-5-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-05-18 17:25:24 -04:00
Jon Kohler	8c2e6b26ff	vhost/net: Defer TX queue re-enable until after sendmsg In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and batches up to 64 messages before calling sock->sendmsg. Currently, when there are no more messages on the ring to dequeue, handle_tx_copy re-enables kicks on the ring before firing off the batch sendmsg. However, sock->sendmsg incurs a non-zero delay, especially if it needs to wake up a thread (e.g., another vhost worker). If the guest submits additional messages immediately after the last ring check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to kick the vhost worker. This may happen while the worker is still processing the sendmsg, leading to wasteful exit(s). This is particularly problematic for single-threaded guest submission threads, as they must exit, wait for the exit to be processed (potentially involving a TTWU), and then resume. In scenarios like a constant stream of UDP messages, this results in a sawtooth pattern where the submitter frequently vmexits, and the vhost-net worker alternates between sleeping and waking. A common solution is to configure vhost-net busy polling via userspace (e.g., qemu poll-us). However, treating the sendmsg as the "busy" period by keeping kicks disabled during the final sendmsg and performing one additional ring check afterward provides a significant performance improvement without any excess busy poll cycles. If messages are found in the ring after the final sendmsg, requeue the TX handler. This ensures fairness for the RX handler and allows vhost_run_work_list to cond_resched() as needed. Test Case TX VM: taskset -c 2 iperf3 -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5 RX VM: taskset -c 2 iperf3 -s -p 5200 -D 6.12.0, each worker backed by tun interface with IFF_NAPI setup. Note: TCP side is largely unchanged as that was copy bound 6.12.0 unpatched EPT_MISCONFIG/second: 5411 Datagrams/second: ~382k Interval Transfer Bitrate Lost/Total Datagrams 0.00-30.00 sec 15.5 GBytes 4.43 Gbits/sec 0/11481630 (0%) sender 6.12.0 patched EPT_MISCONFIG/second: 58 (~93x reduction) Datagrams/second: ~650k (~1.7x increase) Interval Transfer Bitrate Lost/Total Datagrams 0.00-30.00 sec 26.4 GBytes 7.55 Gbits/sec 0/19554720 (0%) sender Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Jon Kohler <jon@nutanix.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://patch.msgid.link/20250501020428.1889162-1-jon@nutanix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-05-05 18:18:41 -07:00
Dongli Zhang	58465d8607	vhost-scsi: Fix vhost_scsi_send_status() Although the support of VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 was signaled by the commit `664ed90e62` ("vhost/scsi: Set VIRTIO_F_ANY_LAYOUT + VIRTIO_F_VERSION_1 feature bits"), vhost_scsi_send_bad_target() still assumes the response in a single descriptor. Similar issue in vhost_scsi_send_bad_target() has been fixed in previous commit. In addition, similar issue for vhost_scsi_complete_cmd_work() has been fixed by the commit `6dd88fd59d` ("vhost-scsi: unbreak any layout for response"). Fixes: `3ca51662f8` ("vhost-scsi: Add better resource allocation failure handling") Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20250403063028.16045-4-dongli.zhang@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>	2025-04-18 10:08:11 -04:00

1 2 3 4 5 ...

1067 Commits