Pull capabilities update from Serge Hallyn:
"Ryan Foster had sent a patch to add testing of the
rootid_owns_currentns() function. That patch pointed out
that this function was not as clear as it should be. Fix it:
- Clarify the intent of the function in the name
- Split the function so that the base functionality is easier to test
from a kunit test"
* tag 'caps-pr-20251204' of git://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux:
Clarify the rootid_owns_currentns
Pull more drm updates from Dave Airlie:
"There was some additional intel code for color operations we wanted to
land. However I discovered I missed a pull for the xe vfio driver
which I had sorted into 6.20 in my brain, until Thomas mentioned it.
This contains the xe vfio code, a bunch of xe fixes that were waiting
and the i915 color management support. I'd like to include it as part
of keeping the two main vendors on the same page and giving a good
cross-driver experience for userspace when it starts using it.
vfio:
- add a vfio_pci variant driver for Intel
xe/i915 display:
- add plane color management support
xe:
- Add scope-based cleanup helper for runtime PM
- vfio xe driver prerequisites and exports
- fix vfio link error
- Fix a memory leak
- Fix a 64-bit division
- vf migration fix
- LRC pause fix"
* tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel: (25 commits)
drm/i915/color: Enable Plane Color Pipelines
drm/i915/color: Add 3D LUT to color pipeline
drm/i915/color: Add registers for 3D LUT
drm/i915/color: Program Plane Post CSC Registers
drm/i915/color: Program Pre-CSC registers
drm/i915/color: Add framework to program PRE/POST CSC LUT
drm/i915: Add register definitions for Plane Post CSC
drm/i915: Add register definitions for Plane Degamma
drm/i915/color: Add plane CTM callback for D12 and beyond
drm/i915/color: Preserve sign bit when int_bits is Zero
drm/i915/color: Add framework to program CSC
drm/i915/color: Create a transfer function color pipeline
drm/i915/color: Add helper to create intel colorop
drm/i915: Add intel_color_op
drm/i915/display: Add identifiers for driver specific blocks
drm/xe/pf: fix VFIO link error
drm/xe: Protect against unset LRC when pausing submissions
drm/xe/vf: Start re-emission from first unsignaled job during VF migration
drm/xe/pf: Use div_u64 when calculating GGTT profile
drm/xe: Fix memory leak when handling pagefault vma
...
Pull tpm updates from Jarkko Sakkinen:
"This contains changes to unify TPM return code translation between
trusted_tpm2 and TPM driver itself. Other than that the changes are
either bug fixes or minor improvements"
* tag 'tpmdd-next-6.19-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
KEYS: trusted: Use tpm_ret_to_err() in trusted_tpm2
tpm: Use -EPERM as fallback error code in tpm_ret_to_err
tpm: Cap the number of PCR banks
tpm: Remove tpm_find_get_ops
tpm: add WQ_PERCPU to alloc_workqueue users
tpm_crb: add missing loc parameter to kerneldoc
tpm_crb: Fix a spelling mistake
selftests: tpm2: Fix ill defined assertions
Pull ata updates from Niklas Cassel:
- Add DT binding for the Eswin EIC7700 SoC SATA Controller (Yulin Lu)
- Allow 'iommus' property in the Synopsys DWC AHCI SATA controller DT
binding (Rob Herring)
- Replace deprecated strcpy with strscpy in the pata_it821x driver
(Thorsten Blum)
- Add Iomega Clik! PCMCIA ATA/ATAPI Adapter PCMCIA ID to the
pata_pcmcia driver (René Rebe)
- Add ATA_QUIRK_NOLPM quirk for two Silicon Motion SSDs with broken LPM
support (me)
- Add flag WQ_PERCPU to the workqueue in the libata-sff helper library
to explicitly request the use of the per-CPU behavior (Marco
Crivellari)
* tag 'ata-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: Disable LPM on Silicon Motion MD619{H,G}XCLDE3TC
ata: pata_pcmcia: Add Iomega Clik! PCMCIA ATA/ATAPI Adapter
ata: libata-sff: add WQ_PERCPU to alloc_workqueue users
dt-bindings: ata: snps,dwc-ahci: Allow 'iommus' property
ata: pata_it821x: Replace deprecated strcpy with strscpy in it821x_display_disk
dt-bindings: ata: eswin: Document for EIC7700 SoC ahci
Pull virtio updates from Michael Tsirkin:
"Just a bunch of fixes and cleanups, mostly very simple. Several
features were merged through net-next this time around"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_pci: drop kernel.h
vhost: switch to arrays of feature bits
vhost/test: add test specific macro for features
virtio: clean up features qword/dword terms
vduse: add WQ_PERCPU to alloc_workqueue users
virtio_balloon: add WQ_PERCPU to alloc_workqueue users
vdpa/pds: use %pe for ERR_PTR() in event handler registration
vhost: Fix kthread worker cgroup failure handling
virtio: vdpa: Fix reference count leak in octep_sriov_enable()
vdpa/mlx5: Fix incorrect error code reporting in query_virtqueues
virtio: fix map ops comment
virtio: fix virtqueue_set_affinity() docs
virtio: standardize Returns documentation style
virtio: fix grammar in virtio_map_ops docs
virtio: fix grammar in virtio_queue_info docs
virtio: fix whitespace in virtio_config_ops
virtio: fix typo in virtio_device_ready() comment
virtio: fix kernel-doc for mapping/free_coherent functions
virtio_vdpa: fix misleading return in void function
Pull rdma updates from Jason Gunthorpe:
"This has another new RDMA driver 'bng_en' for latest generation
Broadcom NICs. There might be one more new driver still to come.
Otherwise it is a fairly quiet cycle. Summary:
- Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re,
mlx5
- Many bug fix patches for irdma
- WQ_PERCPU annotations and system_dfl_wq changes
- Improved mlx5 support for "other eswitches" and multiple PFs
- 1600Gbps link speed reporting support. Four Digits Now!
- New driver bng_en for latest generation Broadcom NICs
- Bonding support for hns
- Adjust mlx5's hmm based ODP to work with the very large address
space created by the new 5 level paging default on x86
- Lockdep fixups in rxe and siw"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep
RDMA/siw: reclassify sockets in order to avoid false positives from lockdep
RDMA/bng_re: Remove prefetch instruction
RDMA/core: Reduce cond_resched() frequency in __ib_umem_release
RDMA/irdma: Fix SRQ shadow area address initialization
RDMA/irdma: Remove doorbell elision logic
RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+
RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY
RDMA/irdma: Add missing mutex destroy
RDMA/irdma: Fix SIGBUS in AEQ destroy
RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2
RDMA/irdma: Fix data race in irdma_free_pble
RDMA/irdma: Fix data race in irdma_sc_ccq_arm
RDMA/mlx5: Add support for 1600_8x lane speed
RDMA/core: Add new IB rate for XDR (8x) support
IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled
RDMA/bnxt_re: Pass correct flag for dma mr creation
RDMA/bnxt_re: Fix the inline size for GenP7 devices
RDMA/hns: Support reset recovery for bond
RDMA/hns: Support link state reporting for bond
...
Pull iommufd updates from Jason Gunthorpe:
"This is a pretty consequential cycle for iommufd, though this pull is
not too big. It is based on a shared branch with VFIO that introduces
VFIO_DEVICE_FEATURE_DMA_BUF, a DMABUF exporter for a VFIO device's MMIO
PCI BARs. This was a large, multi-series journey over the last year
and a half.
Based on that work, IOMMUFD gains support for VFIO DMABUFs in its
existing IOMMU_IOAS_MAP_FILE, which closes the last major gap to
support PCI peer to peer transfers within VMs.
In Joerg's iommu tree we have the "generic page table" work which aims
to consolidate all the duplicated page table code in every iommu
driver into a single algorithm. This will be used by iommufd to
implement unique page table operations to start adding new features
and improve performance.
In here:
- Expand IOMMU_IOAS_MAP_FILE to accept a DMABUF exported from VFIO.
This is the first step to broader DMABUF support in iommufd, right
now it only works with VFIO. This closes the last functional gap
with classic VFIO type 1 to safely support PCI peer to peer DMA by
mapping the VFIO device's MMIO into the IOMMU.
- Relax SMMUv3 restrictions on nesting domains to better support
qemu's sequence to have an identity mapping before the vSID is
established"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommu/arm-smmu-v3-iommufd: Allow attaching nested domain for GBPA cases
iommufd/selftest: Add some tests for the dmabuf flow
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Allow MMIO pages in a batch
iommufd: Allow a DMABUF to be revoked
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Add DMABUF to iopt_pages
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
Pull VFIO updates from Alex Williamson:
- Move libvfio selftest artifacts in preparation for more tightly
coupled integration with KVM selftests (David Matlack)
- Fix comment typo in mtty driver (Chu Guangqing)
- Support for new hardware revision in the hisi_acc vfio-pci variant
driver where the migration registers can now be accessed via the PF.
When enabled for this support, the full BAR can be exposed to the
user (Longfang Liu)
- Fix vfio cdev support for VF token passing, using the correct size
for the kernel structure, thereby actually allowing userspace to
provide a non-zero UUID token. Also set the match token callback for
the hisi_acc, fixing VF token support for this vfio-pci variant
driver (Raghavendra Rao Ananta)
- Introduce internal callbacks on vfio devices to simplify and
consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
data, removing various ioctl intercepts with a more structured
solution (Jason Gunthorpe)
- Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
to be exposed through dma-buf objects with lifecycle managed through
move operations. This enables low-level interactions such as
vfio-pci based SPDK drivers interacting directly with dma-buf capable
RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
able to build upon this support to fill a long standing feature gap
versus the legacy vfio type1 IOMMU backend with an implementation of
P2P support for VM use cases that better manages the lifecycle of the
P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)
- Convert eventfd triggering for error and request signals to use RCU
mechanisms in order to avoid a 3-way lockdep reported deadlock issue
(Alex Williamson)
- Fix a 32-bit overflow introduced via dma-buf support manifesting with
large DMA buffers (Alex Mastro)
- Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
fault rather than at mmap time. This conversion serves both to make
use of huge PFNMAPs and to avoid corrected RAS events during reset,
by now being subject to vfio-pci-core's use of
unmap_mapping_range(), and to enable a device readiness test after
reset (Ankit Agrawal)
- Refactoring of vfio selftests to support multi-device tests and split
code to provide better separation between IOMMU and device objects.
This work also enables a new test suite addition to measure parallel
device initialization latency (David Matlack)
* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
vfio: selftests: Add vfio_pci_device_init_perf_test
vfio: selftests: Eliminate INVALID_IOVA
vfio: selftests: Split libvfio.h into separate header files
vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
vfio: selftests: Rename vfio_util.h to libvfio.h
vfio: selftests: Stop passing device for IOMMU operations
vfio: selftests: Move IOVA allocator into iova_allocator.c
vfio: selftests: Move IOMMU library code into iommu.c
vfio: selftests: Rename struct vfio_dma_region to dma_region
vfio: selftests: Upgrade driver logging to dev_err()
vfio: selftests: Prefix logs with device BDF where relevant
vfio: selftests: Eliminate overly chatty logging
vfio: selftests: Support multiple devices in the same container/iommufd
vfio: selftests: Introduce struct iommu
vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
vfio: selftests: Allow passing multiple BDFs on the command line
vfio: selftests: Split run.sh into separate scripts
vfio: selftests: Move run.sh into scripts directory
vfio/nvgrace-gpu: wait for the GPU mem to be ready
vfio/nvgrace-gpu: Inform devmem unmapped after reset
...
Pull iommu updates from Joerg Roedel:
- Introduction of the generic IO page-table framework with support for
Intel and AMD IOMMU formats from Jason.
This has good potential for unifying more IO page-table
implementations and making future enhancements easier. But this
also needed quite some fixes during development. All known issues
have been fixed, but my feeling is that there is a higher potential
than usual that more might be needed.
- Intel VT-d updates:
- Use right invalidation hint in qi_desc_iotlb()
- Reduce the scope of INTEL_IOMMU_FLOPPY_WA
- ARM-SMMU updates:
- Qualcomm device-tree binding updates for Kaanapali and Glymur SoCs
and a new clock for the TBU.
- Fix error handling if level 1 CD table allocation fails.
- Permit more than the architectural maximum number of SMRs for
funky Qualcomm mis-implementations of SMMUv2.
- Mediatek driver:
- MT8189 iommu support
- Move ARM IO-pgtable selftests to kunit
- Device leak fixes for a couple of drivers
- Random smaller fixes and improvements
* tag 'iommu-updates-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (81 commits)
iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
powerpc/pseries/svm: Make mem_encrypt.h self contained
genpt: Make GENERIC_PT invisible
iommupt: Avoid a compiler bug with sw_bit
iommu/arm-smmu-qcom: Enable use of all SMR groups when running bare-metal
iommupt: Fix unlikely flows in increase_top()
iommu/amd: Propagate the error code returned by __modify_irte_ga() in modify_irte_ga()
MAINTAINERS: Update my email address
iommu/arm-smmu-v3: Fix error check in arm_smmu_alloc_cd_tables
dt-bindings: iommu: qcom_iommu: Allow 'tbu' clock
iommu/vt-d: Restore previous domain::aperture_end calculation
iommu/vt-d: Fix unused invalidation hint in qi_desc_iotlb
iommu/vt-d: Set INTEL_IOMMU_FLOPPY_WA depend on BLK_DEV_FD
iommu/tegra: fix device leak on probe_device()
iommu/sun50i: fix device leak on of_xlate()
iommu/omap: simplify probe_device() error handling
iommu/omap: fix device leaks on probe_device()
iommu/mediatek-v1: add missing larb count sanity check
iommu/mediatek-v1: fix device leaks on probe()
...
Pull compute express link (CXL) updates from Dave Jiang:
"The additions of note are adding CXL region remove support for locked
CXL decoders, adding unit testing support for XOR address translation,
and adding unit testing support for extended linear cache.
Misc:
- Remove incorrect page-allocator quirk section in documentation
- Remove unused devm_cxl_port_enumerate_dports() function
- Fix typo in cdat.c code comment
- Replace use of system_wq with system_percpu_wq
- Add locked CXL decoder support for region removal
- Return when generic target updated
- Rename region_res_match_cxl_range() to spa_maps_hpa()
- Clarify comment in spa_maps_hpa()
Enable unit testing for XOR address translation of SPA to DPA and vice versa:
- Refactor address translation funcs for testing in cxl_region
- Make the XOR calculations available for testing
- Add cxl_translate module for address translation testing in
cxl_test
Extended Linear Cache changes:
- Add extended linear cache size sysfs attribute
- Adjust failure emission of extended linear cache detection in
cxl_acpi
- Add extended linear cache unit testing support in cxl_test
Preparation refactor patches for PRM translation support:
- Simplify cxl_rd_ops allocation and handling
- Group xor arithmetic setup code in a single block
- Remove local variable @inc in cxl_port_setup_targets()"
* tag 'cxl-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (22 commits)
cxl/test: Assign overflow_err_count from log->nr_overflow
cxl/test: Remove ret_limit race condition in mock_get_event()
cxl/test: remove unused mock function for cxl_rcd_component_reg_phys()
cxl/test: Add support for acpi extended linear cache
cxl/test: Add cxl_test CFMWS support for extended linear cache
cxl/test: Standardize CXL auto region size
cxl/region: Remove local variable @inc in cxl_port_setup_targets()
cxl/acpi: Group xor arithmetic setup code in a single block
cxl: Simplify cxl_rd_ops allocation and handling
cxl: Clarify comment in spa_maps_hpa()
cxl: Rename region_res_match_cxl_range() to spa_maps_hpa()
acpi/hmat: Return when generic target is updated
cxl: Add handling of locked CXL decoder
cxl/region: Add support to indicate region has extended linear cache
cxl: Adjust extended linear cache failure emission in cxl_acpi
cxl/test: Add cxl_translate module for address translation testing
cxl/acpi: Make the XOR calculations available for testing
cxl/region: Refactor address translation funcs for testing
cxl/pci: replace use of system_wq with system_percpu_wq
cxl: fix typos in cdat.c comments
...
Add helpers to program the 3D LUT registers and arm them.
LUT_3D_READY in LUT_3D_CLT is cleared by the HW once
the LUT buffer is loaded into its internal working RAM.
So by the time we try to load/commit new values, we expect
it to be cleared. If not, log an error and return
without writing new values. Do this only when writing with MMIO;
there is no way to read a register within DSB execution.
v2:
- Add information regarding LUT_3D_READY to commit message (Jani)
- Log error instead of a drm_warn and return without committing changes
if 3DLUT HW is not ready to accept new values.
- Refactor intel_color_crtc_has_3dlut()
Also remove Gen10 check (Suraj)
v3:
- Addressed review comments (Suraj)
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-15-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Extract the LUT and program plane post csc registers.
v2: Add DSB support
v3: Add support for single segment 1D LUT
v4:
- s/drm_color_lut_32/drm_color_lut32 (Simon)
- Move declaration to beginning of the function (Suraj)
- Remove multisegmented code, add it later
- Remove dead code for SDR planes, add it later
v5:
- Fix iterator issues
v6: Removed redundant variable (Suraj)
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-13-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Add callback to program Pre-CSC LUT for TGL and beyond
v2: Add DSB support
v3: Add support for single segment 1D LUT color op
v4:
- s/drm_color_lut_32/drm_color_lut32/ (Simon)
- Change commit message (Suraj)
- Improve comments (Suraj)
- Remove multisegmented programming, to be added later
- Remove dead code for SDR planes, add when needed
BSpec: 50411, 50412, 50413, 50414
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-12-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Add a framework to program the CSC. It enables copying of the matrix from
the UAPI to the intel plane state. Also add helper functions which will
eventually program the values to hardware.
Add a crtc state variable to track plane color changes.
v2:
- Add crtc_state->plane_color_changed
- Improve comments (Suraj)
- s/intel_plane_*_color/intel_plane_color_* (Suraj)
v3:
- align parameters with open braces (Suraj)
- Improve commit message (Suraj)
v4:
- Re-arrange variable declaration (Suraj)
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-6-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Add a color pipeline with three colorops in the sequence:
1D LUT - 3x4 CTM - 1D LUT
This pipeline can be used for any color space conversion or HDR
tone mapping.
v2: Change namespace to drm_plane_colorop*
v3: Use simpler/pre-existing colorops for first iteration
v4:
- s/*_tf_*/*_color_* (Jani)
- Refactor to separate files (Jani)
- Add missing space in comment (Suraj)
- Consolidate patch that adds/attaches pipeline property
v5:
- Limit MAX_COLOR_PIPELINES to 2.(Suraj)
Increase it as and when we add more pipelines.
- Remove redundant initialization code (Suraj)
v6:
- Use drm_plane_create_color_pipeline_property() (Arun)
Now MAX_COLOR_PIPELINES is 1
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-5-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
The Makefile logic for building xe_sriov_vfio.o was added incorrectly,
as setting CONFIG_XE_VFIO_PCI=m means it doesn't get included into a
built-in xe driver:
ERROR: modpost: "xe_sriov_vfio_stop_copy_enter" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_stop_copy_exit" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_suspend_device" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_wait_flr_done" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_error" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_resume_data_enter" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_resume_device" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_resume_data_exit" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_data_write" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_migration_supported" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
WARNING: modpost: suppressed 3 unresolved symbol warnings because there were too many)
Instead, check in the Makefile whether CONFIG_XE_VFIO_PCI is enabled to
decide whether to include the object.
Fixes: bd45d46ffc ("drm/xe/pf: Export helpers for VFIO")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20251204094154.1029357-1-arnd@kernel.org
(cherry picked from commit ef7de33544a7a6783d7afe09496da362d1e90ba1)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Using -EFAULT as the tpm_ret_to_err() fallback error code makes it
incompatible with how trusted keys transmute TPM return codes.
Change the fallback to -EPERM in order to gain compatibility with trusted
keys. In addition, map TPM_RC_HASH to -EINVAL in order to be compatible
with tpm2_seal_trusted() return values.
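For illustration, a hedged sketch of the kind of mapping described above;
the case list is an assumption, not the driver's actual translation table:

static int example_tpm_ret_to_err(ssize_t ret)
{
        if (ret < 0)
                return ret;             /* transport error, pass through */

        switch (ret) {
        case TPM2_RC_SUCCESS:
                return 0;
        case TPM2_RC_HASH:
                return -EINVAL;         /* match tpm2_seal_trusted() */
        default:
                return -EPERM;          /* new fallback, was -EFAULT */
        }
}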
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@opinsys.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
tpm2_get_pcr_allocation() does not enforce any upper limit on the number of
banks. Cap the limit at eight banks so that out-of-bounds values coming
from external I/O cause only limited harm.
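A minimal sketch of the cap described above; the constant and variable
names are illustrative, not the actual driver code:

#define TPM2_MAX_PCR_BANKS      8       /* assumed cap of eight banks */

/* clamp the externally provided bank count before allocating */
nr_banks = min_t(u32, nr_banks, TPM2_MAX_PCR_BANKS);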
Cc: stable@vger.kernel.org # v5.10+
Fixes: bcfff8384f ("tpm: dynamically allocate the allocated_banks array")
Tested-by: Lai Yi <yi1.lai@linux.intel.com>
Reviewed-by: Jonathan McDowell <noodles@meta.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@opinsys.com>
tpm_find_get_ops() looks for the first valid TPM if the caller passes in
NULL. All internal users have been converted to either associate
themselves with a TPM directly, or call tpm_default_chip() as part of
their setup. Remove the no longer necessary tpm_find_get_ops().
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jonathan McDowell <noodles@meta.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
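A minimal illustration of the change this applies to alloc_workqueue()
callers; the queue name is hypothetical:

/* before: per-CPU behavior is implicit */
wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM, 0);

/* after: per-CPU behavior is requested explicitly */
wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM | WQ_PERCPU, 0);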
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Update the kerneldoc parameter definitions for __crb_go_idle
and __crb_cmd_ready to include the loc parameter.
Signed-off-by: Stuart Yoder <stuart.yoder@arm.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
The spelling of the word "requrest" is incorrect; it should be "request".
Signed-off-by: Chu Guangqing <chuguangqing@inspur.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Remove parentheses around assert statements in Python. With parentheses,
the assertion tests a non-empty tuple, which always evaluates to True,
making the checks ineffective.
Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
The LRC software ring tail is reset to the first unsignaled pending
job's head.
Fix the re-emission logic to begin submitting from the first unsignaled
job detected, rather than scanning all pending jobs, which can cause
imbalance.
v2:
- Include missing local changes
v3:
- s/skip_replay/restore_replay (Tomasz)
Fixes: c25c1010df ("drm/xe/vf: Replay GuC submission state on pause / unpause")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://patch.msgid.link/20251121152750.240557-1-matthew.brost@intel.com
(cherry picked from commit 00937fe1921ab346b6f6a4beaa5c38e14733caa3)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
All of the necessary building blocks are now in place to support SR-IOV
VF migration.
Flip the enable/disable logic to match VF code and disable the feature
only for platforms that don't meet the necessary prerequisites.
To allow more testing and experiments, on DEBUG builds any missing
prerequisites will be ignored.
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20251127093934.1462188-2-michal.winiarski@intel.com
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 01c724aa7bf84e9d081a56e0cbf1d282678ce144)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Add scope-based helpers for runtime PM that may be used to simplify
cleanup logic and potentially avoid goto-based cleanup.
For example, using
guard(xe_pm_runtime)(xe);
will get runtime PM and cause a corresponding put to occur automatically
when the current scope is exited. 'xe_pm_runtime_noresume' can be used
as a guard replacement for the corresponding 'noresume' variant.
There's also an xe_pm_runtime_ioctl conditional guard that can be used
as a replacement for xe_runtime_ioctl():
ACQUIRE(xe_pm_runtime_ioctl, pm)(xe);
if ((ret = ACQUIRE_ERR(xe_pm_runtime_ioctl, &pm)) < 0)
/* failed */
In a few rare cases (such as gt_reset_worker()) we need to ensure that
runtime PM is dropped when the function is exited by any means
(including error paths), but the function does not need to acquire
runtime PM because that has already been done earlier by a different
function. For these special cases, an 'xe_pm_runtime_release_only'
guard can be used to handle the release without doing an acquisition.
These guards will be used in future patches to eliminate some of our
goto-based cleanup.
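As a hedged sketch of the kind of conversion these guards enable (the
function and the do_work()/do_more_work() helpers are hypothetical; only
the guard name comes from this patch):

/* before: explicit get/put with goto-based cleanup */
int xe_example(struct xe_device *xe)
{
        int err;

        xe_pm_runtime_get(xe);
        err = do_work(xe);              /* hypothetical helper */
        if (err)
                goto out;
        err = do_more_work(xe);         /* hypothetical helper */
out:
        xe_pm_runtime_put(xe);
        return err;
}

/* after: the put runs automatically whenever the scope is exited */
int xe_example(struct xe_device *xe)
{
        int err;

        guard(xe_pm_runtime)(xe);

        err = do_work(xe);
        if (err)
                return err;
        return do_more_work(xe);
}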
v2:
- Specify success condition for xe_pm_runtime_ioctl as _RET >= 0 so
that positive values will be properly identified as success and
trigger destructor cleanup properly.
v3:
- Add comments to the kerneldoc for the existing 'get' functions
indicating that scope-based handling should be preferred where
possible. (Gustavo)
Cc: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://patch.msgid.link/20251118164338.3572146-31-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 59e7528dbfd52efbed05e0f11b2143217a12bc74)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Add a new VFIO selftest for measuring the time it takes to run
vfio_pci_device_init() in parallel for one or more devices.
This test serves as a manual regression test for the performance
improvement of commit e908f58b6b ("vfio/pci: Separate SR-IOV VF
dev_set"). For example, when running this test with 64 VFs under the
same PF:
Before:
$ ./vfio_pci_device_init_perf_test -r vfio_pci_device_init_perf_test.iommufd.init 0000:1a:00.0 0000:1a:00.1 ...
...
Wall time: 6.653234463s
Min init time (per device): 0.101215344s
Max init time (per device): 6.652755941s
Avg init time (per device): 3.377609608s
After:
$ ./vfio_pci_device_init_perf_test -r vfio_pci_device_init_perf_test.iommufd.init 0000:1a:00.0 0000:1a:00.1 ...
...
Wall time: 0.122978332s
Min init time (per device): 0.108121915s
Max init time (per device): 0.122762761s
Avg init time (per device): 0.113816748s
This test does not make any assertions about performance, since any such
assertion is likely to be flaky due to system differences and random
noise. However this test can be fed into automation to detect
regressions, and can be used by developers in the future to measure
performance optimizations.
Suggested-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-19-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Split out the contents of libvfio.h into separate header files, but keep
libvfio.h as the top-level include that all tests can use.
Put all new header files into a libvfio/ subdirectory to avoid future
name conflicts in include paths when libvfio is used by other selftests
like KVM.
No functional change intended.
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-17-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Drop the struct vfio_pci_device wrappers for IOMMU map/unmap functions
and require tests to directly call iommu_map(), iommu_unmap(), etc. This
results in more concise code, and also makes it clear the map operations
are happening on a struct iommu, not necessarily on a specific device,
especially when multi-device tests are introduced.
Do the same for iova_allocator_init() as that function only needs the
struct iommu, not struct vfio_pci_device.
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-14-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Move the IOVA allocator into its own file, to provide better separation
between the allocator and the struct vfio_pci_device helper code.
The allocator could go into iommu.c, but it is standalone enough that a
separate file seems cleaner. This also continues the trend of having a
.c for every major object in VFIO selftests (vfio_pci_device.c,
vfio_pci_driver.c, iommu.c, and now iova_allocator.c).
No functional change intended.
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-13-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Rename struct vfio_dma_region to dma_region. This is in preparation for
separating the VFIO PCI device library code from the IOMMU library code.
This name change also better reflects the fact that DMA mappings can be
managed by either VFIO or IOMMUFD. i.e. the "vfio_" prefix is
misleading.
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-11-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Upgrade various logging in the VFIO selftests drivers from dev_info() to
dev_err(). All of these logs indicate scenarios that may be unexpected.
For example, the logging during probing indicates matching devices
that aren't supported by the driver, and the memcpy errors can indicate
a problem if the caller was not trying to do something like exercise I/O
fault handling. Exercising I/O fault handling is certainly a valid thing
to do, but the driver can't infer the caller's expectations, so better
to just log with dev_err().
Suggested-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-10-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Eliminate overly chatty logs that are printed during almost every test.
These logs are adding more noise than value. If a test cares about this
information it can log it itself. This is especially true as the VFIO
selftests gains support for multiple devices in a single test (which
multiplies all these logs).
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-8-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Support tests that want to add multiple devices to the same
container/iommufd by decoupling struct vfio_pci_device from
struct iommu.
Multi-devices tests can now put multiple devices in the same
container/iommufd like so:
iommu = iommu_init(iommu_mode);
device1 = vfio_pci_device_init(bdf1, iommu);
device2 = vfio_pci_device_init(bdf2, iommu);
device3 = vfio_pci_device_init(bdf3, iommu);
...
vfio_pci_device_cleanup(device3);
vfio_pci_device_cleanup(device2);
vfio_pci_device_cleanup(device1);
iommu_cleanup(iommu);
To account for the new separation of vfio_pci_device and iommu, update
existing tests to initialize and cleanup a struct iommu.
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-7-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Add support for passing multiple device BDFs to a test via the command
line. This is a prerequisite for multi-device tests.
Single-device tests can continue using vfio_selftests_get_bdf(), which
will continue to return argv[argc - 1] (if it is a BDF string), or the
environment variable $VFIO_SELFTESTS_BDF otherwise.
For multi-device tests, a new helper called vfio_selftests_get_bdfs() is
introduced which will return an array of all BDFs found at the end of
argv[], as well as the number of BDFs found (passed back to the caller
via argument). The array of BDFs returned does not need to be freed by
the caller.
The environment variable VFIO_SELFTESTS_BDF continues to support only a
single BDF for the time being.
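A hedged usage sketch; the exact signature is assumed from the description
above rather than taken from the patch:

int nr_bdfs;
const char * const *bdfs = vfio_selftests_get_bdfs(argc, argv, &nr_bdfs); /* assumed signature */

for (int i = 0; i < nr_bdfs; i++)
        printf("device %d: %s\n", i, bdfs[i]);
/* the returned array is owned by the library and must not be freed */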
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-4-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Split run.sh into separate scripts (setup.sh, run.sh, cleanup.sh) to
enable multi-device testing, and prepare for VFIO selftests
automatically detecting which devices to use for testing by storing
device metadata on the filesystem.
- setup.sh takes one or more BDFs as arguments and sets up each device.
Metadata about each device is stored on the filesystem in the
directory:
${TMPDIR:-/tmp}/vfio-selftests-devices
Within this directory is a directory for each BDF, and then files in
those directories that cleanup.sh uses to cleanup the device.
- run.sh runs a selftest by passing it the BDFs of all set up devices.
- cleanup.sh takes zero or more BDFs as arguments and cleans up each
device. If no BDFs are provided, it cleans up all devices.
This split enables multi-device testing by allowing multiple BDFs to be
set up and passed into tests:
For example:
$ tools/testing/selftests/vfio/scripts/setup.sh <BDF1> <BDF2>
$ tools/testing/selftests/vfio/scripts/setup.sh <BDF3>
$ tools/testing/selftests/vfio/scripts/run.sh echo
<BDF1> <BDF2> <BDF3>
$ tools/testing/selftests/vfio/scripts/cleanup.sh
In the future, VFIO selftests can automatically detect set up devices by
inspecting ${TMPDIR:-/tmp}/vfio-selftests-devices. This will avoid the
need for the run.sh script.
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-3-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Move run.sh into a new sub-directory, scripts/. This directory will be used
to house various helper scripts to be used by humans and automation for
running VFIO selftests.
Opportunistically also switch run.sh from TEST_PROGS_EXTENDED to
TEST_FILES. The former is for actual test executables that are just not
run by default. TEST_FILES is a better fit for helper scripts.
No functional change intended.
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-2-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Speculative prefetches from the CPU to GPU memory before the GPU is
ready after reset can cause harmless corrected RAS events to
be logged on Grace systems. It is thus preferred that the
mapping not be re-established until the GPU is ready post reset.
The GPU readiness can be checked through BAR0 registers similar
to the checking at the time of device probe.
It can take several seconds for the GPU to be ready. So it is
desirable that the time overlaps as much of the VM startup as
possible to reduce impact on the VM bootup time. The GPU
readiness state is thus checked on the first fault/huge_fault
request or read/write access which amortizes the GPU readiness
time.
The first fault and read/write access check the GPU state when the
reset_done flag - which denotes that the GPU has just been
reset - is set. The memory_lock is taken across map/access to avoid
races with GPU reset.
Also check whether memory is enabled before waiting for the GPU
to be ready; otherwise the readiness check would block for 30s.
Lastly, add PM handling wrapping the read/write access.
Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Vikram Sethi <vsethi@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-7-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Introduce a new flag reset_done to notify that the GPU has just
been reset and the mapping to the GPU memory is zapped.
Implement the reset_done handler to set this new variable. It
will be used in later patches to wait for the GPU memory
to be ready before doing any mapping or access.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-6-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Split the function that checks for the GPU device being ready at
probe time.
Move the code that waits for the GPU to be ready through BAR0 register
reads into a separate function. This helps reuse the code.
This also fixes a bug where the return status in case of a timeout
gets overridden by the return value of pci_enable_device(). With the fix,
a timeout generates an error as initially intended.
Fixes: d85f69d520 ("vfio/nvgrace-gpu: Check the HBM training and C2C link status")
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-5-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
NVIDIA's Grace based systems have large device memory. The device
memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
module could make use of the huge PFNMAP support added in mm [1].
To make use of the huge pfnmap support, a fault/huge_fault ops
based mapping mechanism needs to be implemented. Currently the nvgrace-gpu
module relies on remap_pfn_range() to do the mapping during VM bootup.
Replace it to instead rely on fault and use vfio_pci_vmf_insert_pfn()
to set up the mapping.
Moreover, to enable huge pfnmap, the nvgrace-gpu module is updated by
adding a huge_fault ops implementation. The implementation establishes
the mapping according to the requested order. Note that if the PFN or the
VMA address is unaligned to the order, the mapping falls back to
the PTE level.
Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Vikram Sethi <vsethi@nvidia.com>
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Refactor vfio_pci_mmap_huge_fault to split out the code that maps
the VMA at the PTE/PMD/PUD level into a separate function.
Export the new function to be used by the nvgrace-gpu module.
Move the alignment check code that verifies the pfn and VMA VA are
aligned to the page order into the header file and make it inline.
No functional change is intended.
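A minimal sketch of the kind of inline alignment check being moved; the
function name is illustrative:

static inline bool vfio_pci_fault_aligned(unsigned long vaddr,
                                          unsigned long pfn,
                                          unsigned int order)
{
        /* both the VMA VA and the PFN must be naturally aligned to the order */
        return IS_ALIGNED(vaddr, PAGE_SIZE << order) &&
               IS_ALIGNED(pfn, 1UL << order);
}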
Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-2-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
fill_sg_entry() splits large DMA buffers into multiple scatter-gather
entries, each holding up to UINT_MAX bytes. When calculating the DMA
address for entries beyond the second one, the expression (i * UINT_MAX)
causes integer overflow due to 32-bit arithmetic.
This manifests when the input arg length is >= 8 GiB, which results in
looping with i >= 2.
Fix by casting i to dma_addr_t before multiplication.
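In code terms, a minimal sketch of the overflow and the cast (variable
names are illustrative):

/* before: i * UINT_MAX is evaluated in 32-bit arithmetic and wraps for i >= 2 */
addr = base_addr + i * UINT_MAX;

/* after: promote the index to dma_addr_t before the multiplication */
addr = base_addr + (dma_addr_t)i * UINT_MAX;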
Fixes: 3aa31a8bb1 ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Signed-off-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20251125-dma-buf-overflow-v1-1-b70ea1e6c4ba@fb.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Thanks to a device generating an ACS violation during bus reset,
lockdep reported the following circular locking issue:
CPU0: SET_IRQS (MSI/X): holds igate, acquires memory_lock
CPU1: HOT_RESET: holds memory_lock, acquires pci_bus_sem
CPU2: AER: holds pci_bus_sem, acquires igate
This results in a potential 3-way deadlock.
Remove the pci_bus_sem->igate leg of the triangle by using RCU
to peek at the eventfd rather than locking it with igate.
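A hedged sketch of the RCU-based signaling pattern described above; the
field name and surrounding code are assumptions, not the exact driver
change:

struct eventfd_ctx *ctx;

rcu_read_lock();
ctx = rcu_dereference(vdev->err_trigger);       /* assumed RCU-protected field */
if (ctx)
        eventfd_signal(ctx);                    /* signal without taking igate */
rcu_read_unlock();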
Fixes: 3be3a074cf ("vfio-pci: Don't use device_lock around AER interrupt setup")
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251124223623.2770706-1-alex@shazbot.org
Signed-off-by: Alex Williamson <alex@shazbot.org>
VT-d second stage HW specifies both the maximum IOVA and the supported
table walk starting points. Weirdly there is HW that only supports a 4
level walk but has a maximum IOVA that only needs 3.
The current code miscalculates this and creates a wrongly sized page table
which ultimately fails the compatibility check for number of levels.
This is fixed by allowing the page table to be created with both a vasz
and top_level input. The vasz will set the aperture for the domain while
the top_level will set the page table geometry.
Add top_level to vtdss and correct the logic in VT-d to generate the right
top_level and vasz from mgaw and sagaw.
Fixes: d373449d8e ("iommu/vt-d: Use the generic iommu page table")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Add the missing forward declarations and includes so mem_encrypt.h does
not have implicit dependencies. It is a public header imported by
drivers. Users should not have to guess what include files are needed.
Resolves a kbuild splat:
In file included from drivers/iommu/generic_pt/fmt/iommu_amdv1.c:15:
In file included from drivers/iommu/generic_pt/fmt/iommu_template.h:36:
In file included from drivers/iommu/generic_pt/fmt/amdv1.h:23:
In file included from include/linux/mem_encrypt.h:17:
>> arch/powerpc/include/asm/mem_encrypt.h:13:49: warning: declaration of 'struct device' will not be visible outside of this function [-Wvisibility]
13 | static inline bool force_dma_unencrypted(struct device *dev)
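A minimal sketch of the kind of fix, making the header carry its own
forward declaration (the include and function body are illustrative, not
the exact patch):

/* arch/powerpc/include/asm/mem_encrypt.h (sketch) */
#include <asm/svm.h>    /* assumed include providing is_secure_guest() */

struct device;          /* forward declaration so the prototype below is self-contained */

static inline bool force_dma_unencrypted(struct device *dev)
{
        return is_secure_guest();       /* illustrative body */
}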
Fixes: 879ced2bab ("iommupt: Add the AMD IOMMU v1 page table format")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511161358.rS5pSb3U-lkp@intel.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
There is no point in asking the user about the Generic Radix Page
Table API:
- All IOMMU drivers that use this API already select GENERIC_PT when
needed,
- Most users probably do not know what to answer anyway.
Fixes: 7c5b184db7 ("genpt: Generic Page Table base API")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
gcc 13, in some cases, gets confused if the __builtin_constant_p() is
inside the switch. It thinks that bitnr can have the value max+1 and
fails. Lift the check outside the switch to avoid it.
Fixes: ef7bfe5bbf ("iommupt/x86: Support SW bits and permit PT_FEAT_DMA_INCOHERENT")
Fixes: 5448c1558f ("iommupt: Add the Intel VT-d second stage page table format")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511242012.I7g504Ab-lkp@intel.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The prefetch instruction is meant to speed up access to memory referenced
by its address argument. In the bng_re code path, it has no effect,
because the pointer refers to a function that is executed immediately
afterward.
The issue was identified due to the following kbuild compilation error:
drivers/infiniband/hw/bng_re/bng_fw.c: In function 'bng_re_creq_irq':
drivers/infiniband/hw/bng_re/bng_fw.c:278:9: error: implicit
declaration of function 'prefetch' [-Wimplicit-function-declaration]
278 | prefetch(bng_re_get_qe(hwq, sw_cons, NULL));
| ^~~~~~~~
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511260607.Kuxn4NnN-lkp@intel.com/
Fixes: 4f830cd8d7 ("RDMA/bng_re: Add infrastructure for enabling Firmware channel")
Link: https://patch.msgid.link/20251126-remove-prefetch-v1-1-fcac22007ea7@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251107154917.313090-3-marco.crivellari@suse.com>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251107154917.313090-2-marco.crivellari@suse.com>
Use %pe instead of %ps when printing ERR_PTR() values. %ps is intended
for printing symbol names, while %pe correctly prints symbolic error names
for error pointers returned via ERR_PTR().
This shows the returned error value more clearly.
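For illustration (the device pointer and message text are hypothetical):

void *handler = ERR_PTR(-ENOMEM);       /* stand-in for the failed registration */

dev_err(dev, "register failed: %ps\n", handler);  /* resolves a symbol name, misleading here */
dev_err(dev, "register failed: %pe\n", handler);  /* prints "-ENOMEM" */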
Fixes: 67f27b8b3a ("pds_vdpa: subscribe to the pds_core events")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251018174705.1511982-1-alok.a.tiwari@oracle.com>
pci_get_device() will increase the reference count for the returned
pci_dev, and also decrease the reference count for the input parameter
'from' if it is not NULL.
If we break out of the loop with 'vf_pdev' not NULL, we
need to call pci_dev_put() to decrease the reference count.
Found via static analysis; this is similar to commit c508eb042d
("perf/x86/intel/uncore: Fix reference count leak in sad_cfg_iio_topology()")
Fixes: 8b6c724cda ("virtio: vdpa: vDPA driver for Marvell OCTEON DPU devices")
Cc: stable@vger.kernel.org
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251027060737.33815-1-linmq006@gmail.com>
When query_virtqueues() fails, the error log prints the variable err
instead of cmd->err. Since err may still be zero at this point, the
log message can misleadingly report a success value 0 even though the
command actually failed.
Even worse, once err is set to the first failure, subsequent logs
print that same stale value. This makes the error reporting appear
one step behind the actual failing queue index, which is confusing
and misleading.
Fix the log to report cmd->err, which reflects the real failure code
returned by the firmware.
Fixes: 1fcdf43ea6 ("vdpa/mlx5: Use async API for vq query command")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20250929134258.80956-1-alok.a.tiwari@oracle.com>
Remove colons after "Returns" in virtio_map_ops function
documentation - both to avoid triggering an htmldoc warning
and for consistency with virtio_config_ops.
This affects map_page, alloc, need_sync, and max_mapping_size.
Fixes: bee8c7c24b ("virtio: introduce map ops in virtio core")
Message-Id: <c262893fa21f4b1265147ef864574a9bd173348f.1763026134.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Documentation build reported:
WARNING: ./drivers/virtio/virtio_ring.c:3174 function parameter 'vaddr' not described in 'virtqueue_map_free_coherent'
WARNING: ./drivers/virtio/virtio_ring.c:3308 expecting prototype for virtqueue_mapping_error(). Prototype was for virtqueue_map_mapping_error() instead
The kernel-doc block for virtqueue_map_free_coherent() omitted the @vaddr parameter, and
the kernel-doc header for virtqueue_map_mapping_error() used the wrong function name
(virtqueue_mapping_error) instead of the actual function name.
This change:
- updates the function name in the comment to virtqueue_map_mapping_error()
- adds the missing @vaddr description in the comment for
  virtqueue_map_free_coherent() (sketched below)
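A rough sketch of the resulting kernel-doc shape (other parameters are omitted
and the @vaddr wording is illustrative):

  /**
   * virtqueue_map_mapping_error - ...
   *   (the header now names the function it actually documents)
   */

  /**
   * virtqueue_map_free_coherent - ...
   * @vaddr: virtual address of the buffer to free  (the previously missing line)
   */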
Fixes: b41cb3bcf6 ("virtio: rename dma helpers")
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251110202920.2250244-1-kriish.sharma2006@gmail.com>
virtio_vdpa_set_status() is declared as returning void, but it used
"return vdpa_set_status()". Since vdpa_set_status() also returns
void, the return statement is unnecessary and misleading.
Remove it.
Fixes: c043b4a8cf ("virtio: introduce a vDPA based transport")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Message-Id: <20251001191653.1713923-1-alok.a.tiwari@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Jason Gunthorpe says:
====================
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to only work with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:
https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com/
The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and if the underlying alignment
requirements are met it will be placed in the iommu_domain.
Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.
The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.
Instead iommufd relies on revokable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd
which will unmap it from the iommu_domain. There is no automatic remap,
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. Eg when QEMU goes to issue FLR it should
do the map/unmap to iommufd.
Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.
The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().
Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().
I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.
The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com
And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
====================
Based on a shared branch with vfio.
* iommufd_dmabuf:
iommufd/selftest: Add some tests for the dmabuf flow
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Allow MMIO pages in a batch
iommufd: Allow a DMABUF to be revoked
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Add DMABUF to iopt_pages
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
vfio/nvgrace: Support get_dmabuf_phys
vfio/pci: Add dma-buf export support for MMIO regions
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Share the core device pointer while invoking feature functions
vfio: Export vfio device get and put registration helpers
dma-buf: provide phys_vec to scatter-gather mapping routine
PCI/P2PDMA: Document DMABUF model
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Separate the mmap() support from the core logic
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A vDEVICE has been a hard requirement for attaching a nested domain to the
device. This makes sense when installing a guest STE, since a vSID must be
present and given to the kernel during the vDEVICE allocation.
But, when CR0.SMMUEN is disabled, the VM doesn't really need a vSID to program
the vSMMU behavior, as GBPA will take effect; in that case the vSTE in the
nested domain could have carried the bypass or abort configuration in the GBPA
register. Thus, having such a hard requirement doesn't work well for GBPA.
Skip vmaster allocation in arm_smmu_attach_prepare_vmaster() for an abort
or bypass vSTE. Note that device on this attachment won't report vevents.
Update the uAPI doc accordingly.
Link: https://patch.msgid.link/r/20251103172755.2026145-1-nicolinc@nvidia.com
Tested-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The current implementation calls cond_resched() for every SG entry
in __ib_umem_release(), which can increase needless overhead.
This patch introduces RESCHED_LOOP_CNT_THRESHOLD (0x1000) to limit
how often cond_resched() is called. The function now yields the CPU
once every 4096 iterations, and also yields on the very first iteration
to cover the case of many small umems, reducing scheduling overhead.
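A rough sketch of the throttling pattern; the iteration details and
surrounding declarations are simplified and illustrative, not the exact
driver code:

  #define RESCHED_LOOP_CNT_THRESHOLD 0x1000  /* 4096 */

  for_each_sg(sgt->sgl, sg, sgt->orig_nents, i) {
      /* yields on the very first entry and then once every 4096 entries */
      if (i % RESCHED_LOOP_CNT_THRESHOLD == 0)
          cond_resched();
      unpin_user_page_range_dirty_lock(sg_page(sg),
                                       DIV_ROUND_UP(sg->length, PAGE_SIZE),
                                       make_dirty);
  }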
Fixes: d056bc45b6 ("RDMA/core: Prevent soft lockup during large user memory region cleanup")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251126025147.2627-1-lirongqing@baidu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The HW disables bounds checking for MRs with a length of zero, so
the driver will only allow a zero length MR if the "all_memory"
flag is set, and this flag is only set if IB_PD_UNSAFE_GLOBAL_RKEY
is set for the PD.
This means that the "get_dma_mr" method will currently fail unless
the IB_PD_UNSAFE_GLOBAL_RKEY flag is set. This has not been an issue
because the "get_dma_mr" method is only ever invoked if the device
does not support the local DMA key or if IB_PD_UNSAFE_GLOBAL_RKEY
is set, and so far, all IRDMA HW supports the local DMA lkey.
However, some new HW does not support the local DMA lkey, so the
"get_dma_mr" method needs to work without IB_PD_UNSAFE_GLOBAL_RKEY
being set.
To support HW that does not allow the local DMA lkey, the logic has
been changed to pass an explicit flag to indicate when a dma_mr is
being created so that the zero length will be allowed.
Also, the "all_memory" flag has been forced to false for normal MR
allocation since these MRs are never supposed to provide global
unsafe rkey semantics anyway; only the MR created with "get_dma_mr"
should support this.
Fixes: bb6d73d9ad ("RDMA/irdma: Prevent zero-length STAG registration")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-7-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add a lock around the irdma_sc_ccq_arm() body to prevent an inter-thread
data race.
This fixes the data race in irdma_sc_ccq_arm() reported by KCSAN:
BUG: KCSAN: data-race in irdma_sc_ccq_arm [irdma] / irdma_sc_ccq_arm [irdma]
read to 0xffff9d51b4034220 of 8 bytes by task 255 on cpu 11:
irdma_sc_ccq_arm+0x36/0xd0 [irdma]
irdma_cqp_ce_handler+0x300/0x310 [irdma]
cqp_compl_worker+0x2a/0x40 [irdma]
process_one_work+0x402/0x7e0
worker_thread+0xb3/0x6d0
kthread+0x178/0x1a0
ret_from_fork+0x2c/0x50
write to 0xffff9d51b4034220 of 8 bytes by task 89 on cpu 3:
irdma_sc_ccq_arm+0x7e/0xd0 [irdma]
irdma_cqp_ce_handler+0x300/0x310 [irdma]
irdma_wait_event+0xd4/0x3e0 [irdma]
irdma_handle_cqp_op+0xa5/0x220 [irdma]
irdma_hw_flush_wqes+0xb1/0x300 [irdma]
irdma_flush_wqes+0x22e/0x3a0 [irdma]
irdma_cm_disconn_true+0x4c7/0x5d0 [irdma]
irdma_disconnect_worker+0x35/0x50 [irdma]
process_one_work+0x402/0x7e0
worker_thread+0xb3/0x6d0
kthread+0x178/0x1a0
ret_from_fork+0x2c/0x50
value changed: 0x0000000000024000 -> 0x0000000000034000
Fixes: 3f49d68425 ("RDMA/irdma: Implement HW Admin Queue OPs")
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-2-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Some platforms (e.g. SC8280XP and X1E) support more than 128 stream
matching groups. This is more than what is defined as maximum by the ARM
SMMU architecture specification. Commit 1226113473 ("iommu/arm-smmu-qcom:
Limit the SMR groups to 128") disabled use of the additional groups because
they don't exhibit the same behavior as the architecture supported ones.
It seems like this is just another quirk of the hypervisor: When running
bare-metal without the hypervisor, the additional groups appear to behave
just like all others. The boot firmware uses some of the additional groups,
so ignoring them in this situation leads to stream match conflicts whenever
we allocate a new SMR group for the same SID.
The workaround exists primarily because the bypass quirk detection fails
when using a S2CR register from the additional matching groups, so let's
perform the test with the last reliable S2CR (127) and then limit the
number of SMR groups only if we detect that we are running below the
hypervisor (because of the bypass quirk).
Fixes: 1226113473 ("iommu/arm-smmu-qcom: Limit the SMR groups to 128")
Signed-off-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
When connected to VFIO, the only DMABUF exporter that is accepted, the
move_notify callback will be made when VFIO wants to remove access to the
MMIO. This is being called revoke.
Wire up revoke to go through all the iommu_domain's that have mapped the
DMABUF and unmap them.
The locking here is unpleasant. Since the existing locking scheme was
designed to come from the iopt through the area to the pages, we cannot use
pages as the starting point for the locking. There is no way to obtain the
domains_rwsem before obtaining the pages mutex to reliably use the
existing domains_itree.
Solve this problem by adding a new tracking structure just for DMABUF
revoke. Record a linked list of areas and domains inside the pages
mutex. Clean the entries on the list during revoke. The map/unmaps are now
all done under a pages mutex while updating the tracking linked list so
nothing can get out of sync. Only one lock is required for revoke
processing.
Link: https://patch.msgid.link/r/4-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Once a DMABUF is revoked the domain will be unmapped under the pages
mutex. Double unmapping will trigger a WARN, and mapping while revoked
will fail.
Check for revoked DMABUFs along all the map and unmap paths to resolve
this. Ensure that map/unmap is always done under the pages mutex so it is
synchronized with the revoke notifier.
If a revoke happens between allocating the iopt_pages and the population
to a domain then the population will succeed, and leave things unmapped as
though revoke had happened immediately after.
Currently there is no way to repopulate the domains. Userspace is expected
to know when it is going to do something that would trigger revoke (eg if it
is about to do a FLR); it should remove the DMABUF mappings before and put
them back after. The revoke is only to protect the kernel
from mis-behaving userspace.
Link: https://patch.msgid.link/r/3-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add IOPT_ADDRESS_DMABUF to the iopt_pages and the basic infrastructure to
create an iopt_pages from a struct dma_buf *.
DMABUF pages are not supported for accesses, and for now can only be used
with the VFIO DMABUF exporter.
The overall flow will be similar to memfd where the user can pass in a
DMABUF file descriptor to IOMMU_IOAS_MAP_FILE and create an area and
pages. Like other areas it can be copied and otherwise manipulated, though
there is little point in doing so.
There is no pinned page accounting done for DMABUF maps.
The DMABUF attachment exists so long as the dmabuf is mapped into an IOAS,
even if the IOAS is not mapped to any domains.
Link: https://patch.msgid.link/r/2-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This function is used to establish the "private interconnect" between the
VFIO DMABUF exporter and the iommufd DMABUF importer. This is intended to
be a temporary API until the core DMABUF interface is improved to natively
support a private interconnect and revocable negotiation.
This function should only be called by iommufd when trying to map a
DMABUF. For now iommufd will only support VFIO DMABUFs.
The following improvements are needed in the DMABUF API to generically
support more exporters with iommufd/kvm type importers that cannot use the
DMA API:
1) Revoke semantics. VFIO needs to be able to prevent access to the MMIO
during FLR, and so it will use dma_buf_move_notify() to prevent
access. iommufd does not support fault handling so it cannot
implement the full move_notify. Instead, if revoke is negotiated the
exporter promises not to use move_notify() unless the importer can
experience failures. iommufd will unmap the dmabuf from the iommu page
tables while it is revoked.
2) Private interconnect negotiation. iommufd will only be able to map
a "private interconnect" that provides a phys_addr_t and a
struct p2pdma_provider * to describe the memory. It cannot use a DMA
mapped scatterlist since it is directly calling iommu_map().
3) NULL device during dma_buf_dynamic_attach(). Since iommufd doesn't use
the DMA API it doesn't have a DMAable struct device to pass here.
Link: https://patch.msgid.link/r/1-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Since increase_top() does its own READ_ONCE() on top_of_table, the
caller's prior READ_ONCE() could be inconsistent and the first time
through the loop we may actually already have the right level if two
threads are racing map operations.
In this case new_level will be left uninitialized.
Further all the exits from the loop have to either commit to the new top
or free any memory allocated so the early return must be a goto err_free.
Make it so the only break from the loop always sets new_level to the right
value and all other exits go to err_free. Use pts.level (the pts
represents the top we are stacking) within the loop instead of new_level.
Fixes: dcd6a011a8 ("iommupt: Add map_pages op")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/aRwgNW9PiW2j-Qwo@stanley.mountain
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The return type of __modify_irte_ga() is int, but modify_irte_ga()
treats it as a bool. Casting the int to bool discards the error code.
To fix the issue, change the type of ret to int in modify_irte_ga().
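A self-contained illustration of the truncation; the stub below is made up and
is not the driver code:

  /* stand-in for __modify_irte_ga(), which returns 0 or a negative errno */
  static int stub_modify(void)
  {
      return -EINVAL;
  }

  static int caller(void)
  {
      int ret = stub_modify();  /* was 'bool ret', which collapsed -EINVAL to 1 */

      if (ret)
          return ret;  /* now propagates -EINVAL instead of returning 1 */
      return 0;
  }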
Fixes: 57cdb720ea ("iommu/amd: Do not flush IRTE when only updating isRun and destination fields")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In arm_smmu_alloc_cd_tables(), the error check following the
dma_alloc_coherent() for cd_table->l2.l1tab incorrectly tests
cd_table->l2.l2ptrs.
This means an allocation failure for l1tab goes undetected, causing
the function to return 0 (success) erroneously.
Correct the check to test cd_table->l2.l1tab.
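The shape of the fix, roughly; the device, size and DMA-handle names here are
assumptions for illustration:

  dma_addr_t l1_dma;

  cd_table->l2.l1tab = dma_alloc_coherent(smmu->dev, l1size, &l1_dma,
                                          GFP_KERNEL);
  if (!cd_table->l2.l1tab)  /* the old code tested cd_table->l2.l2ptrs here */
      return -ENOMEM;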
Fixes: e3b1be2e73 ("iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg")
Signed-off-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: Ryan Huang <tzukui@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
Some IOMMUs on some platforms (there doesn't seem to be a good denominator
for this) require the presence of a third clock, specifically relating
to the instance's Translation Buffer Unit (TBU).
Stephan Gerhold noted [1] that according to Qualcomm Snapdragon 410E
Processor (APQ8016E) Technical Reference Manual, SMMU chapter, section
"8.8.3.1.2 Clock gating", which reads:
For APPS TCU/TBU (TBU to TCU interface is asynchronous)
Software should turn ON clock to APPS TCU
- During APPS TCU register programming sequence
For GPU TCU/TBU (TBU to TCU interface is synchronous)
Software should turn ON clock to GPU TBU
- During GPU TLB invalidation sequence <=====================
Software should turn ON clock to GPU TCU
- During GPU TCU register programming sequence
- While GPU master clock is Active
The clock should be turned on at least during TLB invalidation on the
GPU SMMU instance. This is corroborated by Commit 5bc1cf1466
("iommu/qcom: add optional 'tbu' clock for TLB invalidate").
This is also not to be confused with qcom,sdm845-tbu, which is a
description of a debug interface, absent on the generation of hardware
that this binding describes.
Allow this clock.
[1] https://lore.kernel.org/linux-arm-msm/aPX_cKtial56AgvU@linaro.org/
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
Enabling 5-level paging (LA57) increases TASK_SIZE on x86_64 from 2^47
to 2^56. This affects implicit ODP, which uses TASK_SIZE to calculate
the number of IMR KSM entries.
As a result, the number of entries and the memory usage for KSM mkeys
increase drastically:
- With 2^47 TASK_SIZE: 0x20000 entries (~2MB)
- With 2^56 TASK_SIZE: 0x4000000 entries (~1GB)
This issue could happen previously on systems with LA57 manually
enabled, but now commit 7212b58d6d ("x86/mm/64: Make 5-level paging
support unconditional") enables LA57 by default on all supported
systems. This makes the issue impact widespread.
To mitigate this, increase the size each MTT entry maps from 1GB to 16GB
when 5-level paging is enabled. This reduces the number of KSM entries
and lowers the memory usage on LA57 systems from 1GB to 64MB per IMR.
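For reference, the entry counts above fall straight out of TASK_SIZE divided
by the per-entry coverage; the memory figures then follow if one assumes
roughly 16 bytes per KSM entry (an assumption used only for this
back-of-the-envelope check):

  2^47 / 2^30 (1GB per entry)   = 2^17 = 0x20000    entries  ->  ~2MB
  2^56 / 2^30 (1GB per entry)   = 2^26 = 0x4000000  entries  ->  ~1GB
  2^56 / 2^34 (16GB per entry)  = 2^22 = 0x400000   entries  -> ~64MB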
Since 'mlx5_imr_mtt_size' is now larger than 32 bits, use u64 instead
of int in populate_klm() to prevent overflow of the 'step' variable.
In addition, since populate_klm() actually handles KSM and not KLM (it
is used only by implicit ODP), rename its signature and the internal
structures accordingly while dropping the byte_count handling, which is
not relevant for KSM. The page size in KSM is fixed for all the entries
and comes from the log_page_size of the mkey.
Note:
On platforms where the calculated value for 'mlx5_imr_ksm_page_shift' is
higher than the max firmware cap to be changed over UMR, or that the
calculated value for 'log_va_pages' is higher than what we may expect,
the implicit ODP cap will be simply turned off.
Co-developed-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251120-reduce-ksm-v1-1-6864bfc814dc@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When conditions are met, schedule a delayed work in bond event handler
to perform bonding operation according to the bond state. In the case
of changing slave number or link state, re-set the netdev for the bond
ibdev after the modification is complete, since these two operations
may not call hns_roce_set_bond_netdev() in hns_roce_init().
The delayed work will be paused when there is a driver reset or exit
to avoid concurrency.
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Implement hns_roce_slave_init() and hns_roce_slave_uninit() for device
init/uninit in bonding cases. The former is used to initialize a slave
ibdev (when the slave is unlinked from a bond) or a bond ibdev, while
the latter does the opposite. Most of the process is the same as
regular device init/uninit, while some bonding‑specific steps below are
also added.
In bond device init flow, choose one slave to re-initialize as the
main_hr_dev of the bond, and it will be the only device presented for
multiple slaves. During registration, set an active netdev for the
ibdev based on the link state of the slaves. When this main_hr_dev
slave is being unlinked while the bond is still valid, choose a new
slave from the rest and initialize it as the new bond device.
In uninit flow, add a bond cleanup process, restore all the other
slaves and clean up bond resource. This is only for the case where
the port of main_hr_dev is directly removed without unlinking it
from bond.
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Register netdev notifier for two bonding events NETDEV_CHANGEUPPER
and NETDEV_CHANGELOWERSTATE.
In the NETDEV_CHANGEUPPER event handler, check some rules about the HW
constraints when trying to link a new slave to the master, and
store some bonding information from the notifier. In the unlinking case,
simply check the number of the rest slaves to decide whether the
bond is still supported.
In NETDEV_CHANGELOWERSTATE event handler, not much is done. It
simply sets the bond state when the bond is ready, which will be
used later.
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the
synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on
the PCI bar.
This demonstrates a DMABUF that follows the region info report to only
allow mapping parts of the region that are mmapable. Since the BAR is
power of two sized and the "CXL" region is just page aligned, there can
be a padding region at the end that is not mmapped or passed into the
DMABUF.
The "CXL" ranges that are remapped into BAR 2 and BAR 4 areas are not PCI
MMIO, they actually run over the CXL-like coherent interconnect and for
the purposes of DMA behave identically to DRAM. We don't try to model this
distinction between true PCI BAR memory that takes a real PCI path and the
"CXL" memory that takes a different path in the p2p framework for now.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-11-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.
Currently VFIO can take MMIO regions from the device's BAR and map
them into a PFNMAP VMA with special PTEs. This mapping type ensures
the memory cannot be used with things like pin_user_pages(), hmm, and
so on. In practice only the user process CPU and KVM can safely make
use of these VMAs. When VFIO shuts down, these VMAs are cleaned by
unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
unbind.
However, VFIO type 1 has an insecure behavior where it uses
follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
into the IOMMU. This has a long history of enabling P2P DMA inside
VMs, but has serious lifetime problems by allowing a UAF of the MMIO
after the VFIO driver has been unbound.
Introduce DMABUF as a new safe way to export a FD based handle for the
MMIO regions. This can be consumed by existing DMABUF importers like
RDMA or DRM without opening an UAF. A following series will add an
importer to iommufd to obsolete the type 1 code and allow safe
UAF-free MMIO P2P in VM cases.
DMABUF has a built in synchronous invalidation mechanism called
move_notify. VFIO keeps track of all drivers importing its MMIO and
can invoke a synchronous invalidation callback to tell the importing
drivers to DMA unmap and forget about the MMIO pfns. This process is
being called revoke. This synchronous invalidation fully prevents any
lifecycle problems. VFIO will do this before unbinding its driver
ensuring there is no UAF of the MMIO beyond the driver lifecycle.
Further, VFIO has additional behavior to block access to the MMIO
during things like Function Level Reset. This is because some poor
platforms may experience a MCE type crash when touching MMIO of a PCI
device that is undergoing a reset. Today this is done by using
unmap_mapping_range() on the VMAs. Extend that into the DMABUF world
and temporarily revoke the MMIO from the DMABUF importers during FLR
as well. This will more robustly prevent an errant P2P from possibly
upsetting the platform.
A DMABUF FD is a preferred handle for MMIO compared to using something
like a pgmap because:
- VFIO is supported, including its P2P feature, on archs that don't
support pgmap
- PCI devices have all sorts of BAR sizes, including ones smaller
than a section so a pgmap cannot always be created
- It is undesirable to waste a lot of memory for struct pages,
especially for a case like a GPU with ~100GB of BAR size
- We want a synchronous revoke semantic to support FLR with light
hardware requirements
Use the P2P subsystem to help generate the DMA mapping. This is a
significant upgrade over the abuse of dma_map_resource() that has
historically been used by DMABUF exporters. Experience with an OOT
version of this patch shows that real systems do need this. This
approach deals with all the P2P scenarios:
- Non-zero PCI bus_offset
- ACS flags routing traffic to the IOMMU
- ACS flags that bypass the IOMMU - though vfio noiommu is required
to hit this.
There will be further work to formalize the revoke semantic in
DMABUF. For now this acts like a move_notify dynamic exporter where
importer fault handling will get a failure when they attempt to map.
This means that only fully restartable fault capable importers can
import the VFIO DMABUFs. A future revoke semantic should open this up
to more HW as the HW only needs to invalidate, not handle restartable
faults.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-10-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Make sure that all VFIO PCI devices have peer-to-peer capabilities
enabled, so we are able to export their MMIO memory through DMABUF.
VFIO has always supported P2P mappings with itself. VFIO type 1
insecurely reads PFNs directly out of a VMA's PTEs and programs them
into the IOMMU allowing any two VFIO devices to perform P2P to each
other.
All existing VMMs use this capability to export P2P into a VM where
the VM could setup any kind of DMA it likes. Projects like DPDK/SPDK
are also known to make use of this, though less frequently.
As a first step to more properly integrating VFIO with the P2P
subsystem unconditionally enable P2P support for VFIO PCI devices. The
struct p2pdma_provider will act as a handle to the P2P subsystem to
do things like DMA mapping.
While real PCI devices have to support P2P (they can't even tell if an
IOVA is P2P or not) there may be fake PCI devices that may trigger
some kind of catastrophic system failure. To date VFIO has never
tripped up on such a case, but if one is discovered the plan is to add
a PCI quirk and have pcim_p2pdma_init() fail. This will fully block
the broken device throughout any users of the P2P subsystem in the
kernel.
Thus P2P through DMABUF will follow the historical VFIO model and be
unconditionally enabled by vfio-pci.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-9-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Add dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt() helpers to convert
an array of MMIO physical address ranges into scatter-gather tables with
proper DMA mapping.
These common functions are a starting point and support any PCI
drivers creating mappings from their BAR's MMIO addresses. VFIO is one
case, as shortly will be RDMA. We can review existing DRM drivers to
refactor them separately. We hope this will evolve into routines to
help common DRM that include mixed CPU and MMIO mappings.
Compared to the dma_map_resource() abuse this implementation handles
the complicated PCI P2P scenarios properly, especially when an IOMMU
is enabled:
- Direct bus address mapping without IOVA allocation for
PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
happens if the IOMMU is enabled but the PCIe switch ACS flags allow
transactions to avoid the host bridge.
Further, this handles the slightly obscure case of MMIO with a
phys_addr_t that is different from the physical BAR programming
(bus offset). The phys_addr_t is converted to a dma_addr_t and
accommodates this effect. This enables certain real systems to
work, especially on ARM platforms.
- Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
This happens when the IOMMU is enabled and the ACS flags are forcing
all traffic to the IOMMU - ie for virtualization systems.
- Cases where P2P is not supported through the host bridge/CPU. The
P2P subsystem is the proper place to detect this and block it.
Helper functions fill_sg_entry() and calc_sg_nents() handle the
scatter-gather table construction, splitting large regions into
UINT_MAX-sized chunks to fit within sg->length field limits.
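A rough sketch of the entry-count side of that split (the helper name and
rounding details are illustrative; the real helpers may differ):

  /* number of sg entries needed to describe 'length' bytes of MMIO */
  static unsigned int nents_for_length(u64 length)
  {
      return DIV_ROUND_UP_ULL(length, UINT_MAX);
  }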
Since the physical address based DMA API forbids use of the CPU list
of the scatterlist, this will produce a mangled scatterlist that has
a fully zero-length and NULL'd CPU list: the list is 0 length and
all the struct page pointers are NULL and zero sized. This is stronger
and more robust than the existing mangle_sg_table() technique. It is
a future project to migrate DMABUF as a subsystem away from using
scatterlist for this data structure.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-6-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Provide an access to pci_p2pdma_map_type() function to allow subsystems
to determine the appropriate mapping type for P2PDMA transfers between
a provider and target device.
The pci_p2pdma_map_type() function is the core P2P layer version of
the existing public, but struct page focused, pci_p2pdma_state()
function. It returns the same result. It is required to use the p2p
subsystem from drivers that don't use the struct page layer.
Like __pci_p2pdma_update_state() it is not an exported function. The
idea is that only subsystem code will implement mapping helpers for
taking in phys_addr_t lists, this is deliberately not made accessible
to every driver to prevent abuse.
Following patches will use this function to implement a shared DMA
mapping helper for DMABUF.
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-4-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Refactor the PCI P2PDMA subsystem to separate the core peer-to-peer DMA
functionality from the optional memory allocation layer. This creates a
two-tier architecture:
The core layer provides P2P mapping functionality for physical addresses
based on PCI device MMIO BARs and integrates with the DMA API for
mapping operations. This layer is required for all P2PDMA users.
The optional upper layer provides memory allocation capabilities
including gen_pool allocator, struct page support, and sysfs interface
for user space access.
This separation allows subsystems like DMABUF to use only the core P2P
mapping functionality without the overhead of memory allocation features
they don't need. The core functionality is now available through the
new pcim_p2pdma_provider() function that returns a p2pdma_provider
structure.
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-3-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer
to the p2pdma_provider structure instead of the pci_p2pdma_map_state.
This simplifies the API by removing the need for callers to extract
the provider from the state structure.
The change updates all callers across the kernel (block layer, IOMMU,
DMA direct, and HMM) to pass the provider pointer directly, making
the code more explicit and reducing unnecessary indirection. This
also removes the runtime warning check since callers now have direct
control over which provider they use.
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-2-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Currently the P2PDMA code requires a pgmap and a struct page to
function. This was serving three important purposes:
- DMA API compatibility, where scatterlist required a struct page as
input
- Life cycle management, the percpu_ref is used to prevent UAF during
device hot unplug
- A way to get the P2P provider data through the pci_p2pdma_pagemap
The DMA API now has a new flow, and has gained phys_addr_t support, so
it no longer needs struct pages to perform P2P mapping.
Lifecycle management can be delegated to the user, DMABUF for instance
has a suitable invalidation protocol that does not require struct page.
Finding the P2P provider data can also be managed by the caller
without needing to look it up from the phys_addr.
Split the P2PDMA code into two layers. The optional upper layer,
effectively, provides a way to mmap() P2P memory into a VMA by
providing struct page, pgmap, a genalloc and sysfs.
The lower layer provides the actual P2P infrastructure and is wrapped
up in a new struct p2pdma_provider. Rework the mmap layer to use new
p2pdma_provider based APIs.
Drivers that do not want to put P2P memory into VMA's can allocate a
struct p2pdma_provider after probe() starts and free it before
remove() completes. When DMA mapping the driver must convey the struct
p2pdma_provider to the DMA mapping code along with a phys_addr of the
MMIO BAR slice to map. The driver must ensure that no DMA mapping
outlives the lifetime of the struct p2pdma_provider.
The intended target of this new API layer is DMABUF. There is usually
only a single p2pdma_provider for a DMABUF exporter. Most drivers can
establish the p2pdma_provider during probe, access the single instance
during DMABUF attach and use that to drive the DMA mapping.
DMABUF provides an invalidation mechanism that can guarantee all DMA
is halted and the DMA mappings are undone prior to destroying the
struct p2pdma_provider. This ensures there is no UAF through DMABUFs
that are lingering past driver removal.
The new p2pdma_provider layer cannot be used to create P2P memory that
can be mapped into VMA's, be used with pin_user_pages(), O_DIRECT, and
so on. These use cases must still use the mmap() layer. The
p2pdma_provider layer is principally for DMABUF-like use cases where
DMABUF natively manages the life cycle and access instead of
vmas/pin_user_pages()/struct page.
In addition, remove the bus_off field from pci_p2pdma_map_state since
it duplicates information already available in the pgmap structure.
The bus_offset is only used in one location (pci_p2pdma_bus_addr_map)
and is always identical to pgmap->bus_offset.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-1-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Commit d373449d8e ("iommu/vt-d: Use the generic iommu page table")
changed the calculation of domain::aperture_end. Previously, it was
calculated as:
domain->domain.geometry.aperture_end =
__DOMAIN_MAX_ADDR(domain->gaw - 1);
where domain->gaw was limited to less than MGAW.
Currently, it is calculated purely based on the max level of the page
table that the hardware supports. This is incorrect as stated in Section
3.6 of the VT-d spec:
"Software using first-stage translation structures to translate an IO
Virtual Address (IOVA) must use canonical addresses. Additionally,
software must limit addresses to less than the minimum of MGAW and the
lower canonical address width implied by FSPM (i.e., 47-bit when FSPM
is 4-level and 56-bit when FSPM is 5-level)."
Restore the previous calculation method for domain::aperture_end to avoid
violating the spec. Incorrect aperture calculation causes GPU hangs
without generating VT-d faults on some Intel client platforms.
Fixes: d373449d8e ("iommu/vt-d: Use the generic iommu page table")
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/4f15cf3b-6fad-4cd8-87e5-6d86c0082673@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The INTEL_IOMMU_FLOPPY_WA workaround was introduced to create direct
mappings for the first 16MB for floppy devices, as the floppy drivers were
not using DMA APIs. We need not do this direct map if the floppy driver is
not enabled.
INTEL_IOMMU_FLOPPY_WA is generally not a good idea. The IOMMU will be
mapping pages in this address range while the kernel may also be
allocating from this range (mostly under memory stress). A misbehaving
device using this domain will have access to pages that the kernel might
be actively using. We noticed this while running a test that was trying
to figure out whether any pages used by the kernel are in IOMMU page
tables.
This patch reduces the scope of the above issue by disabling the
workaround when the floppy driver is not enabled. But we would still need
to fix the floppy driver to use DMA APIs so that we need not do the
direct map without reserving the pages. The other option is to reserve
this memory range in firmware so that the kernel will not use the pages.
Fixes: d850c2ee5f ("iommu/vt-d: Expose ISA direct mapping region via iommu_get_resv_regions")
Fixes: 49a0429e53 ("Intel IOMMU: Iommu floppy workaround")
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Link: https://lore.kernel.org/r/20251002161625.1155133-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Split most of the rootid_owns_currentns() functionality
into a more generic rootid_owns_ns() function which
will be easier to write tests for.
Rename the functions and variables to make clear that
the ids being tested could be any uid.
Signed-off-by: Serge Hallyn <serge@hallyn.com>
CC: Ryan Foster <foster.ryan.r@gmail.com>
CC: Christian Brauner <brauner@kernel.org>
mock_get_event() uses an uninitialized local variable, nr_overflow, to
populate the overflow_err_count field. That results in incorrect
overflow_err_count values in mocked cxl_overflow trace events, such as
this case where the records are reported as 0 and should be non-zero:
[] cxl_overflow: memdev=mem7 host=cxl_mem.6 serial=7: log=Failure : 0 records from 1763228189130895685 to 1763228193130896180
Fix by using log->nr_overflow and remove the unused local variable.
A follow-up change was considered in cxl_mem_get_records_log() to
confirm that the overflow_err_count is non-zero when the overflow flag
is set [1]. Since the driver has no functional dependency on this
constraint, and a device that violates this specific requirement does
not cause incorrect driver behavior, no validation check is added.
[1] CXL 3.2, Table 8-65 Get Event Records Output Payload
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251116013036.1713313-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Commit 364ee9f326 ("cxl/test: Enhance event testing") changed the
loop iterator in mock_get_event() from a static constant,
CXL_TEST_EVENT_CNT, to a dynamic global variable, ret_limit. The
intent was to vary the number of events returned per call to simulate
events occurring while logs are being read.
However, ret_limit is modified without synchronization. When multiple
threads call mock_get_event() concurrently, one thread may read
ret_limit, another thread may increment it, and the first thread's
loop condition and size calculation see and use the updated value.
This is visible during cxl_test module load when all memdevs are
initializing simultaneously, which includes getting event records. It
is not tied to the cxl-events.sh unit test specifically, as that
operates on a single memdev.
While no actual harm results (the buffer is always large enough and
the record count fields correctly reflect what was written), this is
a correctness issue. The race creates an inconsistent state within
mock_get_event() and adding variability based on a race appears
unintended.
Make ret_limit a local variable populated from an atomic counter. Each
call gets a stable value that won't change during execution. That
preserves the intended behavior of varying the return counts across
calls while eliminating the race condition.
This implementation uses "+ 1" to produce the full range of 1 to
CXL_TEST_EVENT_RET_MAX (4) records. Previously only 1, 2, 3 were
produced.
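A minimal sketch of that scheme; the counter name is illustrative,
CXL_TEST_EVENT_RET_MAX is the constant mentioned above:

  static atomic_t ret_limit_cnt;  /* shared, monotonically increasing */

  /* in mock_get_event(): a stable, per-call limit in the range 1..4 */
  int ret_limit = (atomic_inc_return(&ret_limit_cnt) %
                   CXL_TEST_EVENT_RET_MAX) + 1;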
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251116013819.1713780-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Extended linear cache unit testing support
- Standardize CXL auto region size
- Add cxl_test CFMWS support for extended linear cache
- Add support for acpi extended linear cache
Add a module parameter to allow activation of extended linear cache
on the auto region for cxl_test. The current platform implementation
for extended linear cache is 1:1 of DRAM and CXL memory. A CFMWS is
created with the size of both memory together where DRAM takes the
first part of the memory range and CXL covers the second part. The
current CXL auto region on cxl_test consists of two 256M devices that
create a 512M region. The new extended linear cache setup will have
512M DRAM and 512M CXL memory for a total of 1G CFMWS. The hardware
decoders must have their starting offset moved to after the DRAM region
to handle the CXL regions.
[ dj: Fixup commenting style. (Jonathan) ]
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Link: https://patch.msgid.link/20251117144611.903692-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during probe_device().
Note that commit 9826e393e4 ("iommu/tegra-smmu: Fix missing
put_device() call in tegra_smmu_find") fixed the leak in an error path,
but the reference is still leaking on success.
Fixes: 8918465163 ("memory: Add NVIDIA Tegra memory controller support")
Cc: stable@vger.kernel.org # 3.19: 9826e393e4
Cc: Miaoqian Lin <linmq006@gmail.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Simplify the probe_device() error handling by dropping the iommu OF node
reference sooner.
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Make sure to drop the references taken to the iommu platform devices
when looking up their driver data during probe_device().
Note that the arch data device pointer added by commit 604629bcb5
("iommu/omap: add support for late attachment of iommu devices") has
never been used. Remove it to underline that the references are not
needed.
Fixes: 9d5018deec ("iommu/omap: Add support to program multiple iommus")
Fixes: 7d6827748d ("iommu/omap: Fix iommu archdata name for DT-based devices")
Cc: stable@vger.kernel.org # 3.18
Cc: Suman Anna <s-anna@ti.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
As previously documented by commit 2659392856 ("iommu/mediatek: Add
error path for loop of mm_dts_parse"), the id mapping may not be linear
so the whole larb array needs to be iterated on devicetree parsing
errors.
Simplify the loop by iterating from index zero while dropping the
redundant NULL check for consistency with later cleanups.
Also add back the comment which was removed by commit 462e768b55
("iommu/mediatek: Fix forever loop in error handling") to prevent anyone
from trying to optimise the loop by iterating backwards from 'i'.
Cc: Yong Wu <yong.wu@mediatek.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The driver is dropping the references taken to the larb devices during
probe after successful lookup as well as on errors. This can
potentially lead to a use-after-free in case a larb device has not yet
been bound to its driver so that the iommu driver probe defers.
Fix this by keeping the references as expected while the iommu driver is
bound.
Fixes: 2659392856 ("iommu/mediatek: Add error path for loop of mm_dts_parse")
Cc: stable@vger.kernel.org
Cc: Yong Wu <yong.wu@mediatek.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during of_xlate().
Note that commit 1a26044954 ("iommu/exynos: add missing put_device()
call in exynos_iommu_of_xlate()") fixed the leak in a couple of error
paths, but the reference is still leaking on success.
Fixes: aa759fd376 ("iommu/exynos: Add callback for initializing devices from device tree")
Cc: stable@vger.kernel.org # 4.2: 1a26044954
Cc: Yu Kuai <yukuai3@huawei.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during of_xlate().
Note that commit e2eae09939 ("iommu/qcom: add missing put_device()
call in qcom_iommu_of_xlate()") fixed the leak in a couple of error
paths, but the reference is still leaking on success and late failures.
Fixes: 0ae349a0f3 ("iommu/qcom: Add qcom_iommu")
Cc: stable@vger.kernel.org # 4.14: e2eae09939
Cc: Rob Clark <robin.clark@oss.qualcomm.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In a review comment for v1 of the genpt documentation fixes [1], Randy
suggested that the pt_test_sw_bit_acquire() parameter description
should be written using "to read". Commit e4dfaf25df ("iommupt:
Describe @bitnr parameter"), however, misunderstood the review by
instead using "to read" for the @bitnr parameter of both
pt_test_sw_bit_acquire() and pt_test_sw_bit_release().
Actually correct the description.
[1]: https://lore.kernel.org/linux-doc/9dba0eb7-6f32-41b7-b70b-12379364585f@infradead.org/
Fixes: e4dfaf25df ("iommupt: Describe @bitnr parameter")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
A root decoder's callback handlers are collected in struct cxl_rd_ops.
The structure is dynamically allocated, though it contains only a few
pointers. This also requires checking two pointers to determine whether
a callback exists.
Simplify the allocation, release and handler check by embedding the
ops statically in struct cxl_root_decoder.
The implementation is equivalent to how struct cxl_root_ops handles the
callbacks.
[ dj: Fix spelling error in commit log. ]
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Link: https://patch.msgid.link/20251114075844.1315805-2-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Integrate the selftests as part of kunit.
Now instead of the test only being run at boot, it can run:
- With CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST=y
It will automatically run at boot as before.
- Otherwise with CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST=m:
1) on module load:
Once the module loads, the selftest will run
# modprobe io-pgtable-arm-selftests
2) debugfs
With CONFIG_KUNIT_DEBUGFS=y you can run the test with
# echo 1 > /sys/kernel/debug/kunit/io-pgtable-arm-test/run
3) Using kunit.py
You can also use the helper script which uses Qemu in the background
# ./tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 \
--make_options LLVM=1 --kunitconfig ./kunit/kunitconfig
[18:01:09] ============= io-pgtable-arm-test (1 subtest) ==============
[18:01:09] [PASSED] arm_lpae_do_selftests
[18:01:09] =============== [PASSED] io-pgtable-arm-test ===============
[18:01:09] ============================================================
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Remove the __init constraint; as the test will be converted to KUnit,
it can run on demand later.
Also, as KUnit tests can be built as modules, make this test modular.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Clean up the io-pgtable-arm library by moving the selftests out.
Next, the tests will be registered with kunit.
This is also useful to factor kernel-specific code out, so
it can be compiled as part of the hypervisor object.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
At the moment, if the selftest fails it prints a lot of information
about the page table (size, levels, ...); this requires access to many
internals, which would have to be exposed in the next patch moving the
tests out.
Instead, we can simplify the print to only the fmt; ias, oas and
pgsize_bitmap are already printed for each test.
That is enough to identify the failing case, and the rest can
be deduced from the code.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The current IOMMU driver prints the "Completion-wait Time-out" error
message with insufficient information to further debug the issue.
Enhance the error message as follows:
1. Log the IOMMU PCI device ID in the error message.
2. With the "amd_iommu_dump=1" kernel command line option, dump the
entire command buffer entries including the Head and Tail offsets.
Dump the entire command buffer only on the first 'Completion-wait Time-out'
to avoid dmesg spam.
Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When a process exits with numerous large, pinned memory regions consisting
of 4KB pages, the cleanup of the memory region through __ib_umem_release()
may cause soft lockups. This is because unpin_user_page_range_dirty_lock()
is called in a tight loop to unpin and release pages without yielding the
CPU.
watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464]
Kernel panic - not syncing: softlockup: hung tasks
CPU: 44 PID: 73464 Comm: python3 Tainted: G OEL
asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:free_unref_page+0xff/0x190
? free_unref_page+0xe3/0x190
__put_page+0x77/0xe0
put_compound_head+0xed/0x100
unpin_user_page_range_dirty_lock+0xb2/0x180
__ib_umem_release+0x57/0xb0 [ib_core]
ib_umem_release+0x3f/0xd0 [ib_core]
mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib]
ib_dereg_mr_user+0x43/0xb0 [ib_core]
uverbs_free_mr+0x15/0x20 [ib_uverbs]
destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs]
__uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs]
uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
__fput+0x9c/0x280
____fput+0xe/0x20
task_work_run+0x6a/0xb0
do_exit+0x217/0x3c0
do_group_exit+0x3b/0xb0
get_signal+0x150/0x900
arch_do_signal_or_restart+0xde/0x100
exit_to_user_mode_loop+0xc4/0x160
exit_to_user_mode_prepare+0xa0/0xb0
syscall_exit_to_user_mode+0x27/0x50
do_syscall_64+0x63/0xb0
Fix the soft lockup by adding cond_resched() calls within
__ib_umem_release(). Since SG entries are typically grouped in 2MB
chunks on x86_64, adding cond_resched() should have minimal performance
impact.
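A minimal sketch of the idea, using illustrative names rather than the
actual ib_core code:
  /*
   * Hedged sketch: yield the CPU between scatterlist entries while
   * unpinning, so releasing a very large pinned region cannot trip the
   * soft lockup watchdog. Function and variable names are illustrative.
   */
  static void example_umem_release(struct sg_table *sgt, bool make_dirty)
  {
          struct scatterlist *sg;
          int i;

          for_each_sgtable_sg(sgt, sg, i) {
                  unpin_user_page_range_dirty_lock(sg_page(sg),
                                  DIV_ROUND_UP(sg->length, PAGE_SIZE),
                                  make_dirty);
                  cond_resched();         /* the fix: yield between entries */
          }
  }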
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251113095317.2628-1-lirongqing@baidu.com
Acked-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Instead of hooking the general ioctl op, have the core code directly
decode VFIO_DEVICE_GET_REGION_INFO and call an op just for it.
This is intended to allow mechanical changes to the drivers to pull their
VFIO_DEVICE_GET_REGION_INFO handling into a function. Later patches will
improve the function signature to consolidate more code.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
When a decoder is locked, it means that its configuration cannot be
changed. CXL spec r3.2 8.2.4.20.13 discusses the details regarding
locked decoders. Locking happens when bit 8 of the decoder control
register is set and then the decoder is committed afterwards (CXL
spec r3.2 8.2.4.20.7).
Given that the driver creates a virtual decoder for each CFMWS, the
Fixed Device Configuration (bit 4) of the Window Restriction field is
considered as locking for the virtual decoder by the driver.
The current driver code disregards the locked status and a region can
be destroyed regardless of the locking state.
Add a region flag to indicate the region is in a locked configuration.
The driver will consider a region locked if the CFMWS or any decoder
is configured as locked. The consideration is all or nothing regarding
the locked state. It is reasonable to determine the region "locked"
status while the region is being assembled based on the decoders.
Add a check in region commit_store() to intercept when a 0 is written
to the commit sysfs attribute in order to prevent the destruction of a
region when in locked state. This should be the only entry point from user
space to destroy a region.
Add a check to cxl_decoder_reset() to prevent resetting a locked
decoder within the kernel driver.
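A minimal sketch of the commit_store() intercept described above; the
function shape follows the standard sysfs store signature, while the flag
name and the remaining commit/teardown logic are placeholders:
  static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
                              const char *buf, size_t len)
  {
          struct cxl_region *cxlr = to_cxl_region(dev);
          bool commit;
          int rc;

          rc = kstrtobool(buf, &commit);
          if (rc)
                  return rc;

          /* Refuse to tear down a region assembled from locked decoders.
           * CXL_REGION_F_LOCKED is an illustrative flag name. */
          if (!commit && test_bit(CXL_REGION_F_LOCKED, &cxlr->flags))
                  return -EPERM;

          /* ... existing commit/decommit sequence continues here ... */
          return len;
  }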
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251105201826.2901915-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
In include/rdma/ib_cm.h:
Correct a typedef's kernel-doc notation by adding the 'typedef' keyword
to it to avoid a warning.
Add a leading " *" to a kernel-doc line to avoid a warning.
Warning: ib_cm.h:289 function parameter 'ib_cm_handler' not described
in 'int'
Warning: ib_cm.h:289 expecting prototype for ib_cm_handler(). Prototype
was for int() instead
Warning: ib_cm.h:484 bad line: connection message in case duplicates
are received.
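For reference, kernel-doc notation for a function-pointer typedef looks
roughly like the sketch below; the descriptions are illustrative, not the
actual ib_cm.h text:
  /**
   * typedef ib_cm_handler - Callback invoked to report CM events
   * @cm_id: Communication identifier associated with the event
   * @event: Information about the communication event
   *
   * The leading 'typedef' keyword lets kernel-doc associate the comment
   * with the typedef rather than with a plain function prototype.
   */
  typedef int (*ib_cm_handler)(struct ib_cm_id *cm_id,
                               const struct ib_cm_event *event);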
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251112062908.2711007-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Single patch to expose new link mode for 1600Gbps, utilizing 8 lanes at
200Gbps per lane.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
* mlx5-next:
net/mlx5: Expose definition for 1600Gbps link mode
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
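A minimal sketch of what callers look like after the change; the queue
names are made up:
  static struct workqueue_struct *percpu_wq, *unbound_wq;

  static int example_create_queues(void)
  {
          /* explicitly per-CPU, instead of relying on the implicit default */
          percpu_wq = alloc_workqueue("example_percpu", WQ_PERCPU, 0);
          /* unbound queues keep stating WQ_UNBOUND as before */
          unbound_wq = alloc_workqueue("example_unbound", WQ_UNBOUND, 0);
          if (!percpu_wq || !unbound_wq)
                  return -ENOMEM;     /* error unwinding omitted for brevity */
          return 0;
  }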
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107133626.190952-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107133306.187939-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Refactor the _get_prio() function to remove redundant arguments by
reusing the existing flow table attributes struct instead of passing
attributes separately. This improves code clarity and maintainability.
In addition, this allows a downstream patch to add a new parameter without
needing to change its arguments.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-6-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In case of a LAG configuration, change the root namespace core device for
all of the LAG slaves to be the core device of the master device for
RDMA_TRANSPORT namespaces, in order to ensure all tables are created
through the master device.
Once the LAG is disabled, revert back to the native core device.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-4-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When the device is in switchdev mode, the RDMA device manages all the
vports which belong to its representors, which can lead to a situation
where the PF that is used to manage the RDMA device isn't the native PF
of some of the vports it manages.
Add infrastructure to allow the master PF to manage all the hardware
resources for the vports under its management.
Currently the only such resource is RDMA TRANSPORT steering
domains.
This is done by adding a new FW argument, other_eswitch, which is passed by
the driver to the FW to allow the master PF to properly manage vports
belonging to another native PF.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Some adjustments pointed out by Randy:
"decodes an full 64-bit" -> "decodes the full 64 bit"
Correct the function parameter name for iova_to_phys()
Use the recommended section heading style.
Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: ab0b572847 ("genpt: Add Documentation/ files")
Fixes: 879ced2bab ("iommupt: Add the AMD IOMMU v1 page table format")
Fixes: 9d4c274cd7 ("iommupt: Add iova_to_phys op")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Sphinx reports kernel-doc warnings when making htmldocs:
WARNING: ./drivers/iommu/generic_pt/pt_common.h:361 function parameter 'bitnr' not described in 'pt_test_sw_bit_acquire'
WARNING: ./drivers/iommu/generic_pt/pt_common.h:371 function parameter 'bitnr' not described in 'pt_set_sw_bit_release'
Describe @bitnr to squash them.
Fixes: bcc64b57b4 ("iommupt: Add basic support for SW bits in the page table")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Stephen Rothwell reports htmldocs warning when merging iommu tree:
Documentation/driver-api/generic_pt.rst:32: WARNING: Literal block expected; none found. [docutils]
This is because of duplicate double colon code block markers: one after
generic_pt/fmt/iommu_amdv1.c and one in its preceding paragraph. The
resulting htmldocs output, however, only marks up the include listing
(after the former) as it should.
Drop the latter to fix the warning.
Fixes: ab0b572847 ("genpt: Add Documentation/ files")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20251106143925.578e411b@canb.auug.org.au/
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
For cases where the user includes a non-zero value in the 'token_uuid_ptr'
field of 'struct vfio_device_bind_iommufd', the copy_struct_from_user()
in vfio_df_ioctl_bind_iommufd() fails with -E2BIG. For the 'minsz' passed,
copy_struct_from_user() expects the newly introduced field to be zero-ed,
which would be incorrect in this case.
Fix this by passing the actual size of the kernel struct. If working
with a newer userspace, copy_struct_from_user() would copy the
'token_uuid_ptr' field, and if working with an old userspace, it would
zero out this field, thus still retaining backward compatibility.
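A minimal sketch of the fix, with names approximated from the commit text
rather than copied from the driver:
  struct vfio_device_bind_iommufd bind;
  int ret;

  /*
   * Pass the full kernel struct size instead of the original minsz: a
   * newer userspace's token_uuid_ptr is copied, while an older (shorter)
   * userspace buffer has the field zero-filled, preserving backward
   * compatibility. 'arg' and 'user_argsz' are illustrative.
   */
  ret = copy_struct_from_user(&bind, sizeof(bind), (void __user *)arg,
                              user_argsz);
  if (ret)
          return ret;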
Fixes: 86624ba3b5 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD")
Cc: stable@vger.kernel.org
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251031170603.2260022-2-rananta@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Correct the kernel-doc comments format to avoid around 35 kernel-doc
warnings:
- use struct keyword to introduce struct kernel-doc comments
- use correct variable name for some struct members
- use correct function name in comments for some functions
- fix spelling in a few comments
- use a ':' instead of '-' to separate struct members from their
descriptions
- add a function name heading for rvt_div_mtu()
This leaves one struct member that is not described:
rdmavt_qp.h:206: warning: Function parameter or struct member 'wq'
not described in 'rvt_krwq'
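For reference, the struct kernel-doc conventions the patch applies look
roughly like this sketch (the struct and its members are made up, not
taken from rdmavt):
  /**
   * struct example_rwq - Example receive work queue
   * @lock: protects producer-side updates
   * @head: index of the next entry to be produced
   * @tail: index of the next entry to be consumed
   *
   * Members are introduced with '@name:' (a colon, not a dash) so that
   * kernel-doc can match each description to the member it documents.
   */
  struct example_rwq {
          spinlock_t lock;
          u32 head;
          u32 tail;
  };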
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251105045127.106822-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-6-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
CC: Yishai Hadas <yishaih@nvidia.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-5-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-4-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-3-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently, if a user enqueues a work item using schedule_delayed_work(), the
wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Add system_dfl_wq to encourage its use whenever unbound work is appropriate.
The old system_unbound_wq will be kept for a few release cycles.
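A minimal sketch of a caller switching to the new queue; the work item is
made up:
  static void example_fn(struct work_struct *work)
  {
          /* work with no CPU locality requirement */
  }
  static DECLARE_WORK(example_work, example_fn);

  static void example_kick(void)
  {
          /* queue on the new default unbound wq instead of system_unbound_wq */
          queue_work(system_dfl_wq, &example_work);
  }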
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
On platforms newer than QM_HW_V3, the migration region has been
relocated from the VF to the PF. The VF's own configuration space is
restored to the complete 64KB, and there is no need to divide the
size of the BAR configuration space equally. The driver should be
modified accordingly to adapt to the new hardware device.
On the older hardware platform QM_HW_V3, the live migration configuration
region is placed in the latter 32K portion of the VF's BAR2 configuration
space. On the new hardware platform QM_HW_V4, the live migration
configuration region also exists in the same 32K area immediately following
the VF's BAR2, just like on QM_HW_V3.
However, access to this region is now controlled by hardware. Additionally,
a copy of the live migration configuration region is present in the PF's
BAR2 configuration space. On the new hardware platform QM_HW_V4, when an
older version of the driver is loaded, it behaves like QM_HW_V3 and uses
the configuration region in the VF, ensuring that the live migration
function continues to work normally. When the new version of the driver is
loaded, it directly uses the configuration region in the PF. Meanwhile,
hardware configuration disables the live migration configuration region
in the VF's BAR2: reads return all 0xF values, and writes are silently
ignored.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Link: https://lore.kernel.org/r/20251030015744.131771-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
On platforms newer than QM_HW_V3, the configuration region for the
live migration function of the accelerator device is no longer
placed in the VF, but is instead placed in the PF.
Therefore, the configuration region of the live migration function
needs to be opened when the QM driver is loaded. When the QM driver
is uninstalled, the driver needs to clear this configuration.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20251030015744.131771-2-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Currently an incoherent walk domain cannot be attached to a coherent-capable
iommu. Kevin says HW probably doesn't exist with such a mixture,
but making the driver support it makes logical sense anyhow.
When building the PASID entry the PWSNP (Page Walk Snoop) bit tells the HW
if it should issue snoops. If the page table is cache flushed because of
PT_FEAT_DMA_INCOHERENT then it is fine to set this bit to 0 even if the HW
supports 1.
Weaken the compatible check to permit a coherent instance to accept an
incoherent table and fix the PASID table construction to set PWSNP from
PT_FEAT_DMA_INCOHERENT.
SVA always sets PWSNP.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Replace the iommu_domain implementation of the VT-d second stage and
first stage page tables with the iommupt VTDSS and x86_64
page tables. x86_64 is shared with the AMD driver.
There are a couple notable things in VT-d:
- Like AMD the second stage format is not sign extended, unlike AMD it
cannot decode a full 64 bits. The first stage format is a normal sign
extended x86 page table
- The HW caps can indicate how many levels, how many address bits and what
leaf page sizes are supported in HW. As before the highest number of
levels that can translate the entire supported address width is used.
The supported page sizes are adjusted directly from the dedicated
first/second stage cap bits.
- VTD requires flushing 'write buffers'. This logic is left unchanged,
the write buffer flushes on any gather flush or through iotlb_sync_map.
- Like ARM, VTD has an optional non-coherent page table walker that
requires cache flushing. This is supported through PT_FEAT_DMA_INCOHERENT
the same as ARM, however x86 can't use the DMA API for flush, it must
call the arch function clflush_cache_range()
- The PT_FEAT_DYNAMIC_TOP can probably be supported on VT-d someday for the
second stage when it uses 128 bit atomic stores for the HW context
structures.
- PT_FEAT_VTDSS_FORCE_WRITEABLE is used to work around ERRATA_772415_SPR17
- A kernel command line parameter "sp_off" disables all page sizes except
4k
Remove all the unused iommu_domain page table code. The debugfs paths have
their own independent page table walker that is left alone for now.
This corrects a race with the non-coherent walker that the ARM
implementations have fixed:
     CPU 0                               CPU 1
     pfn_to_dma_pte()                    pfn_to_dma_pte()
       pte = &parent[offset];
       if (!dma_pte_present(pte)) {
         try_cmpxchg64(&pte->val)
                                           pte = &parent[offset];
                                           .. dma_pte_present(pte) ..
                                           [...]
                                           // iommu_map() completes
                                           // Device does DMA
       domain_flush_cache(pte)
The CPU 1 mapping operation shares a page table level with the CPU 0
mapping operation. CPU 0 installed a new page table level but has not
flushed it yet. CPU1 returns from iommu_map() and the device does DMA. The
non coherent walker fails to see the new table level installed by CPU 0
and fails the DMA with non-present.
The iommupt PT_FEAT_DMA_INCOHERENT implementation uses the ARM design of
storing a flag when CPU 0 completes the flush. If the flag is not set CPU
1 will also flush to ensure the HW can fully walk to the PTE being
installed.
Cc: Tina Zhang <tina.zhang@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
VT-d requires PT_FEAT_DMA_INCOHERENT for the x86 page table as well,
implement the required SW bits and enable the feature.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
AMD and VTD are historically different here, adopt the VTD version of
setting the D bit only on writable PTEs as it makes more sense.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The VT-d second stage format is almost the same as the x86 PAE format,
except the bit encodings in the PTE are different and a few new PTE
features, like force coherency are present.
Among all the formats it is unique in not having a designated present bit.
Comparing the performance of several operations to the existing version:
iommu_map()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 53,66 , 50,64 , 21.21
2^21, 59,70 , 56,67 , 16.16
2^30, 54,66 , 52,63 , 17.17
256*2^12, 384,524 , 337,516 , 34.34
256*2^21, 387,632 , 336,626 , 46.46
256*2^30, 376,629 , 323,623 , 48.48
iommu_unmap()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 67,86 , 63,84 , 25.25
2^21, 64,84 , 59,80 , 26.26
2^30, 59,78 , 56,74 , 24.24
256*2^12, 216,335 , 198,317 , 37.37
256*2^21, 245,350 , 232,344 , 32.32
256*2^30, 248,345 , 226,339 , 33.33
Cc: Tina Zhang <tina.zhang@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Flush the CPU cache for the page table memory after each set of writes to
the page table. The iommu should have visibility to the updated entries as
soon as the map/unmap/etc operations return, like normal coherent hardware
does.
The caches also have to be flushed before any gather can be submitted to
the driver.
Implement the same solution to the race as io-pgtable-arm by using a
software PTE bit to track if a table entry has been flushed or not. If
another thread is still flushing then another concurrent map operation
could return without IOMMU visibility to a required table entry. The SW
bit will tell the second thread to also flush the cache.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This is the first step to supporting an incoherent walker: start and stop
the incoherence around the allocation and freeing of the page table memory.
The iommu_pages API maps this to dma_map/unmap_single(), or arch cache
flushing calls.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
SW bits can be placed on items, including table entries, single OA's and
individual items within a contiguous OA. They are guaranteed to be ignored
by the HW. The API is very basic since the only use case so far is a
single bit.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Some IOMMU HW cannot snoop the CPU cache when it walks the IO page tables.
The CPU is required to flush the cache to make changes visible to the HW.
Provide some helpers from iommu-pages to manage this. The helpers combine
both the ARM and x86 (used in Intel VT-d) versions of the cache flushing
under a single API.
The ARM version uses the DMA API to access the cache flush on the
assumption that the iommu is using a direct mapping and is already marked
incoherent. The helpers will do the DMA API calls to set things up and
keep track of DMA mapped folios using a bit in the ioptdesc so that
unmapping on error paths is cleaner.
The Intel version just calls the arch cache flush call directly and has no
need to cleanup prior to destruction.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This intends to have high coverage of the page table format functions and
the IOMMU implementation itself, exercising the various corner cases.
The kunit tests can be run in the kunit framework, using commands like:
tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y
There are several interesting corner cases on the 32 bit platforms that
need checking.
Like the generic tests, these are run on the format's configuration list
using kunit "params". This also checks the core iommu parts of the page
table code as it enters the logic through a mock iommu_domain.
The following are checked:
- PT_FEAT_DYNAMIC_TOP properly adds levels one by one
- Every page size can be iommu_map()'d, and mapping creates that size
- iommu_iova_to_phys() works with every page size
- Test converting OA -> non present -> OA when the two OAs overlap and
free table levels
- Test that unmap stops at holes, unmap doesn't split, and unmap returns
the right values for partial unmap requests
- Randomly map/unmap. Checks map with random sizes, that map fails when
hitting collisions doing nothing, unmap/map with random intersections and
full unmap of random sizes. Also checks iommu_iova_to_phys() with random
sizes
- Check for memory leaks by monitoring NR_SECONDARY_PAGETABLE
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Replace the io_pgtable versions with pt_iommu versions. The v2 page table
uses the x86 implementation that will be eventually shared with VT-d.
This supports the same special features as the original code:
- increase_top for the v1 format to allow scaling from 3 to 6 levels
- non-present flushing
- Dirty tracking for v1 only
- __sme_set() to adjust the PTEs for CC
- Optimization for flushing with virtualization to minimize the range
- amd_iommu_pgsize_bitmap override of the native page sizes
- page tables allocate from the device's NUMA node
Rework the domain ops so that v1/v2 get their own ops. Make dedicated
allocation functions for v1 and v2. Hook up invalidation for a top change
to struct pt_iommu_flush_ops. Delete some of the iopgtable related code
that becomes unused in this patch. The next patch will delete the rest of
it.
This fixes a race bug in AMD's increase_address_space() implementation. It
stores the top level and top pointer in different memory, which prevents
other threads from reading a coherent version:
     increase_address_space()            alloc_pte()
                                           level = pgtable->mode - 1;
       pgtable->root = pte;
       pgtable->mode += 1;
                                           pte = &pgtable->root[PM_LEVEL_INDEX(level, address)];
The iommupt version is careful to put mode and root under a single
READ_ONCE and then is careful to only READ_ONCE a single time per
walk.
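A minimal sketch of the idea, with made-up names: the top pointer and its
level live in one word, so a walker gets a consistent pair from a single
READ_ONCE():
  #define PT_TOP_LEVEL_MASK       0x7UL   /* tables are at least 8-byte aligned */

  static inline uintptr_t pack_top(void *table, unsigned int level)
  {
          return (uintptr_t)table | level;
  }

  static inline void *read_top(const uintptr_t *top_of_table,
                               unsigned int *level)
  {
          uintptr_t top = READ_ONCE(*top_of_table);  /* one atomic snapshot */

          *level = top & PT_TOP_LEVEL_MASK;
          return (void *)(top & ~PT_TOP_LEVEL_MASK);
  }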
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This is used by x86 CPUs and can be used in AMD/VT-d x86 IOMMUs. When an
x86 IOMMU is running SVA the MM will be using this format.
This implementation follows the AMD v2 io-pgtable version.
There is nothing remarkable here: the format can have 4 or 5 levels and
limited support for different page sizes. There is no contiguous page support.
x86 uses a sign extension mechanism where the top bits of the VA must
match the sign bit. The core code supports this through
PT_FEAT_SIGN_EXTEND which creates an upper and lower VA range. All the
new operations will work correctly in both spaces, however currently there
is no way to report the upper space to other layers. Future patches can
improve that.
In principle this can support 3 page table levels matching the 32 bit PAE
table format, but no iommu driver needs this. The focus is on the modern
64 bit 4 and 5 level formats.
Comparing the performance of several operations to the existing version:
iommu_map()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 71,61 , 66,58 , -13.13
2^21, 66,60 , 61,55 , -10.10
2^30, 59,56 , 56,54 , -3.03
256*2^12, 392,1360 , 345,1289 , 73.73
256*2^21, 383,1159 , 335,1145 , 70.70
256*2^30, 378,965 , 331,892 , 62.62
iommu_unmap()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 77,71 , 73,68 , -7.07
2^21, 76,70 , 70,66 , -6.06
2^30, 69,66 , 66,63 , -4.04
256*2^12, 225,899 , 210,870 , 75.75
256*2^21, 262,722 , 248,710 , 65.65
256*2^30, 251,643 , 244,634 , 61.61
The small -ve values in the iommu_unmap() are due to the core code calling
iommu_pgsize() before invoking the domain op. This is unnecessary with this
implementation. Future work optimizes this and gets to 2%, 4%, 3%.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The iommufd self test uses an xarray to store the pfns and their orders to
emulate a page table. Make it act more like a real iommu driver by
replacing the xarray with an iommupt based page table. The new AMDv1 mock
format behaves similarly to the xarray.
Add set_dirty() as an iommu_pt operation to allow the test suite to
simulate HW dirty.
Userspace can select between several formats including the normal AMDv1
format and a special MOCK_IOMMUPT_HUGE variation for testing huge page
dirty tracking. To make the dirty tracking test work the page table must
only store exactly 2M huge pages otherwise the logic the test uses
fails. They cannot be broken up or combined.
Aside from aligning the selftest with a real page table implementation,
this helps test the iommupt code itself.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The iommufd self test uses an xarray to store the pfns and their orders to
emulate a page table. Slightly modify the amdv1 page table to create a
real page table that has similar properties:
- 2k base granule to simulate something like a 4k page table on a 64K
PAGE_SIZE ARM system
- Contiguous page support for every PFN order
- Dirty tracking
AMDv1 is the closest format, as it is the only one that already supports
every page size. Tweak it to have only 5 levels and an 11 bit base granule
and compile it separately as a format variant.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This intends to have high coverage of the page table format functions, it
uses the IOMMU implementation to create a tree which it then walks through
and directly calls the generic page table functions to test them.
It is a good starting point to test a new format header as it is often
able to find typos and inconsistencies much more directly, rather than
with an obscure failure in the iommu implementation.
The tests can be run with commands like:
tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_WERROR=n
tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y
There are several interesting corner cases on the 32 bit platforms that
need checking.
The format can declare a list of configurations that initialize the page
table differently, for instance with different top levels or other
parameters. The kunit will turn these into "params" which cause each test
to run multiple times.
The tests are repeated to run at every table level to check that all the
item encoding formats work.
The following are checked:
- Basic init works for each configuration
- The various log2 functions have the expected behavior at the limits
- pt_compute_best_pgsize() works
- pt_table_pa() reads back what pt_install_table() writes
- range.max_vasz_lg2 works properly
- pt_table_oa_lg2sz() and pt_table_item_lg2sz() use a contiguous
non-overlapping set of bits from the VA up to the defined max_va
- pt_possible_sizes() and pt_can_have_leaf() produces a sensible layout
- pt_item_oa(), pt_entry_oa(), and pt_entry_num_contig_lg2() read back
what pt_install_leaf_entry() writes
- pt_clear_entry() works
- pt_attr_from_entry() reads back what pt_iommu_set_prot() &
pt_install_leaf_entry() writes
- pt_entry_set_write_clean(), pt_entry_make_write_dirty(), and
pt_entry_write_is_dirty() work
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
IOMMU HW now supports updating a dirty bit in an entry when a DMA writes
to the entry's VA range. iommufd has a uAPI to read and clear the dirty
bits from the tables.
This is a trivial recursive descent algorithm to read and optionally clear
the dirty bits. The format needs a function to tell if a contiguous entry
is dirty, and a function to clear a contiguous entry back to clean.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
map is slightly complicated because it has to handle a number of special
edge cases:
- Overmapping a previously shared, but now empty, table level with an OA.
Requires validating and freeing the possibly empty tables
- Doing the above across an entire to-be-created contiguous entry
- Installing a new shared table level concurrently with another thread
- Expanding the table by adding more top levels
Table expansion is a unique feature of AMDv1; this version is quite
similar except that racing concurrent lockless map is handled. The table top
pointer and starting level are encoded in a single uintptr_t which ensures
we can READ_ONCE() without tearing. Any op will do the READ_ONCE() and use
that fixed point as its starting point. Concurrent expansion is handled
with a table global spinlock.
When inserting a new table entry map checks that the entire portion of the
table is empty. This includes freeing any empty lower tables that will be
overwritten by an OA. A separate free list is used while checking and
collecting all the empty lower tables so that writing the new entry is
uninterrupted, either the new entry fully writes or nothing changes.
A special fast path for PAGE_SIZE is implemented that does a direct walk
to the leaf level and installs a single entry. This gives ~15% improvement
for iommu_map() when mapping lists of single pages.
This version sits under the iommu_domain_ops as map_pages() but does not
require the external page size calculation. The implementation is actually
map_range() and can do arbitrary ranges, internally handling all the
validation and supporting any arrangement of page sizes. A future series
can optimize iommu_map() to take advantage of this.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
unmap_pages removes mappings and any fully contained interior tables from
the given range. This follows the now-standard iommu_domain API definition
where it does not split up larger page sizes into smaller. The caller must
perform unmap only on ranges created by map or it must have somehow
otherwise determined safe cut points (eg iommufd/vfio use iova_to_phys to
scan for them)
A future work will provide 'cut' which explicitly does the page size split
if the HW can support it.
unmap is implemented with a recursive descent of the tree. If the caller
provides a VA range that spans an entire table item then the table memory
can be freed as well.
If an entire table item can be freed then this version will also check the
leaf-only level of the tree to ensure that all entries are present to
generate -EINVAL. Many of the existing drivers don't do this extra check.
This version sits under the iommu_domain_ops as unmap_pages() but does not
require the external page size calculation. The implementation is actually
unmap_range() and can do arbitrary ranges, internally handling all the
validation and supporting any arrangement of page sizes. A future series
can optimize __iommu_unmap() to take advantage of this.
Freed page table memory is batched up in the gather and will be freed in
the driver's iotlb_sync() callback after the IOTLB flush completes.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
AMD IOMMU v1 is unique in supporting contiguous pages with a variable size
and it can decode the full 64 bit VA space. Unlike other x86 page tables
this explicitly does not do sign extension as part of allowing the entire
64 bit VA space to be supported.
The general design is quite similar to the x86 PAE format, except with a
6th level and quite different PTE encoding.
This format is the only one that uses the PT_FEAT_DYNAMIC_TOP feature in
the existing code as the existing AMDv1 code starts out with a 3 level
table and adds levels on the fly if more IOVA is needed.
Comparing the performance of several operations to the existing version:
iommu_map()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 65,64 , 62,61 , -1.01
2^13, 70,66 , 67,62 , -8.08
2^14, 73,69 , 71,65 , -9.09
2^15, 78,75 , 75,71 , -5.05
2^16, 89,89 , 86,84 , -2.02
2^17, 128,121 , 124,112 , -10.10
2^18, 175,175 , 170,163 , -4.04
2^19, 264,306 , 261,279 , 6.06
2^20, 444,525 , 438,489 , 10.10
2^21, 60,62 , 58,59 , 1.01
256*2^12, 381,1833 , 367,1795 , 79.79
256*2^21, 375,1623 , 356,1555 , 77.77
256*2^30, 356,1338 , 349,1277 , 72.72
iommu_unmap()
pgsz ,avg new,old ns, min new,old ns , min % (+ve is better)
2^12, 76,89 , 71,86 , 17.17
2^13, 79,89 , 75,86 , 12.12
2^14, 78,90 , 74,86 , 13.13
2^15, 82,89 , 74,86 , 13.13
2^16, 79,89 , 74,86 , 13.13
2^17, 81,89 , 77,87 , 11.11
2^18, 90,92 , 87,89 , 2.02
2^19, 91,93 , 88,90 , 2.02
2^20, 96,95 , 91,92 , 1.01
2^21, 72,88 , 68,85 , 20.20
256*2^12, 372,6583 , 364,6251 , 94.94
256*2^21, 398,6032 , 392,5758 , 93.93
256*2^30, 396,5665 , 389,5258 , 92.92
The ~5-17x speedup when working with multi-PTE map/unmaps is because the
AMD implementation rewalks the entire table on every new PTE while this
version retains its position. The same speedup will be seen with dirtys as
well.
The old implementation triggers a compiler optimization that ends up
generating a "rep stos" memset for contiguous PTEs. Since AMD can have
contiguous PTEs that span 2Kbytes of table this is a huge win compared to
a normal movq loop. It is why the unmap side has a fairly flat runtime as
the contiguous PTE size increases. This version makes it explicit with a
memset64() call.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The existing IOMMU page table implementations duplicate all of the working
algorithms for each format. By using the generic page table API a single C
version of the IOMMU algorithms can be created and re-used for all of the
different formats used in the drivers. The implementation will provide a
single C version of the iommu domain operations: iova_to_phys, map, unmap,
and read_and_clear_dirty.
Further, adding new algorithms and techniques becomes easy to do across
the entire fleet of drivers and formats.
The C functions are drop in compatible with the existing iommu_domain_ops
using the IOMMU_PT_DOMAIN_OPS() macro. Each per-format implementation
compilation unit will produce exported symbols following the pattern
pt_iommu_FMT_map_pages() which the macro directly maps to the
iommu_domain_ops members. This avoids the additional function pointer
indirection like io-pgtable has.
The top level struct used by the drivers is pt_iommu_table_FMT. It
contains the other structs to allow container_of() to move between the
driver, iommu page table, generic page table, and generic format layers.
  struct pt_iommu_table_amdv1 {
          struct pt_iommu {
                  struct iommu_domain domain;
          } iommu;
          struct pt_amdv1 {
                  struct pt_common common;
          } amdpt;
  };
The driver is expected to union the pt_iommu_table_FMT with its own
existing domain struct:
  struct driver_domain {
          union {
                  struct iommu_domain domain;
                  struct pt_iommu_table_amdv1 amdv1;
          };
  };
  PT_IOMMU_CHECK_DOMAIN(struct driver_domain, amdv1, domain);
To create an alias to avoid renaming 'domain' in a lot of driver code.
This allows all the layers to access all the necessary functions to
implement their different roles with no change to any of the existing
iommu core code.
Implement the basic starting point: pt_iommu_init(), get_info() and
deinit().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The generic API is intended to be separated from the implementation of
page table algorithms. It contains only accessors for walking and
manipulating the table and helpers that are useful for building an
implementation. Memory management is not in the generic API, but part of
the implementation.
Using a multi-compilation approach the implementation module would include
headers in this order:
common.h
defs_FMT.h
pt_defs.h
FMT.h
pt_common.h
IMPLEMENTATION.h
Where each compilation unit would have a combination of FMT and
IMPLEMENTATION to produce a per-format per-implementation module.
The API is designed so that the format headers have minimal logic, and
default implementations are provided if the format doesn't include one.
Generally formats provide their code via an inline function using the
pattern:
static inline FMTpt_XX(..) {}
#define pt_XX FMTpt_XX
The common code then enforces a function signature so that there is no
drift in function arguments, or accidental polymorphic functions (as has
been slightly troublesome in mm). Use of function-like #defines is
avoided in the format even though many of the functions are small enough.
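A minimal illustration of the pattern, with entirely made-up names: the
format header supplies a small inline and aliases it to the generic name
the shared implementation calls:
  /* in the defs header for a hypothetical "examplefmt" format */
  static inline unsigned int examplefmt_pt_example_field(u64 pte)
  {
          return pte & 0x7;       /* made-up field in a made-up PTE layout */
  }
  #define pt_example_field examplefmt_pt_example_field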
Provide kdocs for the API surface.
This is enough to implement the 8 initial format variations with all of
their features:
* Entries comprised of contiguous blocks of IO PTEs for larger page
sizes (AMDv1, ARMv8)
* Multi-level tables, up to 6 levels. Runtime selected top level
* The size of the top table level can be selected at runtime (ARM's
concatenated tables)
* The number of levels in the table can optionally increase dynamically
during map (AMDv1)
* Optional leaf entries at any level
* 32 bit/64 bit virtual and output addresses, using every bit
* Sign extended addressing (x86)
* Dirty tracking
A basic simple format takes about 200 lines to declare the required inline
functions.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In iommu_mmio_write(), it validates the user-provided offset with the
check: `iommu->dbg_mmio_offset > iommu->mmio_phys_end - 4`.
This assumes a 4-byte access. However, the corresponding
show handler, iommu_mmio_show(), uses readq() to perform an 8-byte
(64-bit) read.
If a user provides an offset equal to `mmio_phys_end - 4`, the check
passes and the subsequent readq() performs a 4-byte out-of-bounds read.
Fix this by adjusting the boundary check to use sizeof(u64), which
corresponds to the size of the readq() operation.
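A minimal sketch of the corrected check; field names follow the commit
text and may not match the driver exactly:
  /* the offset must leave room for a full 8-byte readq() */
  if (iommu->dbg_mmio_offset > iommu->mmio_phys_end - sizeof(u64))
          return -EINVAL;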
Fixes: 7a4ee419e8 ("iommu/amd: Add debugfs support to dump IOMMU MMIO registers")
Signed-off-by: Songtang Liu <liusongtang@bytedance.com>
Reviewed-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The cxl_acpi module spams "Extended linear cache calculation failed"
when the hmat memory target is not found for a node. This is normal
when the memory target does not contain extended linear cache
attributes. Adjust cxl_acpi_set_cache_size() to just return 0 if an error
is returned from hmat_get_extended_linear_cache_size(); -ENOENT is the
only error that function returns.
Also remove the check for -EOPNOTSUPP in cxl_setup_extended_linear_cache()
since that errno is never returned by cxl_acpi_set_cache_size().
[dj: Flipped minor return logic suggested by Jonathan ]
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251003185509.3215900-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Add a loadable test module that validates CXL address translation
calculations using parameterized test vectors. The module tests both
host-to-device and device-to-host address translations for Modulo and
XOR interleave arithmetic.
Two types of testing are provided:
1. Parameterized test vectors:
Test vectors are passed as module parameters in the format:
"dpa pos r_eiw r_eig hb_ways math expected_spa".
Round-trip validation is performed:
- Translate a DPA and position to a SPA
- Verify the result matches expected SPA
- Translate that SPA back to a DPA and position
- Verify round-trip consistency
2. Internal validation testing:
When no test vectors are provided, the module performs validation
of the translation functions by checking parameter boundaries and
running 10,000 iterations of randomly generated valid parameters
to exercise the core calculation functions.
The module uses the CXL Driver translation functions through symbols
exported exclusively for cxl_translate.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
In preparation for adding a test module that can exercise the address
translation functions performed by the CXL Driver, refactor the XOR
implementation like this:
- Extract the core calculation into a standalone helper function,
- Export the new function for use by test module cxl_translate only,
- Enhance the parameter validation since this new function will be
called from a test module with no guarantee of valid parameters,
- Move the define of struct cxl_cxims_data to cxl.h so the test module
can build xormaps.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
In preparation for adding a test module that exercises the address
translation calculations, extract the core calculations into stand-
alone functions that operate on base parameters without dependencies
on struct cxl_region.
Perform additional parameter validation to protect against a test
module sending bad parameters. Export the validation function, as
well as the three core translation functions for use by test module
cxl_translate only.
This refactoring enables unit testing of the address translation logic
with controlled inputs, while preserving identical functionality in
the existing code paths.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Currently, when a user enqueues a work item using schedule_delayed_work(),
the workqueue used is "system_wq" (the per-cpu wq), while
queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not
specified). The same applies to schedule_work(), which uses system_wq,
and queue_work(), which again makes use of WORK_CPU_UNBOUND.
This inconsistency cannot be addressed without refactoring the API.
system_wq should be the per-cpu workqueue, yet nothing in its name makes
that clear, so replace system_wq with system_percpu_wq.
The old wq (system_wq) will be kept for a few release cycles.
See commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
for the rationale behind these changes.
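For callers the change is mechanical; a before/after sketch (the work item
name is arbitrary):

    static DECLARE_WORK(my_work, my_work_fn);

    queue_work(system_wq, &my_work);        /* old: per-cpu, despite the name */
    queue_work(system_percpu_wq, &my_work); /* new: explicitly per-cpu        */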
[ dj: Add reference to commit that initiated the change. ]
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251030163839.307752-1-marco.crivellari@suse.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
devm_cxl_port_enumerate_dports() is no longer used after commit
4f06d81e7c ("cxl: Defer dport allocation for switch ports").
Delete it and the corresponding interface implemented in cxl_test.
Signed-off-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Add the Glymur DPU compatible to clients compatible list, as it needs
the workarounds.
Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
There seems to be nothing preventing this driver from being compile
tested so enable that to increase build coverage.
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
There seems to be nothing preventing this driver from being compile
tested so enable that to increase build coverage.
This also allows for compile testing of further Tegra drivers (e.g.
IOMMU driver).
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The IOMMU core attaches each device to a default domain on probe(). Then,
every new "attach" operation fundamentally means two things:
- detach the device from its currently attached (old) domain
- attach it to the given new domain
Modern IOMMU drivers following this pattern usually want to clean up the
things related to the old domain, so they call iommu_get_domain_for_dev()
to fetch the old domain.
Pass in the old domain pointer from the core to drivers, aligning with the
set_dev_pasid op that does so already.
Ensure all low-level attach functions in the core can forward the correct
old domain pointer; rework those functions as well.
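Purely as illustration of the direction (the exact prototype is whatever
the series defines; it mirrors how set_dev_pasid already receives the old
domain):

    struct iommu_domain_ops {
            int (*attach_dev)(struct iommu_domain *domain, struct device *dev,
                              struct iommu_domain *old);
            /* ... */
    };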
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The last gdev is the device that failed the __iommu_device_set_domain().
So, it doesn't need to be reverted, given it's attached to group->domain
already.
This is not a problem currently, since it is simply a re-attach. However,
the core will need to pass the old domain to __iommu_device_set_domain(),
and the old domain pointer would then be inconsistent between the failed
device and all the devices that succeeded before it, since all the prior
devices need to be reverted.
Avoid the re-attach for the last gdev, by breaking before the revert.
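A sketch of the control flow (not the literal kernel code); names such as
failed_gdev and old_domain are illustrative:

    for_each_group_device(group, gdev) {
            /* the failing gdev never left group->domain; skip the re-attach */
            if (gdev == failed_gdev)
                    break;
            __iommu_device_set_domain(group, gdev->dev, old_domain, 0);
    }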
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The set_dev_pasid for a release domain never gets called anyhow. So, there
is no point in defining a separate release_domain from the blocked_domain.
Simply reuse the blocked_domain.
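In sketch form, for a placeholder driver "foo":

    static struct iommu_domain foo_blocked_domain = {
            .type = IOMMU_DOMAIN_BLOCKED,
    };

    static const struct iommu_ops foo_iommu_ops = {
            .blocked_domain = &foo_blocked_domain,
            .release_domain = &foo_blocked_domain,  /* reuse, no separate domain */
    };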
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Following a coming core change to pass in the old domain pointer into the
attach_dev op and its callbacks, exynos_iommu_identity_attach() will need
this new argument too, which the release_device op doesn't provide.
Instead, the core provides a release_domain to attach to the device prior
to invoking the release_device callback. Thus, simply use that.
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Since the core now takes care of the require_direct case for the release
domain, simply use that via the release_domain op.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Generally an IOMMU driver should leave the translation as BLOCKED until the
translation entry is probed onto a struct device. When the struct device is
removed, the translation should be put back to BLOCKED.
Drivers that are able to work like this can set their release_domain to the
blocking domain, and the core code handles this work.
The exception is when the device has an IOMMU_RESV_DIRECT region, in which
case the OS should continuously allow translations for the given range. And
the core code generally prevents using a BLOCKED domain with this device.
Continue this logic for the device release and hoist some open coding from
drivers. If the device has dev->iommu->require_direct and the driver uses a
BLOCKED release_domain, override it to IDENTITY to preserve the semantics.
The only remaining required driver code for IOMMU_RESV_DIRECT should preset
an IDENTITY translation during early IOMMU startup for those devices.
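A simplified sketch of the core-side override being described (placement
and local naming are illustrative):

    struct iommu_domain *domain = ops->release_domain;

    /*
     * RESV_DIRECT devices must keep their direct-mapped ranges live, so
     * never drop them into a blocking domain on release.
     */
    if (dev->iommu->require_direct &&
        domain && domain->type == IOMMU_DOMAIN_BLOCKED)
            domain = ops->identity_domain;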
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Add a DL_WITH_MULTI_LARB flag to support hardware that connects to
multiple larbs, in preparation for mt8189. On mt8189, the display
connects to both larb1 and larb2, so device links must be added between
the display device and these two larbs.
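A hedged sketch of the intent (the larb device array is hypothetical):
with DL_WITH_MULTI_LARB the consumer gets one device link per connected
larb instead of a single link:

    for (i = 0; i < num_larbs; i++)
            device_link_add(disp_dev, larb_devs[i],
                            DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS);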
Signed-off-by: Zhengnan Chen <zhengnan.chen@mediatek.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
folio_nr_pages() is a faster helper function to get the number of pages when
NR_PAGES_IN_LARGE_FOLIO is enabled.
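An illustrative substitution (not quoted from the patch):

    /* Reads the stored count when NR_PAGES_IN_LARGE_FOLIO is enabled,
     * rather than deriving it from the folio size each time. */
    unsigned long nr = folio_nr_pages(folio);  /* was: folio_size(folio) >> PAGE_SHIFT */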
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When a GSI MAD packet is sent on the QP, it will potentially be
retried CMA_MAX_CM_RETRIES times with a timeout value of:
4.096usec * 2 ^ CMA_CM_RESPONSE_TIMEOUT
The above equates to ~64 seconds using the default CMA values.
The cm_id_priv's refcount will be incremented for this period.
Therefore, the timeout used when waiting for cm_id destruction must be
based on the effective timeout of the MAD packets. To provide additional
leeway, add 25% to this timeout and use that instead of the constant
10 second timeout, which may result in false negatives.
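A worked sketch of the arithmetic, using what are believed to be the
current defaults in cma.c (CMA_CM_RESPONSE_TIMEOUT == 20,
CMA_MAX_CM_RETRIES == 15); treat the exact values as illustrative:

    u64 mad_timeout_ns = 4096ULL << CMA_CM_RESPONSE_TIMEOUT; /* 4.096 us * 2^20 ~= 4.3 s */
    u64 total_ns = mad_timeout_ns * CMA_MAX_CM_RETRIES;      /* ~64 s across retries     */
    u64 wait_ns = total_ns + total_ns / 4;                    /* +25% leeway ~= 80 s      */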
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://patch.msgid.link/20251021132738.4179604-1-haakon.bugge@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The AMD Seattle DWC AHCI is behind an IOMMU and has 1-3 entries, so add
the 'iommus' property. There's not a specific compatible, so we can't
limit it to Seattle.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Fix 49 kernel-doc warnings in ib_verbs.h:
- Add struct short description for rdma_stat_desc, rdma_hw_stats.
- Fix kernel-doc format for struct members (use ':' instead of '-') for
several structs.
- Don't use "/**" kernel-doc notation for struct members in ib_device_ops
(most members are not documented and most of the kernel-doc was
not formatted correctly).
- Spell function parameters correctly in ib_dma_map_sgtable_attrs(),
ib_device_try_get(), rdma_roce_rescan_device().
- Add kernel-doc for the function parameter in
rdma_flow_label_to_udp_sport().
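For reference, the member notation kernel-doc expects, shown on a made-up
struct rather than one from ib_verbs.h:

    /**
     * struct foo_stat_desc - example counter descriptor
     * @name: counter name ("@name:" is correct; "@name -" is not)
     * @flags: counter flags
     */
    struct foo_stat_desc {
            const char *name;
            unsigned int flags;
    };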
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251020034320.3011094-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Document the SATA AHCI controller on the EIC7700 SoC platform,
including descriptions of its hardware configurations.
Signed-off-by: Yulin Lu <luyulin@eswincomputing.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>