Compare commits

...

330 Commits

Author SHA1 Message Date
Linus Torvalds
2061f18ad7 Merge tag 'caps-pr-20251204' of git://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux
Pull capabilities update from Serge Hallyn:
 "Ryan Foster had sent a patch to add testing of the
  rootid_owns_currentns() function. That patch pointed out
  that this function was not as clear as it should be. Fix it:

   - Clarify the intent of the function in the name

   - Split the function so that the base functionality is easier to test
     from a kunit test"

* tag 'caps-pr-20251204' of git://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux:
  Clarify the rootid_owns_currentns
2025-12-04 20:10:28 -08:00
Linus Torvalds
deb879faa9 Merge tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel
Pull more drm updates from Dave Airlie:
 "There was some additional intel code for color operations we wanted to
  land. However I discovered I missed a pull for the xe vfio driver
  which I had sorted into 6.20 in my brain, until Thomas mentioned it.

  This contains the xe vfio code, a bunch of xe fixes that were waiting
  and the i915 color management support. I'd like to include it as part
  of keeping the two main vendors on the same page and giving a good
  cross-driver experience for userspace when it starts using it.

  vfio:
   - add a vfio_pci variant driver for Intel

  xe/i915 display:
   - add plane color management support

  xe:
   - Add scope-based cleanup helper for runtime PM
   - vfio xe driver prerequisites and exports
   - fix vfio link error
   - Fix a memory leak
   - Fix a 64-bit division
   - vf migration fix
   - LRC pause fix"

* tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel: (25 commits)
  drm/i915/color: Enable Plane Color Pipelines
  drm/i915/color: Add 3D LUT to color pipeline
  drm/i915/color: Add registers for 3D LUT
  drm/i915/color: Program Plane Post CSC Registers
  drm/i915/color: Program Pre-CSC registers
  drm/i915/color: Add framework to program PRE/POST CSC LUT
  drm/i915: Add register definitions for Plane Post CSC
  drm/i915: Add register definitions for Plane Degamma
  drm/i915/color: Add plane CTM callback for D12 and beyond
  drm/i915/color: Preserve sign bit when int_bits is Zero
  drm/i915/color: Add framework to program CSC
  drm/i915/color: Create a transfer function color pipeline
  drm/i915/color: Add helper to create intel colorop
  drm/i915: Add intel_color_op
  drm/i915/display: Add identifiers for driver specific blocks
  drm/xe/pf: fix VFIO link error
  drm/xe: Protect against unset LRC when pausing submissions
  drm/xe/vf: Start re-emission from first unsignaled job during VF migration
  drm/xe/pf: Use div_u64 when calculating GGTT profile
  drm/xe: Fix memory leak when handling pagefault vma
  ...
2025-12-04 19:42:53 -08:00
Linus Torvalds
028bd4a146 Merge tag 'tpmdd-next-6.19-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm updates from Jarkko Sakkinen:
 "This contains changes to unify TPM return code translation between
  trusted_tpm2 and the TPM driver itself. Other than that, the changes
  are either bug fixes or minor improvements"

* tag 'tpmdd-next-6.19-rc1-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
  KEYS: trusted: Use tpm_ret_to_err() in trusted_tpm2
  tpm: Use -EPERM as fallback error code in tpm_ret_to_err
  tpm: Cap the number of PCR banks
  tpm: Remove tpm_find_get_ops
  tpm: add WQ_PERCPU to alloc_workqueue users
  tpm_crb: add missing loc parameter to kerneldoc
  tpm_crb: Fix a spelling mistake
  selftests: tpm2: Fix ill defined assertions
2025-12-04 19:30:09 -08:00
Linus Torvalds
16460bf96c Merge tag 'ata-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata updates from Niklas Cassel:

 - Add DT binding for the Eswin EIC7700 SoC SATA Controller (Yulin Lu)

 - Allow 'iommus' property in the Synopsys DWC AHCI SATA controller DT
   binding (Rob Herring)

 - Replace deprecated strcpy with strscpy in the pata_it821x driver
   (Thorsten Blum)

 - Add Iomega Clik! PCMCIA ATA/ATAPI Adapter PCMCIA ID to the
   pata_pcmcia driver (René Rebe)

 - Add ATA_QUIRK_NOLPM quirk for two Silicon Motion SSDs with broken LPM
   support (me)

 - Add flag WQ_PERCPU to the workqueue in the libata-sff helper library
   to explicitly request the use of the per-CPU behavior (Marco
   Crivellari)

* tag 'ata-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: libata-core: Disable LPM on Silicon Motion MD619{H,G}XCLDE3TC
  ata: pata_pcmcia: Add Iomega Clik! PCMCIA ATA/ATAPI Adapter
  ata: libata-sff: add WQ_PERCPU to alloc_workqueue users
  dt-bindings: ata: snps,dwc-ahci: Allow 'iommus' property
  ata: pata_it821x: Replace deprecated strcpy with strscpy in it821x_display_disk
  dt-bindings: ata: eswin: Document for EIC7700 SoC ahci
2025-12-04 19:27:11 -08:00
Linus Torvalds
bc69ed9752 Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
 "Just a bunch of fixes and cleanups, mostly very simple. Several
  features were merged through net-next this time around"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_pci: drop kernel.h
  vhost: switch to arrays of feature bits
  vhost/test: add test specific macro for features
  virtio: clean up features qword/dword terms
  vduse: add WQ_PERCPU to alloc_workqueue users
  virtio_balloon: add WQ_PERCPU to alloc_workqueue users
  vdpa/pds: use %pe for ERR_PTR() in event handler registration
  vhost: Fix kthread worker cgroup failure handling
  virtio: vdpa: Fix reference count leak in octep_sriov_enable()
  vdpa/mlx5: Fix incorrect error code reporting in query_virtqueues
  virtio: fix map ops comment
  virtio: fix virtqueue_set_affinity() docs
  virtio: standardize Returns documentation style
  virtio: fix grammar in virtio_map_ops docs
  virtio: fix grammar in virtio_queue_info docs
  virtio: fix whitespace in virtio_config_ops
  virtio: fix typo in virtio_device_ready() comment
  virtio: fix kernel-doc for mapping/free_coherent functions
  virtio_vdpa: fix misleading return in void function
2025-12-04 18:59:21 -08:00
Linus Torvalds
55aa394a5e Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
 "This has another new RDMA driver 'bng_en' for latest generation
  Broadcom NICs. There might be one more new driver still to come.

  Otherwise it is a fairly quite cycle. Summary:

   - Minor driver bug fixes and updates to cxgb4, rxe, rdmavt, bnxt_re,
     mlx5

   - Many bug fix patches for irdma

   - WQ_PERCPU annotations and system_dfl_wq changes

   - Improved mlx5 support for "other eswitches" and multiple PFs

   - 1600Gbps link speed reporting support. Four Digits Now!

   - New driver bng_en for latest generation Broadcom NICs

   - Bonding support for hns

   - Adjust mlx5's hmm based ODP to work with the very large address
     space created by the new 5 level paging default on x86

   - Lockdep fixups in rxe and siw"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
  RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep
  RDMA/siw: reclassify sockets in order to avoid false positives from lockdep
  RDMA/bng_re: Remove prefetch instruction
  RDMA/core: Reduce cond_resched() frequency in __ib_umem_release
  RDMA/irdma: Fix SRQ shadow area address initialization
  RDMA/irdma: Remove doorbell elision logic
  RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+
  RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY
  RDMA/irdma: Add missing mutex destroy
  RDMA/irdma: Fix SIGBUS in AEQ destroy
  RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2
  RDMA/irdma: Fix data race in irdma_free_pble
  RDMA/irdma: Fix data race in irdma_sc_ccq_arm
  RDMA/mlx5: Add support for 1600_8x lane speed
  RDMA/core: Add new IB rate for XDR (8x) support
  IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled
  RDMA/bnxt_re: Pass correct flag for dma mr creation
  RDMA/bnxt_re: Fix the inline size for GenP7 devices
  RDMA/hns: Support reset recovery for bond
  RDMA/hns: Support link state reporting for bond
  ...
2025-12-04 18:54:37 -08:00
Linus Torvalds
056daec292 Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
 "This is a pretty consequential cycle for iommufd, though this pull is
  not too big. It is based on a shared branch with VFIO that introduces
  VFIO_DEVICE_FEATURE_DMA_BUF, a DMABUF exporter for a VFIO device's MMIO
  PCI BARs. This was a large, multi-series journey over the last year
  and a half.

  Based on that work, IOMMUFD gains support for VFIO DMABUFs in its
  existing IOMMU_IOAS_MAP_FILE, which closes the last major gap in
  supporting PCI peer to peer transfers within VMs.

  In Joerg's iommu tree we have the "generic page table" work which aims
  to consolidate all the duplicated page table code in every iommu
  driver into a single algorithm. This will be used by iommufd to
  implement unique page table operations to start adding new features
  and improve performance.

  In here:

   - Expand IOMMU_IOAS_MAP_FILE to accept a DMABUF exported from VFIO.
     This is the first step to broader DMABUF support in iommufd, right
     now it only works with VFIO. This closes the last functional gap
     with classic VFIO type 1 to safely support PCI peer to peer DMA by
     mapping the VFIO device's MMIO into the IOMMU.

   - Relax SMMUv3 restrictions on nesting domains to better support
     qemu's sequence to have an identity mapping before the vSID is
     established"

* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
  iommu/arm-smmu-v3-iommufd: Allow attaching nested domain for GBPA cases
  iommufd/selftest: Add some tests for the dmabuf flow
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Allow MMIO pages in a batch
  iommufd: Allow a DMABUF to be revoked
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Add DMABUF to iopt_pages
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
2025-12-04 18:50:11 -08:00
Linus Torvalds
a3ebb59eee Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:

 - Move libvfio selftest artifacts in preparation for more tightly
   coupled integration with KVM selftests (David Matlack)

 - Fix comment typo in mtty driver (Chu Guangqing)

 - Support for new hardware revision in the hisi_acc vfio-pci variant
   driver where the migration registers can now be accessed via the PF.
   When enabled for this support, the full BAR can be exposed to the
   user (Longfang Liu)

 - Fix vfio cdev support for VF token passing, using the correct size
   for the kernel structure, thereby actually allowing userspace to
   provide a non-zero UUID token. Also set the match token callback for
   the hisi_acc, fixing VF token support for this vfio-pci variant
   driver (Raghavendra Rao Ananta)

 - Introduce internal callbacks on vfio devices to simplify and
   consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
   data, removing various ioctl intercepts with a more structured
   solution (Jason Gunthorpe)

 - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
   to be exposed through dma-buf objects with lifecycle managed through
   move operations. This enables low-level interactions such as a
   vfio-pci based SPDK drivers interacting directly with dma-buf capable
   RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
   able to build upon this support to fill a long standing feature gap
   versus the legacy vfio type1 IOMMU backend with an implementation of
   P2P support for VM use cases that better manages the lifecycle of the
   P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)

 - Convert eventfd triggering for error and request signals to use RCU
   mechanisms in order to avoid a 3-way lockdep reported deadlock issue
   (Alex Williamson)

 - Fix a 32-bit overflow introduced via dma-buf support manifesting with
   large DMA buffers (Alex Mastro)

 - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
   fault rather than at mmap time. This conversion serves to make use of
   huge PFNMAPs, to avoid corrected RAS events during reset by now being
   subject to vfio-pci-core's use of unmap_mapping_range(), and to
   enable a device readiness test after reset (Ankit Agrawal)

 - Refactoring of vfio selftests to support multi-device tests and split
   code to provide better separation between IOMMU and device objects.
   This work also enables a new test suite addition to measure parallel
   device initialization latency (David Matlack)

* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
  vfio: selftests: Add vfio_pci_device_init_perf_test
  vfio: selftests: Eliminate INVALID_IOVA
  vfio: selftests: Split libvfio.h into separate header files
  vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
  vfio: selftests: Rename vfio_util.h to libvfio.h
  vfio: selftests: Stop passing device for IOMMU operations
  vfio: selftests: Move IOVA allocator into iova_allocator.c
  vfio: selftests: Move IOMMU library code into iommu.c
  vfio: selftests: Rename struct vfio_dma_region to dma_region
  vfio: selftests: Upgrade driver logging to dev_err()
  vfio: selftests: Prefix logs with device BDF where relevant
  vfio: selftests: Eliminate overly chatty logging
  vfio: selftests: Support multiple devices in the same container/iommufd
  vfio: selftests: Introduce struct iommu
  vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
  vfio: selftests: Allow passing multiple BDFs on the command line
  vfio: selftests: Split run.sh into separate scripts
  vfio: selftests: Move run.sh into scripts directory
  vfio/nvgrace-gpu: wait for the GPU mem to be ready
  vfio/nvgrace-gpu: Inform devmem unmapped after reset
  ...
2025-12-04 18:42:48 -08:00
Linus Torvalds
ce5cfb0fa2 Merge tag 'iommu-updates-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:

 - Introduction of the generic IO page-table framework with support for
   Intel and AMD IOMMU formats from Jason.

   This has good potential for unifying more IO page-table
   implementations and making future enhancements easier. But it also
   needed quite a few fixes during development. All known issues have
   been fixed, but my feeling is that there is a higher potential than
   usual that more might be needed.

 - Intel VT-d updates:
    - Use right invalidation hint in qi_desc_iotlb()
    - Reduce the scope of INTEL_IOMMU_FLOPPY_WA

 - ARM-SMMU updates:
    - Qualcomm device-tree binding updates for Kaanapali and Glymur SoCs
      and a new clock for the TBU.
    - Fix error handling if level 1 CD table allocation fails.
    - Permit more than the architectural maximum number of SMRs for
      funky Qualcomm mis-implementations of SMMUv2.

 - Mediatek driver:
    - MT8189 iommu support

 - Move ARM IO-pgtable selftests to kunit

 - Device leak fixes for a couple of drivers

 - Random smaller fixes and improvements

* tag 'iommu-updates-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (81 commits)
  iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
  iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
  powerpc/pseries/svm: Make mem_encrypt.h self contained
  genpt: Make GENERIC_PT invisible
  iommupt: Avoid a compiler bug with sw_bit
  iommu/arm-smmu-qcom: Enable use of all SMR groups when running bare-metal
  iommupt: Fix unlikely flows in increase_top()
  iommu/amd: Propagate the error code returned by __modify_irte_ga() in modify_irte_ga()
  MAINTAINERS: Update my email address
  iommu/arm-smmu-v3: Fix error check in arm_smmu_alloc_cd_tables
  dt-bindings: iommu: qcom_iommu: Allow 'tbu' clock
  iommu/vt-d: Restore previous domain::aperture_end calculation
  iommu/vt-d: Fix unused invalidation hint in qi_desc_iotlb
  iommu/vt-d: Set INTEL_IOMMU_FLOPPY_WA depend on BLK_DEV_FD
  iommu/tegra: fix device leak on probe_device()
  iommu/sun50i: fix device leak on of_xlate()
  iommu/omap: simplify probe_device() error handling
  iommu/omap: fix device leaks on probe_device()
  iommu/mediatek-v1: add missing larb count sanity check
  iommu/mediatek-v1: fix device leaks on probe()
  ...
2025-12-04 18:05:06 -08:00
Linus Torvalds
5797d10ea4 Merge tag 'cxl-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull compute express link (CXL) updates from Dave Jiang:
 "The additions of note are adding CXL region remove support for locked
  CXL decoders, adding unit testing support for XOR address translation,
  and adding unit testing support for extended linear cache.

  Misc:
   - Remove incorrect page-allocator quirk section in documentation
   - Remove unused devm_cxl_port_enumerate_dports() function
   - Fix typo in cdat.c code comment
   - Replace use of system_wq with system_percpu_wq
   - Add locked CXL decoder support for region removal
   - Return when generic target updated
   - Rename region_res_match_cxl_range() to spa_maps_hpa()
   - Clarify comment in spa_maps_hpa()

  Enable unit testing for XOR address translation of SPA to DPA and vice versa:
   - Refactor address translation funcs for testing in cxl_region
   - Make the XOR calculations available for testing
   - Add cxl_translate module for address translation testing in
     cxl_test

  Extended Linear Cache changes:
   - Add extended linear cache size sysfs attribute
   - Adjust failure emission of extended linear cache detection in
     cxl_acpi
   - Added extended linear cache unit testing support in cxl_test

  Preparation refactor patches for PRM translation support:
   - Simplify cxl_rd_ops allocation and handling
   - Group xor arithmetic setup code in a single block
   - Remove local variable @inc in cxl_port_setup_targets()"

* tag 'cxl-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (22 commits)
  cxl/test: Assign overflow_err_count from log->nr_overflow
  cxl/test: Remove ret_limit race condition in mock_get_event()
  cxl/test: remove unused mock function for cxl_rcd_component_reg_phys()
  cxl/test: Add support for acpi extended linear cache
  cxl/test: Add cxl_test CFMWS support for extended linear cache
  cxl/test: Standardize CXL auto region size
  cxl/region: Remove local variable @inc in cxl_port_setup_targets()
  cxl/acpi: Group xor arithmetric setup code in a single block
  cxl: Simplify cxl_rd_ops allocation and handling
  cxl: Clarify comment in spa_maps_hpa()
  cxl: Rename region_res_match_cxl_range() to spa_maps_hpa()
  acpi/hmat: Return when generic target is updated
  cxl: Add handling of locked CXL decoder
  cxl/region: Add support to indicate region has extended linear cache
  cxl: Adjust extended linear cache failure emission in cxl_acpi
  cxl/test: Add cxl_translate module for address translation testing
  cxl/acpi: Make the XOR calculations available for testing
  cxl/region: Refactor address translation funcs for testing
  cxl/pci: replace use of system_wq with system_percpu_wq
  cxl: fix typos in cdat.c comments
  ...
2025-12-04 17:55:18 -08:00
Dave Airlie
c7685d1110 Merge tag 'topic/drm-intel-plane-color-pipeline-2025-12-04' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-next
drm/i915 topic pull request for v6.19:

Features and functionality:
- Add plane color management support (Uma, Chaitanya)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Jani Nikula <jani.nikula@intel.com>
Link: https://patch.msgid.link/e7129c6afd6208719d2f5124da86e810505e7a7b@intel.com
2025-12-05 10:27:57 +10:00
Dave Airlie
86fafc584c Merge tag 'drm-xe-next-fixes-2025-12-04' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
Driver Changes:
- Fix a memory leak (Mika)
- Fix a 64-bit division (Michal Wajdeczko)
- vf migration fix (Matt Brost)
- LRC pause fix (Tomasz Lis)

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/aTIGiHJnnMtqbDOO@fedora
2025-12-05 10:21:19 +10:00
Dave Airlie
e73c226204 Merge tag 'topic/xe-vfio-2025-12-04' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
Driver Changes:
- fix VFIO link error for built-in xe module (Arnd Bergmann)

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/aTIA9in2Bo_fA9TN@fedora
2025-12-05 10:16:47 +10:00
Dave Airlie
55a271a0f7 Merge tag 'topic/xe-vfio-2025-12-01' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next
Cross-subsystem Changes:
- Add device specific vfio_pci driver variant for intel graphics (Michal Winiarski)

Driver Changes:
- Add scope-based cleanup helper for runtime PM (Matt Roper)
- Additional xe driver prerequisites and exports (Michal Winiarski)

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/aS1bNpqeem6PIHrA@fedora
2025-12-05 10:16:25 +10:00
Thomas Hellström
3f1c07fc21 Merge drm/drm-next into drm-xe-next-fixes
Backmerging to be able to do a clean PR.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-04 22:54:56 +01:00
Uma Shankar
860daa4b0d drm/i915/color: Enable Plane Color Pipelines
Expose color pipeline and add ability to program it.

v2: Set bit to enable multisegmented lut
v3: s/drm_color_lut_32/drm_color_lut32 (Simon)
v4: - Fix dsb programming
    - Remove multi-segment LUT, they will be added in later patches
    - Add pipeline only to TGL+
    - Code Refactor

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-16-uma.shankar@intel.com
2025-12-04 19:44:36 +02:00
Chaitanya Kumar Borah
65db7a1f9c drm/i915/color: Add 3D LUT to color pipeline
Add helpers to program the 3D LUT registers and arm them.

LUT_3D_READY in LUT_3D_CTL is cleared by the HW once
the LUT buffer is loaded into its internal working RAM.
So by the time we try to load/commit new values, we expect
it to be cleared. If not, log an error and return
without writing new values. Do this only when writing with MMIO;
there is no way to read a register within DSB execution.
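
Roughly, the MMIO path check looks like this (register and helper names
are approximate, the in-tree code may differ):

  if (intel_de_read(display, LUT_3D_CTL(pipe)) & LUT_3D_READY) {
          drm_err(display->drm, "3D LUT not ready, skipping LUT update\n");
          return;
  }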

v2:
- Add information regarding LUT_3D_READY to commit message (Jani)
- Log error instead of a drm_warn and return without committing changes
  if 3DLUT HW is not ready to accept new values.
- Refactor intel_color_crtc_has_3dlut()
  Also remove Gen10 check (Suraj)
v3:
- Addressed review comments (Suraj)

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-15-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:47 +02:00
Chaitanya Kumar Borah
55b0f3cd09 drm/i915/color: Add registers for 3D LUT
Add registers needed to program 3D LUT

v2:
- Follow convention documented in i915_reg.h (Jani)
- Removing space in trailer (Suraj)
- Move registers to intel_color_regs.h

BSpec: 69378, 69379, 69380
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-14-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:47 +02:00
Uma Shankar
bf0fd73754 drm/i915/color: Program Plane Post CSC Registers
Extract the LUT and program plane post csc registers.

v2: Add DSB support
v3: Add support for single segment 1D LUT
v4:
- s/drm_color_lut_32/drm_color_lut32 (Simon)
- Move declaration to beginning of the function (Suraj)
- Remove multisegmented code, add it later
- Remove dead code for SDR planes, add it later
v5:
- Fix iterator issues
v6: Removed redundant variable (Suraj)

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-13-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:47 +02:00
Uma Shankar
82caa1c881 drm/i915/color: Program Pre-CSC registers
Add callback to program Pre-CSC LUT for TGL and beyond

v2: Add DSB support
v3: Add support for single segment 1D LUT color op
v4:
- s/drm_color_lut_32/drm_color_lut32/ (Simon)
- Change commit message (Suraj)
- Improve comments (Suraj)
- Remove multisegmented programming, to be added later
- Remove dead code for SDR planes, add when needed

BSpec: 50411, 50412, 50413, 50414
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-12-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:47 +02:00
Uma Shankar
3b7476e786 drm/i915/color: Add framework to program PRE/POST CSC LUT
Add framework that will help in loading LUT to Pre/Post CSC color
blocks.

v2: Add dsb support
v3: Align enum names
v4: Propagate change in lut data to crtc_state

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-11-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:47 +02:00
Uma Shankar
05df71544c drm/i915: Add register definitions for Plane Post CSC
Add macros to define Plane Post CSC registers

v2:
- Add Plane Post CSC Gamma Multi Segment Enable bit
- Add BSpec entries (Suraj)
v3:
- Fix checkpatch issues (Suraj)

BSpec: 50403, 50404, 50405, 50406, 50409, 50410,
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-10-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:47 +02:00
Uma Shankar
ed0ebbc89f drm/i915: Add register definitions for Plane Degamma
Add macros to define Plane Degamma registers

v2:
 - Add BSpec links (Suraj)
v3:
 - Add Bspec links in trailer (Suraj)
 - Fix checkpatch issues (Suraj)

BSpec: 50411, 50412, 50413, 50414
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-9-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:47 +02:00
Uma Shankar
f00d02707d drm/i915/color: Add plane CTM callback for D12 and beyond
Add callback for setting CTM block in platforms D12 and beyond

v2:
- Add dsb support
- Pass plane_state as we are now doing a uapi to hw state copy
- Add support for 3x4 matrix

v3:
- Add relevant header file
- Fix typo (Suraj)
- Add callback to TGL+ (Suraj)

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-8-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:46 +02:00
Chaitanya Kumar Borah
6f1e094fb6 drm/i915/color: Preserve sign bit when int_bits is Zero
When int_bits == 0, we lose the sign bit when we do the range check
and apply the mask.

Fix this by ensuring a minimum of one integer bit, which guarantees space
for the sign bit in fully fractional representations (e.g. S0.12).

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-7-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:46 +02:00
Chaitanya Kumar Borah
a78f1b6baf drm/i915/color: Add framework to program CSC
Add framework to program CSC. It enables copying of matrix from UAPI
to intel plane state. Also add helper functions which will eventually
program values to hardware.

Add a crtc state variable to track plane color change.

v2:
- Add crtc_state->plane_color_changed
- Improve comments (Suraj)
- s/intel_plane_*_color/intel_plane_color_* (Suraj)

v3:
- align parameters with open braces (Suraj)
- Improve commit message (Suraj)

v4:
- Re-arrange variable declaration (Suraj)

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-6-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:46 +02:00
Chaitanya Kumar Borah
ef10531681 drm/i915/color: Create a transfer function color pipeline
Add a color pipeline with three colorops in the sequence

        1D LUT - 3x4 CTM - 1D LUT

This pipeline can be used to do any color space conversion or HDR
tone mapping

v2: Change namespace to drm_plane_colorop*
v3: Use simpler/pre-existing colorops for first iteration
v4:
 - s/*_tf_*/*_color_* (Jani)
 - Refactor to separate files (Jani)
 - Add missing space in comment (Suraj)
 - Consolidate patch that adds/attaches pipeline property
v5:
 - Limit MAX_COLOR_PIPELINES to 2.(Suraj)
	Increase it as and when we add more pipelines.
 - Remove redundant initialization code (Suraj)
v6:
 - Use drm_plane_create_color_pipeline_property() (Arun)
	Now MAX_COLOR_PIPELINES is 1

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-5-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:46 +02:00
Chaitanya Kumar Borah
730df5065e drm/i915/color: Add helper to create intel colorop
Add intel colorop create helper

v2:
 - Make function names consistent (Jani)
 - Remove redundant code related to colorop state
 - Refactor code to separate files

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-4-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:46 +02:00
Chaitanya Kumar Borah
3e9b06559a drm/i915: Add intel_color_op
Add data structure to store intel specific details of colorop

v2:
 - Remove dead code
 - Convert macro to function (Jani)
 - Remove colorop state as it is not being used
 - Refactor to separate file

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-3-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:46 +02:00
Chaitanya Kumar Borah
4cd8a64b15 drm/i915/display: Add identifiers for driver specific blocks
Add macros to identify intel specific color blocks. It will help
in mapping drm_color_ops to intel color HW blocks

v2:- Prefix enums with INTEL_* (Jani, Suraj)
   - Remove unnecessary comments (Jani)
   - Commit message improvements (Suraj)

Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Signed-off-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Uma Shankar <uma.shankar@intel.com>
Link: https://patch.msgid.link/20251203085211.3663374-2-uma.shankar@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2025-12-04 19:43:46 +02:00
Arnd Bergmann
e45b5df47b drm/xe/pf: fix VFIO link error
The Makefile logic for building xe_sriov_vfio.o was added incorrectly,
as setting CONFIG_XE_VFIO_PCI=m means it doesn't get included into a
built-in xe driver:

ERROR: modpost: "xe_sriov_vfio_stop_copy_enter" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_stop_copy_exit" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_suspend_device" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_wait_flr_done" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_error" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_resume_data_enter" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_resume_device" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_resume_data_exit" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_data_write" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
ERROR: modpost: "xe_sriov_vfio_migration_supported" [drivers/vfio/pci/xe/xe-vfio-pci.ko] undefined!
WARNING: modpost: suppressed 3 unresolved symbol warnings because there were too many)

Check for CONFIG_XE_VFIO_PCI being enabled in the Makefile to decide whether to
include the object instead.

Fixes: bd45d46ffc ("drm/xe/pf: Export helpers for VFIO")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patch.msgid.link/20251204094154.1029357-1-arnd@kernel.org
(cherry picked from commit ef7de33544a7a6783d7afe09496da362d1e90ba1)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-04 16:34:00 +01:00
Jarkko Sakkinen
09b71a58ee KEYS: trusted: Use tpm_ret_to_err() in trusted_tpm2
Use tpm_ret_to_err() to transmute TPM return codes in trusted_tpm2.

Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@opinsys.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-12-03 22:55:28 +02:00
Jarkko Sakkinen
7fcf459ac8 tpm: Use -EPERM as fallback error code in tpm_ret_to_err
Using -EFAULT as the tpm_ret_to_err() fallback error code makes it
incompatible with how trusted keys transmute TPM return codes.

Change the fallback to -EPERM in order to gain compatibility with trusted
keys. In addition, map TPM_RC_HASH to -EINVAL in order to be compatible
with tpm2_seal_trusted() return values.
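
A minimal sketch of the resulting mapping (illustrative only, the
in-tree helper covers more return codes):

  static int tpm_ret_to_err_sketch(ssize_t ret)
  {
          if (ret < 0)
                  return ret;             /* transport errors pass through */

          switch (ret) {
          case TPM2_RC_SUCCESS:
                  return 0;
          case TPM2_RC_HASH:
                  return -EINVAL;         /* match tpm2_seal_trusted() */
          default:
                  return -EPERM;          /* new fallback, was -EFAULT */
          }
  }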

Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@opinsys.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-12-03 22:55:28 +02:00
Jarkko Sakkinen
faf07e611d tpm: Cap the number of PCR banks
tpm2_get_pcr_allocation() does not enforce any upper limit on the number
of banks. Cap the limit at eight banks so that out-of-bounds values
coming from external I/O cause only limited harm.
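
A minimal sketch of the cap (TPM2_MAX_PCR_BANKS is a hypothetical name
for the new limit of eight):

  /* clamp the bank count parsed from the GET_CAPABILITY response */
  nr_banks = min_t(u32, nr_banks, TPM2_MAX_PCR_BANKS);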

Cc: stable@vger.kernel.org # v5.10+
Fixes: bcfff8384f ("tpm: dynamically allocate the allocated_banks array")
Tested-by: Lai Yi <yi1.lai@linux.intel.com>
Reviewed-by: Jonathan McDowell <noodles@meta.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@opinsys.com>
2025-12-03 22:55:28 +02:00
Jonathan McDowell
020a0d8fea tpm: Remove tpm_find_get_ops
tpm_find_get_ops() looks for the first valid TPM if the caller passes in
NULL. All internal users have been converted to either associate
themselves with a TPM directly, or call tpm_default_chip() as part of
their setup. Remove the no longer necessary tpm_find_get_ops().
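
For reference, a caller that used to rely on tpm_find_get_ops(NULL) can
do roughly the following instead (error handling trimmed):

  struct tpm_chip *chip = tpm_default_chip();

  if (!chip)
          return -ENODEV;

  rc = tpm_pcr_extend(chip, pcr_idx, digests);
  put_device(&chip->dev);   /* drop the reference taken by tpm_default_chip() */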

Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jonathan McDowell <noodles@meta.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-12-03 22:55:28 +02:00
Marco Crivellari
e68407b6b0 tpm: add WQ_PERCPU to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again makes
use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
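
For the tpm driver this amounts to something like the following (the
exact call site and companion flags may differ):

  /* before: implicitly per-CPU */
  tpm_dev_wq = alloc_workqueue("tpm_dev_wq", WQ_MEM_RECLAIM, 0);

  /* after: per-CPU behavior requested explicitly */
  tpm_dev_wq = alloc_workqueue("tpm_dev_wq", WQ_MEM_RECLAIM | WQ_PERCPU, 0);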

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-12-03 22:55:28 +02:00
Stuart Yoder
6187221487 tpm_crb: add missing loc parameter to kerneldoc
Update the kerneldoc parameter definitions for __crb_go_idle
and __crb_cmd_ready to include the loc parameter.

Signed-off-by: Stuart Yoder <stuart.yoder@arm.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-12-03 22:55:27 +02:00
Chu Guangqing
76b1a8aebe tpm_crb: Fix a spelling mistake
The spelling of the word "requrest" is incorrect; it should be "request".

Signed-off-by: Chu Guangqing <chuguangqing@inspur.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-12-03 22:55:27 +02:00
Maurice Hieronymus
cffc934c0d selftests: tpm2: Fix ill defined assertions
Remove parentheses around assert statements in Python. With parentheses,
assert always evaluates to True, making the checks ineffective.

Signed-off-by: Maurice Hieronymus <mhi@mailbox.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-12-03 22:55:27 +02:00
Tomasz Lis
d72312d730 drm/xe: Protect against unset LRC when pausing submissions
While pausing submissions, it is possible to encounter an exec queue
which is still being created, and therefore doesn't have a valid xe_lrc
struct reference.

Protect against such a situation by checking for NULL before access.
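
A minimal sketch of the check (field names approximate):

  struct xe_lrc *lrc = q->lrc[0];

  if (!lrc)
          return;   /* queue still being created, nothing to pause yet */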

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Fixes: c25c1010df ("drm/xe/vf: Replay GuC submission state on pause / unpause")
Signed-off-by: Tomasz Lis <tomasz.lis@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251124222853.1900800-1-tomasz.lis@intel.com
(cherry picked from commit 07cf4b864f523f01d2bb522a05813df30b076ba8)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 10:16:17 +01:00
Matthew Brost
3d98a7164d drm/xe/vf: Start re-emission from first unsignaled job during VF migration
The LRC software ring tail is reset to the first unsignaled pending
job's head.

Fix the re-emission logic to begin submitting from the first unsignaled
job detected, rather than scanning all pending jobs, which can cause
imbalance.

v2:
 - Include missing local changes
v3:
 - s/skip_replay/restore_replay (Tomasz)

Fixes: c25c1010df ("drm/xe/vf: Replay GuC submission state on pause / unpause")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Tomasz Lis <tomasz.lis@intel.com>
Link: https://patch.msgid.link/20251121152750.240557-1-matthew.brost@intel.com
(cherry picked from commit 00937fe1921ab346b6f6a4beaa5c38e14733caa3)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 10:16:11 +01:00
Michal Wajdeczko
14a8d83cbe drm/xe/pf: Use div_u64 when calculating GGTT profile
This will fix the following error seen on some 32-bit configs:

"ERROR: modpost: "__udivdi3" [drivers/gpu/drm/xe/xe.ko] undefined!"

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511150929.3vUi6PEJ-lkp@intel.com/
Fixes: e448372e8a ("drm/xe/pf: Use migration-friendly GGTT auto-provisioning")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patch.msgid.link/20251115151323.10828-1-michal.wajdeczko@intel.com
(cherry picked from commit 0f4435a1f46efc3177eb082cd3f73e29da5ab86a)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 10:16:03 +01:00
Mika Kuoppala
bf213ac637 drm/xe: Fix memory leak when handling pagefault vma
When the pagefault handling code was moved to a new file, an extra
drm_exec_init() was added to the VMA path. This call is unnecessary because
xe_validation_ctx_init() already performs a drm_exec_init(), resulting in a
memory leak reported by kmemleak.

Remove the redundant drm_exec_init() from the VMA pagefault handling code.

Fixes: fb544b8445 ("drm/xe: Implement xe_pagefault_queue_work")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Stuart Summers <stuart.summers@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: intel-xe@lists.freedesktop.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20251120161435.3674556-1-mika.kuoppala@linux.intel.com
(cherry picked from commit 62519b77aecad22b525eda482660ffa127e7ad80)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 10:15:57 +01:00
Michał Winiarski
1f5556ec8b vfio/xe: Add device specific vfio_pci driver variant for Intel graphics
In addition to generic VFIO PCI functionality, the driver implements
the VFIO migration uAPI, allowing userspace to enable migration for
Intel Graphics SR-IOV Virtual Functions.
The driver binds to the VF device and uses the API exposed by the Xe
driver to transfer the VF migration data under the control of the PF
device.

Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://patch.msgid.link/20251127093934.1462188-5-michal.winiarski@intel.com
Link: https://lore.kernel.org/all/20251128125322.34edbeaf.alex@shazbot.org/
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 2e38c50ae4929f0b954fee69d428db7121452867)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 09:45:48 +01:00
Michał Winiarski
bd45d46ffc drm/xe/pf: Export helpers for VFIO
A device specific VFIO driver variant for Xe will implement VF
migration. Export everything that's needed for the migration ops.

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20251127093934.1462188-4-michal.winiarski@intel.com
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 17f22465c5a5573724c942ca7147b4024631ef87)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 09:42:37 +01:00
Michał Winiarski
5be29ebe9f drm/xe/pci: Introduce a helper to allow VF access to PF xe_device
In certain scenarios (such as VF migration), the VF driver needs to
interact with the PF driver.
Add a helper to allow the VF driver to access the PF xe_device.

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20251127093934.1462188-3-michal.winiarski@intel.com
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 8b3cce3ad9c78ce3dae1c178f99352d50e12a3c0)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 09:42:37 +01:00
Michał Winiarski
73834d03a5 drm/xe/pf: Enable SR-IOV VF migration
All of the necessary building blocks are now in place to support SR-IOV
VF migration.
Flip the enable/disable logic to match the VF code and disable the
feature only for platforms that don't meet the necessary prerequisites.
To allow more testing and experimentation, any missing prerequisites
will be ignored on DEBUG builds.

Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patch.msgid.link/20251127093934.1462188-2-michal.winiarski@intel.com
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 01c724aa7bf84e9d081a56e0cbf1d282678ce144)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 09:42:36 +01:00
Matt Roper
50a59230fa drm/xe/pm: Add scope-based cleanup helper for runtime PM
Add scope-based helpers for runtime PM that may be used to simplify
cleanup logic and potentially avoid goto-based cleanup.

For example, using

        guard(xe_pm_runtime)(xe);

will get runtime PM and cause a corresponding put to occur automatically
when the current scope is exited.  'xe_pm_runtime_noresume' can be used
as a guard replacement for the corresponding 'noresume' variant.
There's also an xe_pm_runtime_ioctl conditional guard that can be used
as a replacement for xe_runtime_ioctl():

        ACQUIRE(xe_pm_runtime_ioctl, pm)(xe);
        if ((ret = ACQUIRE_ERR(xe_pm_runtime_ioctl, &pm)) < 0)
                /* failed */

In a few rare cases (such as gt_reset_worker()) we need to ensure that
runtime PM is dropped when the function is exited by any means
(including error paths), but the function does not need to acquire
runtime PM because that has already been done earlier by a different
function.  For these special cases, an 'xe_pm_runtime_release_only'
guard can be used to handle the release without doing an acquisition.
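
A minimal sketch of that special case (function name hypothetical):

        static void gt_reset_worker_example(struct xe_gt *gt)
        {
                /* PM was acquired by the caller; only the put is scoped here */
                guard(xe_pm_runtime_release_only)(gt_to_xe(gt));

                /* ... do the reset work; the put happens on every exit path ... */
        }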

These guards will be used in future patches to eliminate some of our
goto-based cleanup.

v2:
 - Specify success condition for xe_pm runtime_ioctl as _RET >= 0 so
   that positive values will be properly identified as success and
   trigger destructor cleanup properly.

v3:
 - Add comments to the kerneldoc for the existing 'get' functions
   indicating that scope-based handling should be preferred where
   possible.  (Gustavo)

Cc: Gustavo Sousa <gustavo.sousa@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Link: https://patch.msgid.link/20251118164338.3572146-31-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 59e7528dbfd52efbed05e0f11b2143217a12bc74)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2025-12-01 09:41:33 +01:00
Michael S. Tsirkin
205dd7a5d6 virtio_pci: drop kernel.h
virtio UAPI headers really have no business pulling in kernel.h.
Replace it with const.h, which seems to be what's needed
for __KERNEL_DIV_ROUND_UP.

Fixes: 7c1ae151e8 ("virtio_pci: Introduce device parts access commands")
Cc: Yishai Hadas <yishaih@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Message-ID: <7a73b6c6af67e13b86633cd7bf11ad56b5d9809b.1763535341.git.mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-30 18:02:43 -05:00
Michael S. Tsirkin
dfe44d177a vhost: switch to arrays of feature bits
The current interface, where the caller has to know which 64-bit chunk
each bit is in, is inelegant and fragile.
Let's simply use arrays of bits.
By using unroll macros, text size grows only slightly.
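
The shape of the change, illustratively ('features' as a u64 array and
'feature_bits' as a bitmap are hypothetical names):

  /* before: callers must know which 64-bit word a feature bit lives in */
  bool has_f = !!(features[bit / 64] & BIT_ULL(bit % 64));

  /* after: one array of bits, indexed by the feature bit number alone */
  bool has_f = test_bit(bit, feature_bits);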

Message-ID: <637e182e139980e5930d50b928ba5ac072d628a9.1764225384.git.mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-30 18:02:43 -05:00
David Matlack
d721f52e31 vfio: selftests: Add vfio_pci_device_init_perf_test
Add a new VFIO selftest for measuring the time it takes to run
vfio_pci_device_init() in parallel for one or more devices.

This test serves as a manual regression test for the performance
improvement of commit e908f58b6b ("vfio/pci: Separate SR-IOV VF
dev_set"). For example, when running this test with 64 VFs under the
same PF:

Before:

  $ ./vfio_pci_device_init_perf_test -r vfio_pci_device_init_perf_test.iommufd.init 0000:1a:00.0 0000:1a:00.1 ...
  ...
  Wall time: 6.653234463s
  Min init time (per device): 0.101215344s
  Max init time (per device): 6.652755941s
  Avg init time (per device): 3.377609608s

After:

  $ ./vfio_pci_device_init_perf_test -r vfio_pci_device_init_perf_test.iommufd.init 0000:1a:00.0 0000:1a:00.1 ...
  ...
  Wall time: 0.122978332s
  Min init time (per device): 0.108121915s
  Max init time (per device): 0.122762761s
  Avg init time (per device): 0.113816748s

This test does not make any assertions about performance, since any such
assertion is likely to be flaky due to system differences and random
noise. However this test can be fed into automation to detect
regressions, and can be used by developers in the future to measure
performance optimizations.

Suggested-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-19-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:07 -07:00
David Matlack
b8e96c8805 vfio: selftests: Eliminate INVALID_IOVA
Eliminate INVALID_IOVA as there are platforms where UINT64_MAX is a
valid iova.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-18-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:07 -07:00
David Matlack
5fabc49abf vfio: selftests: Split libvfio.h into separate header files
Split out the contents of libvfio.h into separate header files, but keep
libvfio.h as the top-level include that all tests can use.

Put all new header files into a libvfio/ subdirectory to avoid future
name conflicts in include paths when libvfio is used by other selftests
like KVM.

No functional change intended.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-17-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:07 -07:00
David Matlack
19cf492c1b vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
Move the vfio_selftests_*() helpers into their own file libvfio.c. These
helpers have nothing to do with struct vfio_pci_device, so they don't
make sense in vfio_pci_device.c.

No functional change intended.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-16-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:07 -07:00
David Matlack
657d241e69 vfio: selftests: Rename vfio_util.h to libvfio.h
Rename vfio_util.h to libvfio.h to match the name of libvfio.mk.

No functional change intended.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-15-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:07 -07:00
David Matlack
831c37a5bf vfio: selftests: Stop passing device for IOMMU operations
Drop the struct vfio_pci_device wrappers for IOMMU map/unmap functions
and require tests to directly call iommu_map(), iommu_unmap(), etc. This
results in more concise code, and also makes it clear the map operations
are happening on a struct iommu, not necessarily on a specific device,
especially when multi-device tests are introduced.

Do the same for iova_allocator_init() as that function only needs the
struct iommu, not struct vfio_pci_device.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-14-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
2607a43613 vfio: selftests: Move IOVA allocator into iova_allocator.c
Move the IOVA allocator into its own file, to provide better separation
between the allocator and the struct vfio_pci_device helper code.

The allocator could go into iommu.c, but it is standalone enough that a
separate file seems cleaner. This also continues the trend of having a
.c for every major object in VFIO selftests (vfio_pci_device.c,
vfio_pci_driver.c, iommu.c, and now iova_allocator.c).

No functional change intended.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-13-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
2aca571089 vfio: selftests: Move IOMMU library code into iommu.c
Move all the IOMMU related library code into its own file, iommu.c.
This provides a better separation between the vfio_pci_device helper
code and the iommu code.

No functional change intended.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-12-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
9a659d74f2 vfio: selftests: Rename struct vfio_dma_region to dma_region
Rename struct vfio_dma_region to dma_region. This is in preparation for
separating the VFIO PCI device library code from the IOMMU library code.
This name change also better reflects the fact that DMA mappings can be
managed by either VFIO or IOMMUFD. i.e. the "vfio_" prefix is
misleading.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-11-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
28a84da744 vfio: selftests: Upgrade driver logging to dev_err()
Upgrade various logging in the VFIO selftests drivers from dev_info() to
dev_err(). All of these logs indicate scenarios that may be unexpected.
For example, the logging during probing indicates devices that match but
aren't supported by the driver. And the memcpy errors can indicate
a problem if the caller was not trying to do something like exercise I/O
fault handling. Exercising I/O fault handling is certainly a valid thing
to do, but the driver can't infer the caller's expectations, so better
to just log with dev_err().

Suggested-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-10-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
c48545442e vfio: selftests: Prefix logs with device BDF where relevant
Prefix log messages with the device's BDF where relevant. This will help
with understanding VFIO selftests logs when tests are run with multiple
devices.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-9-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
6c74d9830d vfio: selftests: Eliminate overly chatty logging
Eliminate overly chatty logs that are printed during almost every test.
These logs are adding more noise than value. If a test cares about this
information it can log it itself. This is especially true as the VFIO
selftests gains support for multiple devices in a single test (which
multiplies all these logs).

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-8-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
d8470a775c vfio: selftests: Support multiple devices in the same container/iommufd
Support tests that want to add multiple devices to the same
container/iommufd by decoupling struct vfio_pci_device from
struct iommu.

Multi-devices tests can now put multiple devices in the same
container/iommufd like so:

  iommu = iommu_init(iommu_mode);

  device1 = vfio_pci_device_init(bdf1, iommu);
  device2 = vfio_pci_device_init(bdf2, iommu);
  device3 = vfio_pci_device_init(bdf3, iommu);

  ...

  vfio_pci_device_cleanup(device3);
  vfio_pci_device_cleanup(device2);
  vfio_pci_device_cleanup(device1);

  iommu_cleanup(iommu);

To account for the new separation of vfio_pci_device and iommu, update
existing tests to initialize and cleanup a struct iommu.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-7-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
c9756b4d27 vfio: selftests: Introduce struct iommu
Introduce struct iommu, which logically represents either a VFIO
container or an iommufd IOAS, depending on which IOMMU mode is used by
the test.

This will be used in a subsequent commit to allow devices to be added to
the same container/iommufd.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-6-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
dd56ef239d vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
Rename struct vfio_iommu_mode to struct iommu_mode since the mode can
include iommufd. This also prepares for splitting out all the IOMMU code
into its own structs/helpers/files which are independent from the
vfio_pci_device code.

No functional change intended.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-5-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
6282ca8585 vfio: selftests: Allow passing multiple BDFs on the command line
Add support for passing multiple device BDFs to a test via the command
line. This is a prerequisite for multi-device tests.

Single-device tests can continue using vfio_selftests_get_bdf(), which
will continue to return argv[argc - 1] (if it is a BDF string), or the
environment variable $VFIO_SELFTESTS_BDF otherwise.

For multi-device tests, a new helper called vfio_selftests_get_bdfs() is
introduced which will return an array of all BDFs found at the end of
argv[], as well as the number of BDFs found (passed back to the caller
via argument). The array of BDFs returned does not need to be freed by
the caller.

The environment variable VFIO_SELFTESTS_BDF continues to support only a
single BDF for the time being.
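
A hedged usage sketch of the new helper follows; the exact signature of
vfio_selftests_get_bdfs() is an assumption based on the description above:

  /* Illustrative only: signature and types are assumptions. */
  int nr_bdfs;
  const char **bdfs = vfio_selftests_get_bdfs(argc, argv, &nr_bdfs);

  for (int i = 0; i < nr_bdfs; i++)
          printf("using device %s\n", bdfs[i]);

  /* The array points at argv[] entries, so it is not freed here. */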

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-4-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
fa246a1d06 vfio: selftests: Split run.sh into separate scripts
Split run.sh into separate scripts (setup.sh, run.sh, cleanup.sh) to
enable multi-device testing, and prepare for VFIO selftests
automatically detecting which devices to use for testing by storing
device metadata on the filesystem.

 - setup.sh takes one or more BDFs as arguments and sets up each device.
   Metadata about each device is stored on the filesystem in the
   directory:

	   ${TMPDIR:-/tmp}/vfio-selftests-devices

   Within this directory is a directory for each BDF, and then files in
   those directories that cleanup.sh uses to cleanup the device.

 - run.sh runs a selftest by passing it the BDFs of all set up devices.

 - cleanup.sh takes zero or more BDFs as arguments and cleans up each
   device. If no BDFs are provided, it cleans up all devices.

This split enables multi-device testing by allowing multiple BDFs to be
set up and passed into tests:

For example:

  $ tools/testing/selftests/vfio/scripts/setup.sh <BDF1> <BDF2>
  $ tools/testing/selftests/vfio/scripts/setup.sh <BDF3>
  $ tools/testing/selftests/vfio/scripts/run.sh echo
  <BDF1> <BDF2> <BDF3>
  $ tools/testing/selftests/vfio/scripts/cleanup.sh

In the future, VFIO selftests can automatically detect set up devices by
inspecting ${TMPDIR:-/tmp}/vfio-selftests-devices. This will avoid the
need for the run.sh script.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-3-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
David Matlack
2d5dbd3156 vfio: selftests: Move run.sh into scripts directory
Move run.sh into a new sub-directory, scripts/. This directory will be used
to house various helper scripts to be used by humans and automation for
running VFIO selftests.

Opportunistically also switch run.sh from TEST_PROGS_EXTENDED to
TEST_FILES. The former is for actual test executables that are just not
run by default. TEST_FILES is a better fit for helper scripts.

No functional change intended.

Reviewed-by: Alex Mastro <amastro@fb.com>
Tested-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20251126231733.3302983-2-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:58:06 -07:00
Alex Williamson
a63a03afd8 Merge tag 'vfio-v6.18-rc6' into v6.19/vfio/next
Merge mainline vfio-selftest updates for ongoing v6.19 work.

Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:54:22 -07:00
Ankit Agrawal
a23b10608d vfio/nvgrace-gpu: wait for the GPU mem to be ready
Speculative CPU prefetches to GPU memory issued before the GPU is
ready after a reset can cause harmless corrected RAS events to
be logged on Grace systems. It is thus preferred that the
mapping not be re-established until the GPU is ready post reset.

The GPU readiness can be checked through BAR0 registers similar
to the checking at the time of device probe.

It can take several seconds for the GPU to be ready. So it is
desirable that the time overlaps as much of the VM startup as
possible to reduce impact on the VM bootup time. The GPU
readiness state is thus checked on the first fault/huge_fault
request or read/write access which amortizes the GPU readiness
time.

The first fault and read/write access check the GPU state when the
reset_done flag is set, which denotes that the GPU has just been
reset. The memory_lock is taken across map/access to avoid
races with GPU reset.

Also check if the memory is enabled before waiting for the GPU
to be ready; otherwise the readiness check would block for 30s.

Lastly, add PM handling wrapping around read/write access.

Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Vikram Sethi <vsethi@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-7-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:09:26 -07:00
Ankit Agrawal
dfe765499a vfio/nvgrace-gpu: Inform devmem unmapped after reset
Introduce a new flag reset_done to notify that the GPU has just
been reset and the mapping to the GPU memory is zapped.

Implement the reset_done handler to set this new variable. It will be
used in later patches to wait for the GPU memory to be ready before
doing any mapping or access.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-6-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:26 -07:00
Ankit Agrawal
7d055071d7 vfio/nvgrace-gpu: split the code to wait for GPU ready
Split the function that checks for the GPU device being ready at probe
time.

Move the code that waits for the GPU to be ready through BAR0 register
reads into a separate function so that it can be reused.

This also fixes a bug where the return status in case of a timeout gets
overridden by the return value of pci_enable_device(). With the fix, a
timeout generates an error as initially intended.
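
For illustration, a minimal sketch of the bug pattern described above;
wait_for_gpu_ready() is a placeholder name for the split-out helper, not
the driver's actual function:

  /* before: a timeout from the readiness wait is silently lost */
  ret = wait_for_gpu_ready(nvdev);        /* may return -ETIMEDOUT */
  ret = pci_enable_device(pdev);          /* overwrites the earlier error */

  /* after: propagate the timeout before enabling the device */
  ret = wait_for_gpu_ready(nvdev);
  if (ret)
          return ret;
  ret = pci_enable_device(pdev);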

Fixes: d85f69d520 ("vfio/nvgrace-gpu: Check the HBM training and C2C link status")

Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-5-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:25 -07:00
Ankit Agrawal
7f5764e179 vfio: use vfio_pci_core_setup_barmap to map bar in mmap
Remove code duplication in vfio_pci_core_mmap by calling
vfio_pci_core_setup_barmap to perform the bar mapping.

No functional change is intended.

Cc: Donald Dutile <ddutile@redhat.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-4-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:25 -07:00
Ankit Agrawal
9db65489b8 vfio/nvgrace-gpu: Add support for huge pfnmap
NVIDIA's Grace based systems have large device memory. The device
memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
module could make use of the huge PFNMAP support added in mm [1].

To make use of the huge pfnmap support, a fault/huge_fault ops based
mapping mechanism needs to be implemented. Currently the nvgrace-gpu
module relies on remap_pfn_range() to do the mapping during VM bootup.
Replace it to instead rely on fault and use vfio_pci_vmf_insert_pfn()
to set up the mapping.

Moreover, to enable huge pfnmap, the nvgrace-gpu module is updated with
a huge_fault ops implementation. The implementation establishes the
mapping according to the requested order. Note that if the PFN or the
VMA address is not aligned to the order, the mapping falls back to
the PTE level.

Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]

Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Vikram Sethi <vsethi@nvidia.com>
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:25 -07:00
Ankit Agrawal
9b92bc7554 vfio: refactor vfio_pci_mmap_huge_fault function
Refactor vfio_pci_mmap_huge_fault to split out the code that maps the
VMA at the PTE/PMD/PUD level into a separate function.

Export the new function to be used by the nvgrace-gpu module.

Move the alignment check code, which verifies that the pfn and the VMA
VA are aligned to the page order, to the header file and make it
inline.

No functional change is intended.
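
For illustration, a sketch of the kind of inline alignment check this
describes; the name and exact form are assumptions, not the actual code:

  static inline bool vfio_pci_order_aligned(unsigned long vaddr,
                                            unsigned long pfn,
                                            unsigned int order)
  {
          /* Both the VA and the pfn must be aligned to the mapping order. */
          return IS_ALIGNED(vaddr, PAGE_SIZE << order) &&
                 IS_ALIGNED(pfn, 1UL << order);
  }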

Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-2-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:07:25 -07:00
Alex Mastro
590d745680 dma-buf: fix integer overflow in fill_sg_entry() for buffers >= 8GiB
fill_sg_entry() splits large DMA buffers into multiple scatter-gather
entries, each holding up to UINT_MAX bytes. When calculating the DMA
address for entries beyond the second one, the expression (i * UINT_MAX)
causes integer overflow due to 32-bit arithmetic.

This manifests when the input arg length is >= 8 GiB, which results in
looping with i >= 2.

Fix by casting i to dma_addr_t before multiplication.
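
A minimal sketch of the arithmetic; variable names are illustrative, not
the actual fill_sg_entry() locals:

  unsigned int i = 2;
  dma_addr_t addr = 0, bad, good;

  bad  = addr + i * UINT_MAX;               /* 32-bit multiply wraps for i >= 2 */
  good = addr + (dma_addr_t)i * UINT_MAX;   /* widened first, 64-bit multiply */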

Fixes: 3aa31a8bb1 ("dma-buf: provide phys_vec to scatter-gather mapping routine")
Signed-off-by: Alex Mastro <amastro@fb.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/r/20251125-dma-buf-overflow-v1-1-b70ea1e6c4ba@fb.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:06:25 -07:00
Alex Williamson
98693e0897 vfio/pci: Use RCU for error/request triggers to avoid circular locking
Thanks to a device generating an ACS violation during bus reset,
lockdep reported the following circular locking issue:

CPU0: SET_IRQS (MSI/X): holds igate, acquires memory_lock
CPU1: HOT_RESET: holds memory_lock, acquires pci_bus_sem
CPU2: AER: holds pci_bus_sem, acquires igate

This results in a potential 3-way deadlock.

Remove the pci_bus_sem->igate leg of the triangle by using RCU
to peek at the eventfd rather than locking it with igate.
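
A rough sketch of the RCU pattern, not the exact vfio-pci code:

  struct eventfd_ctx *ctx;

  rcu_read_lock();
  ctx = rcu_dereference(vdev->err_trigger);
  if (ctx)
          eventfd_signal(ctx);
  rcu_read_unlock();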

Fixes: 3be3a074cf ("vfio-pci: Don't use device_lock around AER interrupt setup")
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251124223623.2770706-1-alex@shazbot.org
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-28 10:04:27 -07:00
Joerg Roedel
0d081b1694 Merge branches 'arm/smmu/updates', 'arm/smmu/bindings', 'mediatek', 'nvidia/tegra', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2025-11-28 08:44:21 +01:00
Jason Gunthorpe
1eb0ae6fbd iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
If the IOVA is limited to less than 48 the page table will be constructed
with a 3 level configuration which is unsupported by hardware.

Like the second stage the caller needs to pass in both the top_level an
the vasz to specify a table that has more levels than required to hold the
IOVA range.

Fixes: 6cbc09b771 ("iommu/vt-d: Restore previous domain::aperture_end calculation")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28 08:43:55 +01:00
Jason Gunthorpe
d856f9d278 iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
VT-d second stage HW specifies both the maximum IOVA and the supported
table walk starting points. Weirdly there is HW that only supports a 4
level walk but has a maximum IOVA that only needs 3.

The current code miscalculates this and creates a wrongly sized page table
which ultimately fails the compatibility check for number of levels.

This is fixed by allowing the page table to be created with both a vasz
and top_level input. The vasz will set the aperture for the domain while
the top_level will set the page table geometry.

Add top_level to vtdss and correct the logic in VT-d to generate the right
top_level and vasz from mgaw and sagaw.

Fixes: d373449d8e ("iommu/vt-d: Use the generic iommu page table")
Reported-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/r/8f257d2651eb8a4358fcbd47b0145002e5f1d638.1764237717.git.calvin@wbinvd.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28 08:43:03 +01:00
Jason Gunthorpe
416d9a220e powerpc/pseries/svm: Make mem_encrypt.h self contained
Add the missing forward declarations and includes so it does not have
implicit dependencies. mem_encrypt.h is a public header imported by
drivers. Users should not have to guess what include files are needed.

Resolves a kbuild splat:

   In file included from drivers/iommu/generic_pt/fmt/iommu_amdv1.c:15:
   In file included from drivers/iommu/generic_pt/fmt/iommu_template.h:36:
   In file included from drivers/iommu/generic_pt/fmt/amdv1.h:23:
   In file included from include/linux/mem_encrypt.h:17:
>> arch/powerpc/include/asm/mem_encrypt.h:13:49: warning: declaration of 'struct device' will not be visible outside of this function [-Wvisibility]
      13 | static inline bool force_dma_unencrypted(struct device *dev)

Fixes: 879ced2bab ("iommupt: Add the AMD IOMMU v1 page table format")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511161358.rS5pSb3U-lkp@intel.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28 08:40:55 +01:00
Geert Uytterhoeven
01569c216d genpt: Make GENERIC_PT invisible
There is no point in asking the user about the Generic Radix Page
Table API:
  - All IOMMU drivers that use this API already select GENERIC_PT when
    needed,
  - Most users probably do not know what to answer anyway.

Fixes: 7c5b184db7 ("genpt: Generic Page Table base API")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-28 08:40:24 +01:00
Niklas Cassel
6ce0dd9f54 ata: libata-core: Disable LPM on Silicon Motion MD619{H,G}XCLDE3TC
According to a user report, the Silicon Motion MD619HXCLDE3TC SSD and
the Silicon Motion MD619GXCLDE3TC SSD have problems with LPM.

Reported-by: Yihang Li <liyihang9@h-partners.com>
Closes: https://lore.kernel.org/linux-ide/20251121073502.3388239-1-liyihang9@h-partners.com/
Signed-off-by: Niklas Cassel <cassel@kernel.org>
2025-11-28 06:58:24 +01:00
Stefan Metzmacher
80a85a771d RDMA/rxe: reclassify sockets in order to avoid false positives from lockdep
While developing IPPROTO_SMBDIRECT support for the code
under fs/smb/common/smbdirect [1], I noticed false positives like this:

[+0,003927] ============================================
[+0,000532] WARNING: possible recursive locking detected
[+0,000611] 6.18.0-rc5-metze-kasan-lockdep.02+ #1 Tainted: G           OE
[+0,000835] --------------------------------------------
[+0,000729] ksmbd:r5445/3609 is trying to acquire lock:
[+0,000709] ffff88800b9570f8 (k-sk_lock-AF_INET){+.+.}-{0:0},
                              at: inet_shutdown+0x52/0x360
[+0,000831]
            but task is already holding lock:
[+0,000684] ffff88800654af78 (k-sk_lock-AF_INET){+.+.}-{0:0},
                           at: smbdirect_sk_close+0x122/0x790 [smbdirect]
[+0,000928]
            other info that might help us debug this:
[+0,005552]  Possible unsafe locking scenario:

[+0,000723]        CPU0
[+0,000359]        ----
[+0,000377]   lock(k-sk_lock-AF_INET);
[+0,000478]   lock(k-sk_lock-AF_INET);
[+0,000498]
             *** DEADLOCK ***

[+0,001012]  May be due to missing lock nesting notation

[+0,000831] 3 locks held by ksmbd:r5445/3609:
[+0,000484]  #0: ffff88800654af78 (k-sk_lock-AF_INET){+.+.}-{0:0},
                           at: smbdirect_sk_close+0x122/0x790 [smbdirect]
[+0,001000]  #1: ffff888020a40458 (&id_priv->handler_mutex){+.+.}-{4:4},
                           at: rdma_lock_handler+0x17/0x30 [rdma_cm]
[+0,000982]  #2: ffff888020a40350 (&id_priv->qp_mutex){+.+.}-{4:4},
                           at: rdma_destroy_qp+0x5d/0x1f0 [rdma_cm]
[+0,000934]
            stack backtrace:
[+0,000589] CPU: 0 UID: 0 PID: 3609 Comm: ksmbd:r5445 Kdump: loaded
             Tainted: G           OE
             6.18.0-rc5-metze-kasan-lockdep.02+ #1 PREEMPT(voluntary)
[+0,000023] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[+0,000004] Hardware name: innotek GmbH VirtualBox/VirtualBox,
            BIOS VirtualBox 12/01/2006
...
[+0,000010] print_deadlock_bug+0x245/0x330
[+0,000014] validate_chain+0x32a/0x590
[+0,000012] __lock_acquire+0x535/0xc30
[+0,000013] lock_acquire.part.0+0xb3/0x240
[+0,000017] ? inet_shutdown+0x52/0x360
[+0,000013] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000007] ? mark_held_locks+0x46/0x90
[+0,000012] lock_acquire+0x60/0x140
[+0,000006] ? inet_shutdown+0x52/0x360
[+0,000028] lock_sock_nested+0x3b/0xf0
[+0,000009] ? inet_shutdown+0x52/0x360
[+0,000008] inet_shutdown+0x52/0x360
[+0,000010] kernel_sock_shutdown+0x5b/0x90
[+0,000011] rxe_qp_do_cleanup+0x4ef/0x810 [rdma_rxe]
[+0,000043] ? __pfx_rxe_qp_do_cleanup+0x10/0x10 [rdma_rxe]
[+0,000030] execute_in_process_context+0x2b/0x170
[+0,000013] rxe_qp_cleanup+0x1c/0x30 [rdma_rxe]
[+0,000021] __rxe_cleanup+0x1cf/0x2e0 [rdma_rxe]
[+0,000036] ? __pfx___rxe_cleanup+0x10/0x10 [rdma_rxe]
[+0,000020] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? __kasan_check_read+0x11/0x20
[+0,000012] rxe_destroy_qp+0xe1/0x230 [rdma_rxe]
[+0,000035] ib_destroy_qp_user+0x217/0x450 [ib_core]
[+0,000074] rdma_destroy_qp+0x83/0x1f0 [rdma_cm]
[+0,000034] smbdirect_connection_destroy_qp+0x98/0x2e0 [smbdirect]
[+0,000017] ? __pfx_smb_direct_logging_needed+0x10/0x10 [ksmbd]
[+0,000044] smbdirect_connection_destroy+0x698/0xed0 [smbdirect]
[+0,000023] ? __pfx_smbdirect_connection_destroy+0x10/0x10 [smbdirect]
[+0,000033] ? __pfx_smb_direct_logging_needed+0x10/0x10 [ksmbd]
[+0,000031] smbdirect_connection_destroy_sync+0x42b/0x9f0 [smbdirect]
[+0,000029] ? mark_held_locks+0x46/0x90
[+0,000012] ? __pfx_smbdirect_connection_destroy_sync+0x10/0x10 [smbdirect]
[+0,000019] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000007] ? trace_hardirqs_on+0x64/0x70
[+0,000029] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000010] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? __smbdirect_connection_schedule_disconnect+0x339/0x4b0
[+0,000021] smbdirect_sk_destroy+0xb0/0x680 [smbdirect]
[+0,000024] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000006] ? trace_hardirqs_on+0x64/0x70
[+0,000006] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000005] ? __local_bh_enable_ip+0xba/0x150
[+0,000011] sk_common_release+0x66/0x340
[+0,000010] smbdirect_sk_close+0x12a/0x790 [smbdirect]
[+0,000023] ? ip_mc_drop_socket+0x1e/0x240
[+0,000013] inet_release+0x10a/0x240
[+0,000011] smbdirect_sock_release+0x502/0xe80 [smbdirect]
[+0,000015] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000024] sock_release+0x91/0x1c0
[+0,000010] smb_direct_free_transport+0x31/0x50 [ksmbd]
[+0,000025] ksmbd_conn_free+0x1d0/0x240 [ksmbd]
[+0,000040] smb_direct_disconnect+0xb2/0x120 [ksmbd]
[+0,000023] ? srso_alias_return_thunk+0x5/0xfbef5
[+0,000018] ksmbd_conn_handler_loop+0x94e/0xf10 [ksmbd]
...

I'll also add reclassification to the smbdirect socket code [1],
but I think it's better to have it in both directions
(below and above the RDMA layer).

[1]
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/master-ipproto-smbdirect

Cc: Zhu Yanjun <zyjzyj2000@gmail.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251127105614.2040922-1-metze@samba.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-27 07:10:02 -05:00
Jason Gunthorpe
5de863efbf iommupt: Avoid a compiler bug with sw_bit
gcc 13, in some cases, gets confused if the __builtin_constant_p() is
inside the switch. It thinks that bitnr can have the value max+1 and
fails. Lift the check outside the switch to avoid it.

Fixes: ef7bfe5bbf ("iommupt/x86: Support SW bits and permit PT_FEAT_DMA_INCOHERENT")
Fixes: 5448c1558f ("iommupt: Add the Intel VT-d second stage page table format")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511242012.I7g504Ab-lkp@intel.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-27 12:46:10 +01:00
Stefan Metzmacher
3a2c32d357 RDMA/siw: reclassify sockets in order to avoid false positives from lockdep
While developing IPPROTO_SMBDIRECT support for the code
under fs/smb/common/smbdirect [1], I noticed false positives like this:

[T79] ======================================================
[T79] WARNING: possible circular locking dependency detected
[T79] 6.18.0-rc4-metze-kasan-lockdep.01+ #1 Tainted: G           OE
[T79] ------------------------------------------------------
[T79] kworker/2:0/79 is trying to acquire lock:
[T79] ffff88801f968278 (sk_lock-AF_INET){+.+.}-{0:0},
                        at: sock_set_reuseaddr+0x14/0x70
[T79]
        but task is already holding lock:
[T79] ffffffffc10f7230 (lock#9){+.+.}-{4:4},
                        at: rdma_listen+0x3d2/0x740 [rdma_cm]
[T79]
        which lock already depends on the new lock.

[T79]
        the existing dependency chain (in reverse order) is:
[T79]
        -> #1 (lock#9){+.+.}-{4:4}:
[T79]        __lock_acquire+0x535/0xc30
[T79]        lock_acquire.part.0+0xb3/0x240
[T79]        lock_acquire+0x60/0x140
[T79]        __mutex_lock+0x1af/0x1c10
[T79]        mutex_lock_nested+0x1b/0x30
[T79]        cma_get_port+0xba/0x7d0 [rdma_cm]
[T79]        rdma_bind_addr_dst+0x598/0x9a0 [rdma_cm]
[T79]        cma_bind_addr+0x107/0x320 [rdma_cm]
[T79]        rdma_resolve_addr+0xa3/0x830 [rdma_cm]
[T79]        destroy_lease_table+0x12b/0x420 [ksmbd]
[T79]        ksmbd_NTtimeToUnix+0x3e/0x80 [ksmbd]
[T79]        ndr_encode_posix_acl+0x6e9/0xab0 [ksmbd]
[T79]        ndr_encode_v4_ntacl+0x53/0x870 [ksmbd]
[T79]        __sys_connect_file+0x131/0x1c0
[T79]        __sys_connect+0x111/0x140
[T79]        __x64_sys_connect+0x72/0xc0
[T79]        x64_sys_call+0xe7d/0x26a0
[T79]        do_syscall_64+0x93/0xff0
[T79]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[T79]
        -> #0 (sk_lock-AF_INET){+.+.}-{0:0}:
[T79]        check_prev_add+0xf3/0xcd0
[T79]        validate_chain+0x466/0x590
[T79]        __lock_acquire+0x535/0xc30
[T79]        lock_acquire.part.0+0xb3/0x240
[T79]        lock_acquire+0x60/0x140
[T79]        lock_sock_nested+0x3b/0xf0
[T79]        sock_set_reuseaddr+0x14/0x70
[T79]        siw_create_listen+0x145/0x1540 [siw]
[T79]        iw_cm_listen+0x313/0x5b0 [iw_cm]
[T79]        cma_iw_listen+0x271/0x3c0 [rdma_cm]
[T79]        rdma_listen+0x3b1/0x740 [rdma_cm]
[T79]        cma_listen_on_dev+0x46a/0x750 [rdma_cm]
[T79]        rdma_listen+0x4b0/0x740 [rdma_cm]
[T79]        ksmbd_rdma_init+0x12b/0x270 [ksmbd]
[T79]        ksmbd_conn_transport_init+0x26/0x70 [ksmbd]
[T79]        server_ctrl_handle_work+0x1e5/0x280 [ksmbd]
[T79]        process_one_work+0x86c/0x1930
[T79]        worker_thread+0x6f0/0x11f0
[T79]        kthread+0x3ec/0x8b0
[T79]        ret_from_fork+0x314/0x400
[T79]        ret_from_fork_asm+0x1a/0x30
[T79]
        other info that might help us debug this:

[T79]  Possible unsafe locking scenario:

[T79]        CPU0                    CPU1
[T79]        ----                    ----
[T79]   lock(lock#9);
[T79]                                lock(sk_lock-AF_INET);
[T79]                                lock(lock#9);
[T79]   lock(sk_lock-AF_INET);
[T79]
         *** DEADLOCK ***

[T79] 5 locks held by kworker/2:0/79:
[T79] #0: ffff88800120b158 ((wq_completion)events_long){+.+.}-{0:0},
                           at: process_one_work+0xfca/0x1930
[T79] #1: ffffc9000474fd00 ((work_completion)(&ctrl->ctrl_work))
                           {+.+.}-{0:0},
                           at: process_one_work+0x804/0x1930
[T79] #2: ffffffffc11307d0 (ctrl_lock){+.+.}-{4:4},
                           at: server_ctrl_handle_work+0x21/0x280 [ksmbd]
[T79] #3: ffffffffc11347b0 (init_lock){+.+.}-{4:4},
                           at: ksmbd_conn_transport_init+0x18/0x70 [ksmbd]
[T79] #4: ffffffffc10f7230 (lock#9){+.+.}-{4:4},
                            at: rdma_listen+0x3d2/0x740 [rdma_cm]
[T79]
        stack backtrace:
[T79] CPU: 2 UID: 0 PID: 79 Comm: kworker/2:0 Kdump: loaded
      Tainted: G           OE
      6.18.0-rc4-metze-kasan-lockdep.01+ #1 PREEMPT(voluntary)
[T79] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[T79] Hardware name: innotek GmbH VirtualBox/VirtualBox,
      BIOS VirtualBox 12/01/2006
[T79] Workqueue: events_long server_ctrl_handle_work [ksmbd]
...
[T79]  print_circular_bug+0xfd/0x130
[T79]  check_noncircular+0x150/0x170
[T79]  check_prev_add+0xf3/0xcd0
[T79]  validate_chain+0x466/0x590
[T79]  __lock_acquire+0x535/0xc30
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  lock_acquire.part.0+0xb3/0x240
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? __kasan_check_write+0x14/0x30
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? apparmor_socket_post_create+0x180/0x700
[T79]  lock_acquire+0x60/0x140
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  lock_sock_nested+0x3b/0xf0
[T79]  ? sock_set_reuseaddr+0x14/0x70
[T79]  sock_set_reuseaddr+0x14/0x70
[T79]  siw_create_listen+0x145/0x1540 [siw]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? local_clock_noinstr+0xe/0xd0
[T79]  ? __pfx_siw_create_listen+0x10/0x10 [siw]
[T79]  ? trace_preempt_on+0x4c/0x130
[T79]  ? __raw_spin_unlock_irqrestore+0x4a/0x90
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? preempt_count_sub+0x52/0x80
[T79]  iw_cm_listen+0x313/0x5b0 [iw_cm]
[T79]  cma_iw_listen+0x271/0x3c0 [rdma_cm]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  rdma_listen+0x3b1/0x740 [rdma_cm]
[T79]  ? _raw_spin_unlock+0x2c/0x60
[T79]  ? __pfx_rdma_listen+0x10/0x10 [rdma_cm]
[T79]  ? rdma_restrack_add+0x12c/0x630 [ib_core]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  cma_listen_on_dev+0x46a/0x750 [rdma_cm]
[T79]  rdma_listen+0x4b0/0x740 [rdma_cm]
[T79]  ? __pfx_rdma_listen+0x10/0x10 [rdma_cm]
[T79]  ? cma_get_port+0x30d/0x7d0 [rdma_cm]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? rdma_bind_addr_dst+0x598/0x9a0 [rdma_cm]
[T79]  ksmbd_rdma_init+0x12b/0x270 [ksmbd]
[T79]  ? __pfx_ksmbd_rdma_init+0x10/0x10 [ksmbd]
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? register_netdevice_notifier+0x1dc/0x240
[T79]  ksmbd_conn_transport_init+0x26/0x70 [ksmbd]
[T79]  server_ctrl_handle_work+0x1e5/0x280 [ksmbd]
[T79]  process_one_work+0x86c/0x1930
[T79]  ? __pfx_process_one_work+0x10/0x10
[T79]  ? srso_alias_return_thunk+0x5/0xfbef5
[T79]  ? assign_work+0x16f/0x280
[T79]  worker_thread+0x6f0/0x11f0

I was not able to reproduce this, as I was testing various runs
switching between siw and rxe as well as IPPROTO_SMBDIRECT sockets,
while the above stack used siw with the non-IPPROTO_SMBDIRECT
patches [1].

Even if this patch doesn't solve the above, I think it's a good idea
to reclassify the sockets used by siw. I am also sending patches for
rxe to reclassify its sockets, and my IPPROTO_SMBDIRECT socket
patches [1] will do the same; this should minimize potential false
positives.

[1]
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/master-ipproto-smbdirect

Cc: Bernard Metzler <bernard.metzler@linux.dev>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251126150842.1837072-1-metze@samba.org
Acked-by: Bernard Metzler <bernard.metzler@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-27 05:07:24 -05:00
Leon Romanovsky
155c9971fa RDMA/bng_re: Remove prefetch instruction
The prefetch instruction is meant to speed up access to memory referenced
by its address argument. In the bng_re code path, it has no effect,
because the pointer refers to a function that is executed immediately
afterward.

The issue was identified due to the following kbuild compilation error:

  drivers/infiniband/hw/bng_re/bng_fw.c: In function 'bng_re_creq_irq':
    drivers/infiniband/hw/bng_re/bng_fw.c:278:9: error: implicit
    declaration of function 'prefetch' [-Wimplicit-function-declaration]
      278 |     prefetch(bng_re_get_qe(hwq, sw_cons, NULL));
          |     ^~~~~~~~

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511260607.Kuxn4NnN-lkp@intel.com/
Fixes: 4f830cd8d7 ("RDMA/bng_re: Add infrastructure for enabling Firmware channel")
Link: https://patch.msgid.link/20251126-remove-prefetch-v1-1-fcac22007ea7@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-11-27 03:41:27 -05:00
Michael S. Tsirkin
350a840110 vhost/test: add test specific macro for features
The test just uses the vhost features with no change, but people tend
to copy/paste code, so let's add our own define.

Message-ID: <23ca04512a800ee8b3594482492e536020931340.1764225384.git.mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:07 -05:00
Michael S. Tsirkin
9513f25056 virtio: clean up features qword/dword terms
virtio pci uses word to mean "16 bits". mmio uses it to mean
"32 bits".

To avoid confusion, let's avoid the term in core virtio
altogether. Just say U64 to mean "64 bit".

Fixes: e7d4c1c5a5 ("virtio: introduce extended features")
Cc: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Message-ID: <ad53b7b6be87fc524f45abaeca0bb05fb3633397.1764225384.git.mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:07 -05:00
Marco Crivellari
2828c60b24 vduse: add WQ_PERCPU to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same
applies to schedule_work(), which uses system_wq, and to queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
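
For reference, a hedged sketch of the per-call-site change; "example_wq"
and the other flags are illustrative:

  /* before: per-cpu behavior is implicit */
  wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM, 0);

  /* after: per-cpu behavior is stated explicitly */
  wq = alloc_workqueue("example_wq", WQ_MEM_RECLAIM | WQ_PERCPU, 0);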

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251107154917.313090-3-marco.crivellari@suse.com>
2025-11-27 02:03:07 -05:00
Marco Crivellari
a8980af1bf virtio_balloon: add WQ_PERCPU to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (a per-cpu wq), while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same
applies to schedule_work(), which uses system_wq, and to queue_work(),
which again makes use of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251107154917.313090-2-marco.crivellari@suse.com>
2025-11-27 02:03:07 -05:00
Alok Tiwari
731ca4a4cc vdpa/pds: use %pe for ERR_PTR() in event handler registration
Use %pe instead of %ps when printing ERR_PTR() values. %ps is intended
for printing symbol names from pointers, while %pe correctly prints
symbolic error names for error pointers returned via ERR_PTR().
This shows the returned error value more clearly.
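
A hedged sketch of the difference; the device pointer and message text
are illustrative:

  void *handler = ERR_PTR(-ENOMEM);

  dev_err(dev, "register failed: %pe\n", handler);  /* prints "-ENOMEM" */
  dev_err(dev, "register failed: %ps\n", handler);  /* attempts a symbol lookup instead */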

Fixes: 67f27b8b3a ("pds_vdpa: subscribe to the pds_core events")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251018174705.1511982-1-alok.a.tiwari@oracle.com>
2025-11-27 02:03:07 -05:00
Mike Christie
f3f64c2eaf vhost: Fix kthread worker cgroup failure handling
If we fail to attach to a cgroup we are leaking the id. This adds
a new goto to free the id.

Fixes: 7d9896e9f6 ("vhost: Reintroduce kthread API and add mode selection")
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251101194358.13605-1-michael.christie@oracle.com>
2025-11-27 02:03:06 -05:00
Miaoqian Lin
b41ca62c00 virtio: vdpa: Fix reference count leak in octep_sriov_enable()
pci_get_device() will increase the reference count for the returned
pci_dev, and also decrease the reference count for the input parameter
'from' if it is not NULL.

If we break out of the loop in octep_sriov_enable() with 'vf_pdev' not
NULL, we need to call pci_dev_put() to decrease the reference count.

Found via static analysis and this is similar to commit c508eb042d
("perf/x86/intel/uncore: Fix reference count leak in sad_cfg_iio_topology()")

Fixes: 8b6c724cda ("virtio: vdpa: vDPA driver for Marvell OCTEON DPU devices")
Cc: stable@vger.kernel.org
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251027060737.33815-1-linmq006@gmail.com>
2025-11-27 02:03:06 -05:00
Alok Tiwari
f0ea2e9109 vdpa/mlx5: Fix incorrect error code reporting in query_virtqueues
When query_virtqueues() fails, the error log prints the variable err
instead of cmd->err. Since err may still be zero at this point, the
log message can misleadingly report a success value 0 even though the
command actually failed.

Even worse, once err is set to the first failure, subsequent logs
print that same stale value. This makes the error reporting appear
one step behind the actual failing queue index, which is confusing
and misleading.

Fix the log to report cmd->err, which reflects the real failure code
returned by the firmware.

Fixes: 1fcdf43ea6 ("vdpa/mlx5: Use async API for vq query command")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20250929134258.80956-1-alok.a.tiwari@oracle.com>
2025-11-27 02:03:06 -05:00
Michael S. Tsirkin
deb55fc994 virtio: fix map ops comment
@free will free the map handle, not sync it. Fix the doc to match.

Fixes: bee8c7c24b ("virtio: introduce map ops in virtio core")
Message-Id: <f6ff1c7aff8401900bf362007d7fb52dfdb6a15b.1763026134.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:06 -05:00
Michael S. Tsirkin
43236d8bba virtio: fix virtqueue_set_affinity() docs
Rewrite the comment for better grammar and clarity.

Fixes: 75a0a52be3 ("virtio: introduce an API to set affinity for a virtqueue")
Message-Id: <e317e91bd43b070e5eaec0ebbe60c5749d02e2dd.1763026134.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:06 -05:00
Michael S. Tsirkin
5e88a5a97d virtio: standardize Returns documentation style
Remove colons after "Returns" in virtio_map_ops function
documentation - both to avoid triggering an htmldoc warning
and for consistency with virtio_config_ops.

This affects map_page, alloc, need_sync, and max_mapping_size.

Fixes: bee8c7c24b ("virtio: introduce map ops in virtio core")
Message-Id: <c262893fa21f4b1265147ef864574a9bd173348f.1763026134.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:06 -05:00
Michael S. Tsirkin
c15f42e091 virtio: fix grammar in virtio_map_ops docs
Fix grammar issues in the virtio_map_ops docs:
- missing article before "transport"
- "implements" -> "implement" to match subject

Fixes: bee8c7c24b ("virtio: introduce map ops in virtio core")
Message-Id: <3f7bcae5a984f14b72e67e82572b110acb06fa7e.1763026134.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:05 -05:00
Michael S. Tsirkin
63598fba55 virtio: fix grammar in virtio_queue_info docs
Fix grammar in the description of @ctx

Fixes: c502eb85c3 ("virtio: introduce virtio_queue_info struct and find_vqs_info() config op")
Message-Id: <a5cf2b92573200bdb1c1927e559d3930d61a4af2.1763026134.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:05 -05:00
Michael S. Tsirkin
7831791e77 virtio: fix whitespace in virtio_config_ops
The finalize_features documentation uses a tab between words.
Use space instead.

Fixes: d16c0cd273 ("docs: driver-api: virtio: virtio on Linux")
Message-Id: <39d7685c82848dc6a876d175e33a1407f6ab3fc1.1763026134.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:05 -05:00
Michael S. Tsirkin
361173f95a virtio: fix typo in virtio_device_ready() comment
"coherenct" -> "coherent"

Fixes: 8b4ec69d7e ("virtio: harden vring IRQ")
Message-Id: <db286e9a65449347f6584e68c9960fd5ded2b4b0.1763026134.git.mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-11-27 02:03:05 -05:00
Kriish Sharma
f811300085 virtio: fix kernel-doc for mapping/free_coherent functions
Documentation build reported:

  WARNING: ./drivers/virtio/virtio_ring.c:3174 function parameter 'vaddr' not described in 'virtqueue_map_free_coherent'
  WARNING: ./drivers/virtio/virtio_ring.c:3308 expecting prototype for virtqueue_mapping_error(). Prototype was for virtqueue_map_mapping_error() instead

The kernel-doc block for virtqueue_map_free_coherent() omitted the @vaddr parameter, and
the kernel-doc header for virtqueue_map_mapping_error() used the wrong function name
(virtqueue_mapping_error) instead of the actual function name.

This change updates:

  - the function name in the comment to virtqueue_map_mapping_error()
  - adds the missing @vaddr description in the comment for virtqueue_map_free_coherent()

Fixes: b41cb3bcf6 ("virtio: rename dma helpers")
Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-Id: <20251110202920.2250244-1-kriish.sharma2006@gmail.com>
2025-11-27 02:03:05 -05:00
Alok Tiwari
e40b6abe0b virtio_vdpa: fix misleading return in void function
virtio_vdpa_set_status() is declared as returning void, but it used
"return vdpa_set_status()". Since vdpa_set_status() also returns
void, the return statement is unnecessary and misleading.
Remove it.

Fixes: c043b4a8cf ("virtio: introduce a vDPA based transport")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Message-Id: <20251001191653.1713923-1-alok.a.tiwari@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
2025-11-27 02:03:05 -05:00
Jason Gunthorpe
5185c4d8a5 Merge branch 'iommufd_dmabuf' into k.o-iommufd/for-next
Jason Gunthorpe says:

====================
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to only work with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:

   https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com/

The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fds, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and, if the underlying alignment
requirements are met, it will be placed in the iommu_domain.
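
For reference, a hedged userspace sketch of mapping a DMABUF slice this
way; the struct layout shown is an assumption, not copied from the uAPI
header:

  struct iommu_ioas_map_file map = {
          .size = sizeof(map),
          .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE |
                   IOMMU_IOAS_MAP_FIXED_IOVA,
          .ioas_id = ioas_id,
          .fd = dmabuf_fd,        /* a DMABUF fd is now accepted here */
          .start = 0,             /* slice offset within the fd */
          .length = bar_size,
          .iova = iova,
  };

  ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);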

Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.

The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.

Instead iommufd relies on revokable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd
which will unmap it from the iommu_domain. There is no automatic remap,
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. Eg when QEMU goes to issue FLR it should
do the map/unmap to iommufd.

Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.

The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().

Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().

I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.

The latest series for interconnect negotiation to exchange a phys_addr is:
 https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com

And the discussion for design of revoke is here:
 https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
====================

Based on a shared branch with vfio.

* iommufd_dmabuf:
  iommufd/selftest: Add some tests for the dmabuf flow
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Allow MMIO pages in a batch
  iommufd: Allow a DMABUF to be revoked
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Add DMABUF to iopt_pages
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
  vfio/nvgrace: Support get_dmabuf_phys
  vfio/pci: Add dma-buf export support for MMIO regions
  vfio/pci: Enable peer-to-peer DMA transactions by default
  vfio/pci: Share the core device pointer while invoking feature functions
  vfio: Export vfio device get and put registration helpers
  dma-buf: provide phys_vec to scatter-gather mapping routine
  PCI/P2PDMA: Document DMABUF model
  PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
  PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
  PCI/P2PDMA: Simplify bus address mapping API
  PCI/P2PDMA: Separate the mmap() support from the core logic

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-26 14:04:10 -04:00
Nicolin Chen
81c45c62dc iommu/arm-smmu-v3-iommufd: Allow attaching nested domain for GBPA cases
A vDEVICE has been a hard requirement for attaching a nested domain to the
device. This makes sense when installing a guest STE, since a vSID must be
present and given to the kernel during the vDEVICE allocation.

But, when CR0.SMMUEN is disabled, the VM doesn't really need a vSID to
program the vSMMU behavior, as GBPA will take effect, in which case the
vSTE in the nested domain could have carried the bypass or abort
configuration in the GBPA register. Thus, having such a hard requirement
doesn't work well for GBPA.

Skip vmaster allocation in arm_smmu_attach_prepare_vmaster() for an abort
or bypass vSTE. Note that a device on this attachment won't report vevents.

Update the uAPI doc accordingly.

Link: https://patch.msgid.link/r/20251103172755.2026145-1-nicolinc@nvidia.com
Tested-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-26 14:04:04 -04:00
Li RongQing
f37e286879 RDMA/core: Reduce cond_resched() frequency in __ib_umem_release
The current implementation calls cond_resched() for every SG entry
in __ib_umem_release(), which adds needless overhead.

This patch introduces RESCHED_LOOP_CNT_THRESHOLD (0x1000) to limit
how often cond_resched() is called. The function now yields the CPU
once every 4096 iterations, and also yields at the very first iteration
to cover the case of many small umems, to reduce scheduling overhead.
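
A rough sketch of the throttling pattern; the loop body is illustrative
and not the actual __ib_umem_release() code, while the threshold name
comes from the description above:

  #define RESCHED_LOOP_CNT_THRESHOLD 0x1000

  for (i = 0; i < nents; i++) {
          release_one_entry(i);                      /* placeholder for the real work */
          if (i % RESCHED_LOOP_CNT_THRESHOLD == 0)   /* fires at i == 0, then every 4096 entries */
                  cond_resched();
  }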

Fixes: d056bc45b6 ("RDMA/core: Prevent soft lockup during large user memory region cleanup")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251126025147.2627-1-lirongqing@baidu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 03:15:36 -05:00
Jijun Wang
01dad9ca37 RDMA/irdma: Fix SRQ shadow area address initialization
Fix SRQ shadow area address initialization.

Fixes: 563e1feb5f ("RDMA/irdma: Add SRQ support")
Signed-off-by: Jijun Wang <jijun.wang@intel.com>
Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Link: https://patch.msgid.link/20251125025350.180-10-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Jacob Moroni
62356fccb1 RDMA/irdma: Remove doorbell elision logic
In some cases, this logic can result in doorbell writes being
skipped when they should not have been (at least on GEN3 HW),
so remove it. This also means that the mb() can be safely
downgraded to dma_wmb().

Fixes: 551c46edc7 ("RDMA/irdma: Add user/kernel shared libraries")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-9-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Jacob Moroni
eef3ad030b RDMA/irdma: Do not set IBK_LOCAL_DMA_LKEY for GEN3+
The GEN3 hardware does not appear to support IBK_LOCAL_DMA_LKEY. Attempts
to use it will result in an AE.

Fixes: eb31dfc2b4 ("RDMA/irdma: Restrict Memory Window and CQE Timestamping to GEN3")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-8-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Jacob Moroni
71d3bdae5e RDMA/irdma: Do not directly rely on IB_PD_UNSAFE_GLOBAL_RKEY
The HW disables bounds checking for MRs with a length of zero, so
the driver will only allow a zero length MR if the "all_memory"
flag is set, and this flag is only set if IB_PD_UNSAFE_GLOBAL_RKEY
is set for the PD.

This means that the "get_dma_mr" method will currently fail unless
the IB_PD_UNSAFE_GLOBAL_RKEY flag is set. This has not been an issue
because the "get_dma_mr" method is only ever invoked if the device
does not support the local DMA key or if IB_PD_UNSAFE_GLOBAL_RKEY
is set, and so far, all IRDMA HW supports the local DMA lkey.

However, some new HW does not support the local DMA lkey, so the
"get_dma_mr" method needs to work without IB_PD_UNSAFE_GLOBAL_RKEY
being set.

To support HW that does not allow the local DMA lkey, the logic has
been changed to pass an explicit flag to indicate when a dma_mr is
being created so that the zero length will be allowed.

Also, the "all_memory" flag has been forced to false for normal MR
allocation since these MRs are never supposed to provide global
unsafe rkey semantics anyway; only the MR created with "get_dma_mr"
should support this.

Fixes: bb6d73d9ad ("RDMA/irdma: Prevent zero-length STAG registration")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-7-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Anil Samal
35bd787bab RDMA/irdma: Add missing mutex destroy
Add missing destroy of ah_tbl_lock and vchnl_mutex.

Fixes: d5edd33364 ("RDMA/irdma: RDMA/irdma: Add GEN3 core driver support")
Signed-off-by: Anil Samal <anil.samal@intel.com>
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-6-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Krzysztof Czurylo
5eff1ecce3 RDMA/irdma: Fix SIGBUS in AEQ destroy
Removes write to IRDMA_PFINT_AEQCTL register prior to destroying AEQ,
as this register does not exist in GEN3+ hardware and this kind of IRQ
configuration is no longer required.

Fixes: b800e82feb ("RDMA/irdma: Add GEN3 support for AEQ and CEQ")
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-5-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Tatyana Nikolova
9e13d880eb RDMA/irdma: Add a missing kfree of struct irdma_pci_f for GEN2
During a refactor of the irdma GEN2 code, the kfree of the irdma_pci_f struct
in icrdma_remove(), which was originally introduced upstream as part of
commit 80f2ab46c2 ("irdma: free iwdev->rf after removing MSI-X")
was accidentally removed.

Fixes: 0c2b80cac9 ("RDMA/irdma: Refactor GEN2 auxiliary driver")
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-4-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Krzysztof Czurylo
81f44409fb RDMA/irdma: Fix data race in irdma_free_pble
Protects pble_rsrc counters with mutex to prevent data race.
Fixes the following data race in irdma_free_pble reported by KCSAN:

BUG: KCSAN: data-race in irdma_free_pble [irdma] / irdma_free_pble [irdma]

write to 0xffff91430baa0078 of 8 bytes by task 16956 on cpu 5:
 irdma_free_pble+0x3b/0xb0 [irdma]
 irdma_dereg_mr+0x108/0x110 [irdma]
 ib_dereg_mr_user+0x74/0x160 [ib_core]
 uverbs_free_mr+0x26/0x30 [ib_uverbs]
 destroy_hw_idr_uobject+0x4a/0x90 [ib_uverbs]
 uverbs_destroy_uobject+0x7b/0x330 [ib_uverbs]
 uobj_destroy+0x61/0xb0 [ib_uverbs]
 ib_uverbs_run_method+0x1f2/0x380 [ib_uverbs]
 ib_uverbs_cmd_verbs+0x365/0x440 [ib_uverbs]
 ib_uverbs_ioctl+0x111/0x190 [ib_uverbs]
 __x64_sys_ioctl+0xc9/0x100
 do_syscall_64+0x44/0xa0
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

read to 0xffff91430baa0078 of 8 bytes by task 16953 on cpu 2:
 irdma_free_pble+0x23/0xb0 [irdma]
 irdma_dereg_mr+0x108/0x110 [irdma]
 ib_dereg_mr_user+0x74/0x160 [ib_core]
 uverbs_free_mr+0x26/0x30 [ib_uverbs]
 destroy_hw_idr_uobject+0x4a/0x90 [ib_uverbs]
 uverbs_destroy_uobject+0x7b/0x330 [ib_uverbs]
 uobj_destroy+0x61/0xb0 [ib_uverbs]
 ib_uverbs_run_method+0x1f2/0x380 [ib_uverbs]
 ib_uverbs_cmd_verbs+0x365/0x440 [ib_uverbs]
 ib_uverbs_ioctl+0x111/0x190 [ib_uverbs]
 __x64_sys_ioctl+0xc9/0x100
 do_syscall_64+0x44/0xa0
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8

value changed: 0x0000000000005a62 -> 0x0000000000005a68
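
The shape of the fix, as a hedged sketch with illustrative field names
(the real pble_rsrc layout differs): the counters are only updated while
holding a mutex shared by all callers of irdma_free_pble().

  /* Hedged sketch of the locking pattern, not the actual driver code. */
  struct pble_stats_sketch {
          struct mutex lock;      /* serializes counter updates */
          u64 freed_cnt;
          u64 avail_pbles;
  };

  static void free_pble_sketch(struct pble_stats_sketch *st, u64 cnt)
  {
          mutex_lock(&st->lock);
          st->freed_cnt++;        /* previously a plain, racy RMW */
          st->avail_pbles += cnt;
          mutex_unlock(&st->lock);
  }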

Fixes: e8c4dbc2fc ("RDMA/irdma: Add PBLE resource manager")
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-3-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Krzysztof Czurylo
a521928164 RDMA/irdma: Fix data race in irdma_sc_ccq_arm
Adds a lock around irdma_sc_ccq_arm body to prevent inter-thread data race.
Fixes data race in irdma_sc_ccq_arm() reported by KCSAN:

BUG: KCSAN: data-race in irdma_sc_ccq_arm [irdma] / irdma_sc_ccq_arm [irdma]

read to 0xffff9d51b4034220 of 8 bytes by task 255 on cpu 11:
 irdma_sc_ccq_arm+0x36/0xd0 [irdma]
 irdma_cqp_ce_handler+0x300/0x310 [irdma]
 cqp_compl_worker+0x2a/0x40 [irdma]
 process_one_work+0x402/0x7e0
 worker_thread+0xb3/0x6d0
 kthread+0x178/0x1a0
 ret_from_fork+0x2c/0x50

write to 0xffff9d51b4034220 of 8 bytes by task 89 on cpu 3:
 irdma_sc_ccq_arm+0x7e/0xd0 [irdma]
 irdma_cqp_ce_handler+0x300/0x310 [irdma]
 irdma_wait_event+0xd4/0x3e0 [irdma]
 irdma_handle_cqp_op+0xa5/0x220 [irdma]
 irdma_hw_flush_wqes+0xb1/0x300 [irdma]
 irdma_flush_wqes+0x22e/0x3a0 [irdma]
 irdma_cm_disconn_true+0x4c7/0x5d0 [irdma]
 irdma_disconnect_worker+0x35/0x50 [irdma]
 process_one_work+0x402/0x7e0
 worker_thread+0xb3/0x6d0
 kthread+0x178/0x1a0
 ret_from_fork+0x2c/0x50

value changed: 0x0000000000024000 -> 0x0000000000034000

Fixes: 3f49d68425 ("RDMA/irdma: Implement HW Admin Queue OPs")
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251125025350.180-2-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-26 02:26:05 -05:00
Stephan Gerhold
5583a55e07 iommu/arm-smmu-qcom: Enable use of all SMR groups when running bare-metal
Some platforms (e.g. SC8280XP and X1E) support more than 128 stream
matching groups. This is more than what is defined as maximum by the ARM
SMMU architecture specification. Commit 1226113473 ("iommu/arm-smmu-qcom:
Limit the SMR groups to 128") disabled use of the additional groups because
they don't exhibit the same behavior as the architecture supported ones.

It seems like this is just another quirk of the hypervisor: When running
bare-metal without the hypervisor, the additional groups appear to behave
just like all others. The boot firmware uses some of the additional groups,
so ignoring them in this situation leads to stream match conflicts whenever
we allocate a new SMR group for the same SID.

The workaround exists primarily because the bypass quirk detection fails
when using a S2CR register from the additional matching groups, so let's
perform the test with the last reliable S2CR (127) and then limit the
number of SMR groups only if we detect that we are running below the
hypervisor (because of the bypass quirk).
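
Roughly, with an illustrative helper name for the quirk probe, the new
flow looks like this (a hedged sketch, not the driver code as merged):

  #define ARM_SMMU_ARCH_MAX_SMRS 128

  static void qcom_smmu_limit_smrs_sketch(struct arm_smmu_device *smmu)
  {
          /* Probe the bypass quirk on the last reliable S2CR (127). */
          bool below_hyp = qcom_bypass_quirk_present(smmu,
                                          ARM_SMMU_ARCH_MAX_SMRS - 1);

          /* Only limit the groups when running below the hypervisor. */
          if (below_hyp && smmu->num_mapping_groups > ARM_SMMU_ARCH_MAX_SMRS)
                  smmu->num_mapping_groups = ARM_SMMU_ARCH_MAX_SMRS;
  }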

Fixes: 1226113473 ("iommu/arm-smmu-qcom: Limit the SMR groups to 128")
Signed-off-by: Stephan Gerhold <stephan.gerhold@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-25 16:22:27 +00:00
Jason Gunthorpe
d2041f1f11 iommufd/selftest: Add some tests for the dmabuf flow
Basic tests of establishing a dmabuf and revoking it. The selftest kernel
side provides a basic small dmabuf for this testing.

Link: https://patch.msgid.link/r/9-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:16 -04:00
Jason Gunthorpe
44ebaa1744 iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
Finally call iopt_alloc_dmabuf_pages() if the user passed in a DMABUF
through IOMMU_IOAS_MAP_FILE. This makes the feature visible to userspace.
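
A rough userspace sketch of what this enables; the struct layout and
flags should be taken from the iommufd uAPI header (<linux/iommufd.h>),
as the fields below are written from memory:

  #include <sys/ioctl.h>
  #include <linux/iommufd.h>

  /* Hedged sketch: map a (VFIO) DMABUF fd into an IOAS. */
  static int ioas_map_dmabuf(int iommufd, __u32 ioas_id, int dmabuf_fd,
                             __u64 length, __u64 *iova)
  {
          struct iommu_ioas_map_file cmd = {
                  .size = sizeof(cmd),
                  .flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
                  .ioas_id = ioas_id,
                  .fd = dmabuf_fd,        /* may now be a DMABUF fd */
                  .start = 0,
                  .length = length,
          };
          int rc = ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &cmd);

          if (!rc)
                  *iova = cmd.iova;       /* kernel-chosen IOVA */
          return rc;
  }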

Link: https://patch.msgid.link/r/8-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:16 -04:00
Jason Gunthorpe
217725f0b2 iommufd: Have iopt_map_file_pages convert the fd to a file
Since dmabuf only has APIs that work on an int fd and not a struct file *,
pass the fd deeper into the call chain so we can use the dmabuf APIs as
is.

Link: https://patch.msgid.link/r/7-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:16 -04:00
Jason Gunthorpe
74014a4b55 iommufd: Have pfn_reader process DMABUF iopt_pages
Make another sub implementation of pfn_reader for DMABUF. This version
will fill the batch using the struct phys_vec recorded during the
attachment.

Link: https://patch.msgid.link/r/6-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:16 -04:00
Jason Gunthorpe
3114c67440 iommufd: Allow MMIO pages in a batch
Addresses intended for MMIO should be propagated through to the iommu with
the IOMMU_MMIO flag set.

Keep track in the batch of whether all the pfns are cachable or MMIO and
flush the batch out if that ever needs to be changed. Switch to IOMMU_MMIO
when mapping the iommu if the batch is MMIO.
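
In rough terms (a hedged sketch, with the batch-wide MMIO property passed
as a plain bool rather than through the real batch structure):

  /* Hedged sketch of the prot selection, not the iommufd code itself. */
  static int batch_map_sketch(struct iommu_domain *domain,
                              unsigned long iova, phys_addr_t paddr,
                              size_t len, bool batch_is_mmio)
  {
          int prot = IOMMU_READ | IOMMU_WRITE;

          /*
           * The batch is flushed before its cachable/MMIO property can
           * change, so a single flag applies to every pfn in it.
           */
          if (batch_is_mmio)
                  prot |= IOMMU_MMIO;

          return iommu_map(domain, iova, paddr, len, prot, GFP_KERNEL);
  }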

Link: https://patch.msgid.link/r/5-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:15 -04:00
Jason Gunthorpe
fc7063abd9 iommufd: Allow a DMABUF to be revoked
When connected to VFIO, the only DMABUF exporter that is accepted, the
move_notify callback will be made when VFIO wants to remove access to the
MMIO. This is called revoke.

Wire up revoke to go through all the iommu_domain's that have mapped the
DMABUF and unmap them.

The locking here is unpleasant: since the existing locking scheme was
designed to come from the iopt through the area to the pages, we cannot
use the pages as the starting point for the locking. There is no way to
obtain the domains_rwsem before obtaining the pages mutex to reliably use
the existing domains_itree.

Solve this problem by adding a new tracking structure just for DMABUF
revoke. Record a linked list of areas and domains inside the pages
mutex. Clean the entries on the list during revoke. The map/unmaps are now
all done under a pages mutex while updating the tracking linked list so
nothing can get out of sync. Only one lock is required for revoke
processing.
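
Conceptually (illustrative names only, not the actual iommufd types), the
new tracking is a list of (area, domain) records maintained under the
pages mutex and walked at revoke time:

  /* Hedged sketch of the revoke bookkeeping. */
  struct dmabuf_map_record {
          struct list_head entry;         /* linked off the pages object */
          struct iopt_area *area;
          struct iommu_domain *domain;
  };

  /* Called from the move_notify path with the pages mutex held. */
  static void revoke_all_sketch(struct list_head *records)
  {
          struct dmabuf_map_record *rec, *tmp;

          list_for_each_entry_safe(rec, tmp, records, entry) {
                  /* ... unmap rec->area's range from rec->domain ... */
                  list_del(&rec->entry);
                  kfree(rec);
          }
  }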

Link: https://patch.msgid.link/r/4-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:15 -04:00
Jason Gunthorpe
71e2409a0c iommufd: Do not map/unmap revoked DMABUFs
Once a DMABUF is revoked the domain will be unmapped under the pages
mutex. Double unmapping will trigger a WARN, and mapping while revoked
will fail.

Check for revoked DMABUFs along all the map and unmap paths to resolve
this. Ensure that map/unmap is always done under the pages mutex so it is
synchronized with the revoke notifier.

If a revoke happens between allocating the iopt_pages and the population
to a domain then the population will succeed, and leave things unmapped as
though revoke had happened immediately after.

Currently there is no way to repopulate the domains. Userspace is expected
to know if it is going to do something that would trigger revoke (eg if it
is about to do an FLR); it should remove the DMABUF mappings before and
put them back after. The revoke is only there to protect the kernel from
misbehaving userspace.

Link: https://patch.msgid.link/r/3-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:15 -04:00
Jason Gunthorpe
71db84a092 iommufd: Add DMABUF to iopt_pages
Add IOPT_ADDRESS_DMABUF to the iopt_pages and the basic infrastructure to
create an iopt_pages from a struct dma_buf *.

DMABUF pages are not supported for accesses, and for now can only be used
with the VFIO DMABUF exporter.

The overall flow will be similar to memfd where the user can pass in a
DMABUF file descriptor to IOMMU_IOAS_MAP_FILE and create an area and
pages. Like other areas it can be copied and otherwise manipulated, though
there is little point in doing so.

There is no pinned page accounting done for DMABUF maps.

The DMABUF attachment exists so long as the dmabuf is mapped into an IOAS,
even if the IOAS is not mapped to any domains.

Link: https://patch.msgid.link/r/2-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:15 -04:00
Jason Gunthorpe
96ce2aeb15 vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
This function is used to establish the "private interconnect" between the
VFIO DMABUF exporter and the iommufd DMABUF importer. This is intended to
be a temporary API until the core DMABUF interface is improved to natively
support a private interconnect and revocable negotiation.

This function should only be called by iommufd when trying to map a
DMABUF. For now iommufd will only support VFIO DMABUFs.

The following improvements are needed in the DMABUF API to generically
support more exporters with iommufd/kvm type importers that cannot use the
DMA API:

 1) Revoke semantics. VFIO needs to be able to prevent access to the MMIO
    during FLR, and so it will use dma_buf_move_notify() to prevent
    access. iommufd does not support fault handling so it cannot
    implement the full move_notify. Instead, if revoke is negotiated the
    exporter promises not to use move_notify() unless the importer can
    experience failures. iommufd will unmap the dmabuf from the iommu page
    tables while it is revoked.

 2) Private interconnect negotiation. iommufd will only be able to map
    a "private interconnect" that provides a phys_addr_t and a
    struct p2pdma_provider * to describe the memory. It cannot use a DMA
    mapped scatterlist since it is directly calling iommu_map().

 3) NULL device during dma_buf_dynamic_attach(). Since iommufd doesn't use
    the DMA API it doesn't have a DMAable struct device to pass here.

Link: https://patch.msgid.link/r/1-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25 11:30:15 -04:00
Jason Gunthorpe
152c862c17 iommupt: Fix unlikely flows in increase_top()
Since increase_top() does its own READ_ONCE() on top_of_table, the
caller's prior READ_ONCE() could be inconsistent, and the first time
through the loop we may actually already have the right level if two
threads are racing to map.

In this case new_level will be left uninitialized.

Further, all the exits from the loop have to either commit to the new top
or free any memory allocated, so the early return must become a goto
err_free.

Make it so the only break from the loop always sets new_level to the right
value and all other exits go to err_free. Use pts.level (the pts
represents the top we are stacking) within the loop instead of new_level.

Fixes: dcd6a011a8 ("iommupt: Add map_pages op")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/aRwgNW9PiW2j-Qwo@stanley.mountain
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-25 15:11:49 +01:00
Jinhui Guo
2381a1b40b iommu/amd: Propagate the error code returned by __modify_irte_ga() in modify_irte_ga()
The return type of __modify_irte_ga() is int, but modify_irte_ga()
treats it as a bool. Casting the int to bool discards the error code.

To fix the issue, change the type of ret to int in modify_irte_ga().
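
The bug class in miniature (a hedged illustration, not the AMD IOMMU code
itself): storing a negative errno in a bool collapses it to 'true', so the
caller ends up returning 1 instead of the error code.

  static int backend_op(void)
  {
          return -ENOMEM;                 /* a real error code */
  }

  static int caller_buggy(void)
  {
          bool ret = backend_op();        /* -ENOMEM becomes 'true' */

          return ret;                     /* caller sees 1, not -ENOMEM */
  }

  static int caller_fixed(void)
  {
          int ret = backend_op();         /* error code preserved */

          return ret;
  }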

Fixes: 57cdb720ea ("iommu/amd: Do not flush IRTE when only updating isRun and destination fields")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-25 15:10:23 +01:00
Jean-Philippe Brucker
1e8b6eb141 MAINTAINERS: Update my email address
Use jpb@kernel.org as my main address.

Signed-off-by: Jean-Philippe Brucker <jpb@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-25 15:07:46 +01:00
Ryan Huang
5941f0e0c1 iommu/arm-smmu-v3: Fix error check in arm_smmu_alloc_cd_tables
In arm_smmu_alloc_cd_tables(), the error check following the
dma_alloc_coherent() for cd_table->l2.l1tab incorrectly tests
cd_table->l2.l2ptrs.

This means an allocation failure for l1tab goes undetected, causing
the function to return 0 (success) erroneously.

Correct the check to test cd_table->l2.l1tab.
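
The pattern of the fix, shown with generic names (a hedged sketch, not
the SMMUv3 code as merged): the pointer tested must be the one the
dma_alloc_coherent() call just produced.

  static int alloc_l1_sketch(struct device *dev, size_t size,
                             void **l1tab, dma_addr_t *l1tab_dma)
  {
          *l1tab = dma_alloc_coherent(dev, size, l1tab_dma, GFP_KERNEL);
          if (!*l1tab)    /* previously the check looked at l2ptrs */
                  return -ENOMEM;
          return 0;
  }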

Fixes: e3b1be2e73 ("iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg")
Signed-off-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: Ryan Huang <tzukui@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-24 16:56:31 +00:00
Konrad Dybcio
fe6262910c dt-bindings: iommu: qcom_iommu: Allow 'tbu' clock
Some IOMMUs on some platforms (there doesn't seem to be a clear common
denominator for this) require the presence of a third clock, specifically
relating
to the instance's Translation Buffer Unit (TBU).

Stephan Gerhold noted [1] the Qualcomm Snapdragon 410E Processor
(APQ8016E) Technical Reference Manual, SMMU chapter, section
"8.8.3.1.2 Clock gating", which reads:

For APPS TCU/TBU (TBU to TCU interface is asynchronous)
Software should turn ON clock to APPS TCU
  - During APPS TCU register programming sequence

For GPU TCU/TBU (TBU to TCU interface is synchronous)
Software should turn ON clock to GPU TBU
  - During GPU TLB invalidation sequence <=====================
Software should turn ON clock to GPU TCU
  - During GPU TCU register programming sequence
  - While GPU master clock is Active

The clock should be turned on at least during TLB invalidation on the
GPU SMMU instance. This is corroborated by Commit 5bc1cf1466
("iommu/qcom: add optional 'tbu' clock for TLB invalidate").

This is also not to be confused with qcom,sdm845-tbu, which is a
description of a debug interface, absent on the generation of hardware
that this binding describes.

Allow this clock.

[1] https://lore.kernel.org/linux-arm-msm/aPX_cKtial56AgvU@linaro.org/

Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Konrad Dybcio <konrad.dybcio@linaro.org>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-24 16:55:18 +00:00
Maher Sanalla
4022c7b634 RDMA/mlx5: Add support for 1600_8x lane speed
Add a check for 1600G_8X link speed when querying PTYS and report it
back correctly when needed.

While at it, adjust mlx5 function which maps the speed rate from IB
spec values to internal driver values to be able to handle speeds
up to 1600Gbps.

Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/20251120-speed-8-v1-2-e6a7efef8cb8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Maher Sanalla
0f1f9b5e47 RDMA/core: Add new IB rate for XDR (8x) support
Add the new rates as defined in the Infiniband spec for XDR and 8x
link width support.

Furthermore, modify the utility conversion methods accordingly.

Reference: IB Spec Release 1.8

Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/20251120-speed-8-v1-1-e6a7efef8cb8@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Yishai Hadas
6dbd547ada IB/mlx5: Reduce IMR KSM size when 5-level paging is enabled
Enabling 5-level paging (LA57) increases TASK_SIZE on x86_64 from 2^47
to 2^56. This affects implicit ODP, which uses TASK_SIZE to calculate
the number of IMR KSM entries.

As a result, the number of entries and the memory usage for KSM mkeys
increase drastically:

- With 2^47 TASK_SIZE: 0x20000 entries (~2MB)
- With 2^56 TASK_SIZE: 0x4000000 entries (~1GB)

This issue could happen previously on systems with LA57 manually
enabled, but now commit 7212b58d6d ("x86/mm/64: Make 5-level paging
support unconditional") enables LA57 by default on all supported
systems. This makes the impact of the issue widespread.

To mitigate this, increase the size each MTT entry maps from 1GB to 16GB
when 5-level paging is enabled. This reduces the number of KSM entries
and lowers the memory usage on LA57 systems from 1GB to 64MB per IMR.
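
A back-of-the-envelope check of the numbers above (a hedged sketch; the
16-byte entry size is inferred from the quoted totals):

  #define KSM_ENTRY_BYTES 16ULL

  static u64 imr_ksm_bytes(unsigned int task_size_shift,
                           unsigned int entry_map_shift)
  {
          u64 nents = 1ULL << (task_size_shift - entry_map_shift);

          /*
           * 2^47 VA, 1GB entries:  0x20000   entries -> ~2MB of KSM
           * 2^56 VA, 1GB entries:  0x4000000 entries -> ~1GB
           * 2^56 VA, 16GB entries: 0x400000  entries -> ~64MB
           */
          return nents * KSM_ENTRY_BYTES;
  }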

Since 'mlx5_imr_mtt_size' is now larger than 32 bits, use u64 instead of
int in populate_klm() to prevent overflow of the 'step' variable.

In addition, as populate_klm() actually handles KSM and not KLM (it is
used only by implicit ODP), rename its signature and the internal
structures accordingly while dropping the byte_count handling, which is
not relevant for KSM. The page size in KSM is fixed for all the entries
and comes from the log_page_size of the mkey.

Note:
On platforms where the calculated value for 'mlx5_imr_ksm_page_shift' is
higher than the max firmware cap that can be changed over UMR, or where
the calculated value for 'log_va_pages' is higher than what we may expect,
the implicit ODP cap will simply be turned off.

Co-developed-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251120-reduce-ksm-v1-1-6864bfc814dc@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Selvin Xavier
a26c4c7cdb RDMA/bnxt_re: Pass correct flag for dma mr creation
A DMA MR doesn't use the unified MR model, so the lkey passed
on to the reg_mr command to FW should contain the correct
lkey. The driver is incorrectly overwriting the lkey with the pdid
and the firmware command fails due to this.

Avoid passing the wrong key for cases where the unified MR
registration is not used.

Fixes: f786eebbbe ("RDMA/bnxt_re: Avoid an extra hwrm per MR creation")
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1763624215-10382-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Selvin Xavier
6afe40ff48 RDMA/bnxt_re: Fix the inline size for GenP7 devices
The inline size supported by the device is based on the number
of SGEs supported by the adapter. Change the inline
size calculation accordingly.

Fixes: de1d364c38 ("RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters")
Reviewed-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://patch.msgid.link/1763624215-10382-1-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Junxian Huang
d70f30cef2 RDMA/hns: Support reset recovery for bond
Re-set bond configuration to HW after HW reset.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-9-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Junxian Huang
e72d274f8f RDMA/hns: Support link state reporting for bond
The link state of bond depends on the upper device. Adapt current
link state querying flow and ib_event dispatching flow to report
correct link state of bond.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-8-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Junxian Huang
5d91677bbb RDMA/hns: Add delayed work for bonding
When conditions are met, schedule a delayed work in bond event handler
to perform bonding operation according to the bond state. In the case
of changing slave number or link state, re-set the netdev for the bond
ibdev after the modification is complete, since these two operations
may not call hns_roce_set_bond_netdev() in hns_roce_init().

The delayed work will be paused when there is a driver reset or exit
to avoid concurrency.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Junxian Huang
d9023e461b RDMA/hns: Implement bonding init/uninit process
Implement hns_roce_slave_init() and hns_roce_slave_uninit() for device
init/uninit in bonding cases. The former is used to initialize a slave
ibdev (when the slave is unlinked from a bond) or a bond ibdev, while
the latter does the opposite. Most of the process is the same as
regular device init/uninit, while some bonding‑specific steps below are
also added.

In bond device init flow, choose one slave to re-initialize as the
main_hr_dev of the bond, and it will be the only device presented for
multiple slaves. During registration, set an active netdev for the ibdev
ibdev based on the link state of the slaves. When this main_hr_dev
slave is being unlinked while the bond is still valid, choose a new
slave from the rest and initialize it as the new bond device.

In uninit flow, add a bond cleanup process, restore all the other
slaves and clean up bond resource. This is only for the case where
the port of main_hr_dev is directly removed without unlinking it
from bond.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:30 -05:00
Junxian Huang
14f0455e4a RDMA/hns: Add bonding cmds
Add three bonding cmds to configure bonding settings to HW.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Junxian Huang
d31d410b38 RDMA/hns: Add bonding event handler
Register netdev notifier for two bonding events NETDEV_CHANGEUPPER
and NETDEV_CHANGELOWERSTATE.

In the NETDEV_CHANGEUPPER event handler, check some rules about the HW
constraints when trying to link a new slave to the master, and
store some bonding information from the notifier. In unlinking case,
simply check the number of the rest slaves to decide whether the
bond is still supported.

In NETDEV_CHANGELOWERSTATE event handler, not much is done. It
simply sets the bond state when the bond is ready, which will be
used later.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Junxian Huang
b37ad2e290 RDMA/hns: Initialize bonding resources
Allocate bond_grp resources for each card when the first device in
this card is registered. Block the initialization of VF when its PF
is a bonded slave, as VF is not supported in this case due to HW
constraints.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Junxian Huang
cdb3a6f183 RDMA/hns: Add helpers to obtain netdev and bus_num from hr_dev
Add helpers to obtain netdev and bus_num from hr_dev.

Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://patch.msgid.link/20251112093510.3696363-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
04e031ff6e RDMA/bng_re: Initialize the Firmware and Hardware
Initialize the firmware and hardware with HWRM command.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-9-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
99e4e10283 RDMA/bng_re: Add basic debugfs infrastructure
Add basic debugfs infrastructure for Broadcom next generation
controller.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-8-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
53c6ee7d7f RDMA/bng_re: Enable Firmware channel and query device attributes
Enable Firmware channel and query device attributes

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-7-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
4f830cd8d7 RDMA/bng_re: Add infrastructure for enabling Firmware channel
Add infrastructure for enabling Firmware channel.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-6-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
53310b698f RDMA/bng_re: Allocate required memory resources for Firmware channel
Allocate required memory resources for Firmware channel.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-5-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
745065770c RDMA/bng_re: Register and get the resources from bnge driver
Register and get the basic required resources from bnge driver.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-4-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Siva Reddy Kallam
d0da769c19 RDMA/bng_re: Add Auxiliary interface
Add basic Auxiliary interface to the driver which supports
the BCM5770X NIC family.

Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-3-siva.kallam@broadcom.com
Reviewed-by: Usman Ansari <usman.ansari@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-24 02:58:29 -05:00
Alex Williamson
fa804aa4ac Merge tag 'vfio-v6.19-dma-buf-v9+' into v6.19/vfio/next
[v9] vfio/pci: Allow MMIO regions to be exported through dma-buf

https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com

Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 21:20:00 -07:00
Jason Gunthorpe
5415d887db vfio/nvgrace: Support get_dmabuf_phys
Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the
synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on
the PCI bar.

This demonstrates a DMABUF that follows the region info report to only
allow mapping parts of the region that are mmapable. Since the BAR is
power-of-two sized and the "CXL" region is just page aligned, there can
be a padding region at the end that is not mmaped or passed into the
DMABUF.

The "CXL" ranges that are remapped into BAR 2 and BAR 4 areas are not PCI
MMIO; they actually run over the CXL-like coherent interconnect and for
the purposes of DMA behave identically to DRAM. We don't try to model this
distinction between true PCI BAR memory that takes a real PCI path and the
"CXL" memory that takes a different path in the p2p framework for now.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-11-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 21:15:17 -07:00
Leon Romanovsky
5d74781ebc vfio/pci: Add dma-buf export support for MMIO regions
Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.

The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.

Currently VFIO can take MMIO regions from the device's BAR and map
them into a PFNMAP VMA with special PTEs. This mapping type ensures
the memory cannot be used with things like pin_user_pages(), hmm, and
so on. In practice only the user process CPU and KVM can safely make
use of these VMA. When VFIO shuts down these VMAs are cleaned by
unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
unbind.

However, VFIO type 1 has an insecure behavior where it uses
follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
into the IOMMU. This has a long history of enabling P2P DMA inside
VMs, but has serious lifetime problems by allowing a UAF of the MMIO
after the VFIO driver has been unbound.

Introduce DMABUF as a new safe way to export a FD based handle for the
MMIO regions. This can be consumed by existing DMABUF importers like
RDMA or DRM without opening an UAF. A following series will add an
importer to iommufd to obsolete the type 1 code and allow safe
UAF-free MMIO P2P in VM cases.

DMABUF has a built in synchronous invalidation mechanism called
move_notify. VFIO keeps track of all drivers importing its MMIO and
can invoke a synchronous invalidation callback to tell the importing
drivers to DMA unmap and forget about the MMIO pfns. This process is
being called revoke. This synchronous invalidation fully prevents any
lifecycle problems. VFIO will do this before unbinding its driver
ensuring there is no UAF of the MMIO beyond the driver lifecycle.

Further, VFIO has additional behavior to block access to the MMIO
during things like Function Level Reset. This is because some poor
platforms may experience an MCE-type crash when touching MMIO of a PCI
device that is undergoing a reset. Today this is done by using
unmap_mapping_range() on the VMAs. Extend that into the DMABUF world
and temporarily revoke the MMIO from the DMABUF importers during FLR
as well. This will more robustly prevent an errant P2P from possibly
upsetting the platform.

A DMABUF FD is a preferred handle for MMIO compared to using something
like a pgmap because:
 - VFIO is supported, including its P2P feature, on archs that don't
   support pgmap
 - PCI devices have all sorts of BAR sizes, including ones smaller
   than a section so a pgmap cannot always be created
 - It is undesirable to waste a lot of memory for struct pages,
   especially for a case like a GPU with ~100GB of BAR size
 - We want a synchronous revoke semantic to support FLR with light
   hardware requirements

Use the P2P subsystem to help generate the DMA mapping. This is a
significant upgrade over the abuse of dma_map_resource() that has
historically been used by DMABUF exporters. Experience with an OOT
version of this patch shows that real systems do need this. This
approach deals with all the P2P scenarios:
 - Non-zero PCI bus_offset
 - ACS flags routing traffic to the IOMMU
 - ACS flags that bypass the IOMMU - though vfio noiommu is required
   to hit this.

There will be further work to formalize the revoke semantic in
DMABUF. For now this acts like a move_notify dynamic exporter where
importer fault handling will get a failure when they attempt to map.
This means that only fully restartable fault capable importers can
import the VFIO DMABUFs. A future revoke semantic should open this up
to more HW as the HW only needs to invalidate, not handle restartable
faults.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-10-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 21:12:19 -07:00
Leon Romanovsky
35c3503908 vfio/pci: Enable peer-to-peer DMA transactions by default
Make sure that all VFIO PCI devices have peer-to-peer capabilities
enabled, so we are able to export their MMIO memory through DMABUF.

VFIO has always supported P2P mappings with itself. VFIO type 1
insecurely reads PFNs directly out of a VMA's PTEs and programs them
into the IOMMU allowing any two VFIO devices to perform P2P to each
other.

All existing VMMs use this capability to export P2P into a VM where
the VM could setup any kind of DMA it likes. Projects like DPDK/SPDK
are also known to make use of this, though less frequently.

As a first step toward more properly integrating VFIO with the P2P
subsystem, unconditionally enable P2P support for VFIO PCI devices. The
struct p2pdma_provider will act as a handle to the P2P subsystem to
do things like DMA mapping.

While real PCI devices have to support P2P (they can't even tell if an
IOVA is P2P or not) there may be fake PCI devices that may trigger
some kind of catastrophic system failure. To date VFIO has never
tripped up on such a case, but if one is discovered the plan is to add
a PCI quirk and have pcim_p2pdma_init() fail. This will fully block
the broken device throughout any users of the P2P subsystem in the
kernel.

Thus P2P through DMABUF will follow the historical VFIO model and be
unconditionally enabled by vfio-pci.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-9-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:46 -07:00
Vivek Kasireddy
47d13c939d vfio/pci: Share the core device pointer while invoking feature functions
There is no need to share the main device pointer (struct vfio_device *)
with all the feature functions as they only need the core device
pointer. Therefore, extract the core device pointer once in the
caller (vfio_pci_core_ioctl_feature) and share it instead.

Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-8-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:40 -07:00
Vivek Kasireddy
64a5dedcff vfio: Export vfio device get and put registration helpers
These helpers are useful for managing additional references taken
on the device from other associated VFIO modules.

Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-7-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:32 -07:00
Leon Romanovsky
3aa31a8bb1 dma-buf: provide phys_vec to scatter-gather mapping routine
Add dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt() helpers to convert
an array of MMIO physical address ranges into scatter-gather tables with
proper DMA mapping.

These common functions are a starting point and support any PCI
drivers creating mappings from their BAR's MMIO addresses. VFIO is one
case, as shortly will be RDMA. We can review existing DRM drivers to
refactor them separately. We hope this will evolve into routines to
help common DRM that include mixed CPU and MMIO mappings.

Compared to the dma_map_resource() abuse this implementation handles
the complicated PCI P2P scenarios properly, especially when an IOMMU
is enabled:

 - Direct bus address mapping without IOVA allocation for
   PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
   happens if the IOMMU is enabled but the PCIe switch ACS flags allow
   transactions to avoid the host bridge.

   Further, this handles the slightly obscure case of MMIO with a
   phys_addr_t that is different from the physical BAR programming
   (bus offset). The phys_addr_t is converted to a dma_addr_t and
   accommodates this effect. This enables certain real systems to
   work, especially on ARM platforms.

 - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
   attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
   This happens when the IOMMU is enabled and the ACS flags are forcing
   all traffic to the IOMMU - ie for virtualization systems.

 - Cases where P2P is not supported through the host bridge/CPU. The
   P2P subsystem is the proper place to detect this and block it.

Helper functions fill_sg_entry() and calc_sg_nents() handle the
scatter-gather table construction, splitting large regions into
UINT_MAX-sized chunks to fit within sg->length field limits.
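
The chunking arithmetic, as a hedged sketch (the in-tree calc_sg_nents()
differs in detail): each physical range contributes enough sg entries so
that no single entry length exceeds UINT_MAX.

  /* Hedged sketch, not the dma-buf helper itself. */
  static unsigned int calc_sg_nents_sketch(const u64 *lengths,
                                           unsigned int nranges)
  {
          unsigned int i, nents = 0;

          for (i = 0; i < nranges; i++)
                  nents += DIV_ROUND_UP(lengths[i], (u64)UINT_MAX);
          return nents;
  }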

Since the physical address based DMA API forbids use of the CPU list
of the scatterlist this will produce a mangled scatterlist that has
a fully zero-length and NULL'd CPU list. The list is 0 length,
all the struct page pointers are NULL and zero sized. This is stronger
and more robust than the existing mangle_sg_table() technique. It is
a future project to migrate DMABUF as a subsystem away from using
scatterlist for this data structure.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-6-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:19 -07:00
Jason Gunthorpe
50d44fce53 PCI/P2PDMA: Document DMABUF model
Reflect latest changes in p2p implementation to support DMABUF lifecycle.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-5-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:09 -07:00
Leon Romanovsky
395698bd2c PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
Provide access to the pci_p2pdma_map_type() function to allow subsystems
to determine the appropriate mapping type for P2PDMA transfers between
a provider and target device.

The pci_p2pdma_map_type() function is the core P2P layer version of
the existing public, but struct page focused, pci_p2pdma_state()
function. It returns the same result. It is required to use the p2p
subsystem from drivers that don't use the struct page layer.

Like __pci_p2pdma_update_state() it is not an exported function. The
idea is that only subsystem code will implement mapping helpers for
taking in phys_addr_t lists, this is deliberately not made accessible
to every driver to prevent abuse.

Following patches will use this function to implement a shared DMA
mapping helper for DMABUF.

Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-4-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:02:00 -07:00
Leon Romanovsky
372d6d1b8a PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
Refactor the PCI P2PDMA subsystem to separate the core peer-to-peer DMA
functionality from the optional memory allocation layer. This creates a
two-tier architecture:

The core layer provides P2P mapping functionality for physical addresses
based on PCI device MMIO BARs and integrates with the DMA API for
mapping operations. This layer is required for all P2PDMA users.

The optional upper layer provides memory allocation capabilities
including gen_pool allocator, struct page support, and sysfs interface
for user space access.

This separation allows subsystems like DMABUF to use only the core P2P
mapping functionality without the overhead of memory allocation features
they don't need. The core functionality is now available through the
new pcim_p2pdma_provider() function that returns a p2pdma_provider
structure.

Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-3-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:01:52 -07:00
Leon Romanovsky
d4504262f7 PCI/P2PDMA: Simplify bus address mapping API
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer
to the p2pdma_provider structure instead of the pci_p2pdma_map_state.
This simplifies the API by removing the need for callers to extract
the provider from the state structure.

The change updates all callers across the kernel (block layer, IOMMU,
DMA direct, and HMM) to pass the provider pointer directly, making
the code more explicit and reducing unnecessary indirection. This
also removes the runtime warning check since callers now have direct
control over which provider they use.

Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-2-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:01:41 -07:00
Leon Romanovsky
f58ef9d1d1 PCI/P2PDMA: Separate the mmap() support from the core logic
Currently the P2PDMA code requires a pgmap and a struct page to
function. These were serving three important purposes:

 - DMA API compatibility, where scatterlist required a struct page as
   input

 - Life cycle management, the percpu_ref is used to prevent UAF during
   device hot unplug

 - A way to get the P2P provider data through the pci_p2pdma_pagemap

The DMA API now has a new flow, and has gained phys_addr_t support, so
it no longer needs struct pages to perform P2P mapping.

Lifecycle management can be delegated to the user, DMABUF for instance
has a suitable invalidation protocol that does not require struct page.

Finding the P2P provider data can also be managed by the caller
without need to look it up from the phys_addr.

Split the P2PDMA code into two layers. The optional upper layer,
effectively, provides a way to mmap() P2P memory into a VMA by
providing struct page, pgmap, a genalloc and sysfs.

The lower layer provides the actual P2P infrastructure and is wrapped
up in a new struct p2pdma_provider. Rework the mmap layer to use new
p2pdma_provider based APIs.

Drivers that do not want to put P2P memory into VMA's can allocate a
struct p2pdma_provider after probe() starts and free it before
remove() completes. When DMA mapping the driver must convey the struct
p2pdma_provider to the DMA mapping code along with a phys_addr of the
MMIO BAR slice to map. The driver must ensure that no DMA mapping
outlives the lifetime of the struct p2pdma_provider.
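
A rough driver-side shape (heavily hedged: the exact pcim_p2pdma_provider()
signature and error convention are assumed here; only the function name
comes from this series):

  static int my_pci_probe_sketch(struct pci_dev *pdev)
  {
          struct p2pdma_provider *provider;

          /* Assumed signature: device + BAR index. */
          provider = pcim_p2pdma_provider(pdev, 0);
          if (!provider)
                  return -ENODEV;

          /*
           * Keep 'provider' around; hand it plus a phys_addr slice of the
           * BAR to the DMA mapping code, and tear every mapping down (for
           * DMABUF, via move_notify) before remove() completes.
           */
          return 0;
  }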

The intended target of this new API layer is DMABUF. There is usually
only a single p2pdma_provider for a DMABUF exporter. Most drivers can
establish the p2pdma_provider during probe, access the single instance
during DMABUF attach and use that to drive the DMA mapping.

DMABUF provides an invalidation mechanism that can guarantee all DMA
is halted and the DMA mappings are undone prior to destroying the
struct p2pdma_provider. This ensures there is no UAF through DMABUFs
that are lingering past driver removal.

The new p2pdma_provider layer cannot be used to create P2P memory that
can be mapped into VMA's, be used with pin_user_pages(), O_DIRECT, and
so on. These use cases must still use the mmap() layer. The
p2pdma_provider layer is principally for DMABUF-like use cases where
DMABUF natively manages the life cycle and access instead of
vmas/pin_user_pages()/struct page.

In addition, remove the bus_off field from pci_p2pdma_map_state since
it duplicates information already available in the pgmap structure.
The bus_offset is only used in one location (pci_p2pdma_bus_addr_map)
and is always identical to pgmap->bus_offset.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-1-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-20 12:00:47 -07:00
Vikas Gupta
8ac050ec3b bng_en: Add RoCE aux device support
Add an auxiliary (aux) device to support RoCE. The base driver is
responsible for creating the auxiliary device and allocating the
required resources to it, which will be owned by the bnge RoCE
driver in future patches.

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Link: https://patch.msgid.link/20251117171136.128193-2-siva.kallam@broadcom.com
Reviewed-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-20 10:35:56 -05:00
Kalesh AP
fecaa0c74f RDMA/bnxt_re: Fix wrong check for CQ coalesc support
The driver is not creating the debugfs hooks for CQ coalescing parameters
because of a wrong check. Fix the condition check inside
bnxt_re_init_cq_coal_debugfs().

Fixes: cf27490790 ("RDMA/bnxt_re: Add a debugfs entry for CQE coalescing tuning")
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251117061306.1140588-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-20 10:30:52 -05:00
Lu Baolu
6cbc09b771 iommu/vt-d: Restore previous domain::aperture_end calculation
Commit d373449d8e ("iommu/vt-d: Use the generic iommu page table")
changed the calculation of domain::aperture_end. Previously, it was
calculated as:

    domain->domain.geometry.aperture_end =
            __DOMAIN_MAX_ADDR(domain->gaw - 1);

where domain->gaw was limited to less than MGAW.

Currently, it is calculated purely based on the max level of the page
table that the hardware supports. This is incorrect as stated in Section
3.6 of the VT-d spec:

  "Software using first-stage translation structures to translate an IO
   Virtual Address (IOVA) must use canonical addresses. Additionally,
   software must limit addresses to less than the minimum of MGAW and the
   lower canonical address width implied by FSPM (i.e., 47-bit when FSPM
   is 4-level and 56-bit when FSPM is 5-level)."

Restore the previous calculation method for domain::aperture_end to avoid
violating the spec. Incorrect aperture calculation causes GPU hangs
without generating VT-d faults on some Intel client platforms.

Fixes: d373449d8e ("iommu/vt-d: Use the generic iommu page table")
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/4f15cf3b-6fad-4cd8-87e5-6d86c0082673@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-20 11:35:15 +01:00
Aashish Sharma
6b38a108ee iommu/vt-d: Fix unused invalidation hint in qi_desc_iotlb
Invalidation hint (ih) in the function 'qi_desc_iotlb' is initialized
to zero and never used. It is embedded in the 0th bit of the 'addr'
parameter. Get the correct 'ih' value from there.
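
In other words (a hedged one-liner sketch of the extraction, with an
illustrative helper name):

  /* Recover the invalidation hint that travels in bit 0 of 'addr'. */
  static inline unsigned int qi_iotlb_ih_sketch(u64 addr)
  {
          return addr & 1;
  }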

Fixes: f701c9f36b ("iommu/vt-d: Factor out invalidation descriptor composition")
Signed-off-by: Aashish Sharma <aashish@aashishsharma.net>
Link: https://lore.kernel.org/r/20251009010903.1323979-1-aashish@aashishsharma.net
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-20 11:33:04 +01:00
Vineeth Pillai (Google)
cb3db5a39e iommu/vt-d: Set INTEL_IOMMU_FLOPPY_WA depend on BLK_DEV_FD
The INTEL_IOMMU_FLOPPY_WA workaround was introduced to create direct
mappings for the first 16MB for floppy devices, as the floppy drivers were
not using the DMA APIs. We need not do this direct map if the floppy
driver is not enabled.

INTEL_IOMMU_FLOPPY_WA is generally not a good idea. The IOMMU will be
mapping pages in this address range while the kernel may also be
allocating from it (mostly under memory pressure). A misbehaving
device using this domain will have access to pages that the
kernel might be actively using. We noticed this while running a test
that was trying to figure out whether any pages used by the kernel are
in the IOMMU page tables.

This patch reduces the scope of the above issue by disabling the
workaround when the floppy driver is not enabled. We would still need to
fix the floppy driver to use the DMA APIs so that we need not do a direct
map without reserving the pages. The other option is to reserve this
memory range in firmware so that the kernel will not use these pages.

Fixes: d850c2ee5f ("iommu/vt-d: Expose ISA direct mapping region via iommu_get_resv_regions")
Fixes: 49a0429e53 ("Intel IOMMU: Iommu floppy workaround")
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Link: https://lore.kernel.org/r/20251002161625.1155133-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-20 11:33:04 +01:00
Serge Hallyn
9891d2f79a Clarify the rootid_owns_currentns
Split most of the rootid_owns_currentns() functionality
into a more generic rootid_owns_ns() function which
will be easier to write tests for.

Rename the functions and variables to make clear that
the ids being tested could be any uid.

Signed-off-by: Serge Hallyn <serge@hallyn.com>
CC: Ryan Foster <foster.ryan.r@gmail.com>
CC: Christian Brauner <brauner@kernel.org>

---
v2: change the function parameter documentation to mollify the bot.
2025-11-18 18:00:19 -06:00
Dave Jiang
ea5514e300 Merge branch 'for-6.19/cxl-misc' into cxl-for-next
- Remove ret_limit race condition in mock_get_event()
- Assign overflow_err_count from log->nr_overflow
2025-11-18 16:27:58 -07:00
Alison Schofield
f1840efdb2 cxl/test: Assign overflow_err_count from log->nr_overflow
mock_get_event() uses an uninitialized local variable, nr_overflow, to
populate the overflow_err_count field. That results in incorrect
overflow_err_count values in mocked cxl_overflow trace events, such as
this case where the records are reported as 0 and should be non-zero:

[] cxl_overflow: memdev=mem7 host=cxl_mem.6 serial=7: log=Failure : 0 records from 1763228189130895685 to 1763228193130896180

Fix by using log->nr_overflow and remove the unused local variable.

A follow-up change was considered in cxl_mem_get_records_log() to
confirm that the overflow_err_count is non-zero when the overflow flag
is set [1]. Since the driver has no functional dependency on this
constraint, and a device that violates this specific requirement does
not cause incorrect driver behavior, no validation check is added.

[1] CXL 3.2, Table 8-65 Get Event Records Output Payload

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251116013036.1713313-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-18 16:21:57 -07:00
Alison Schofield
b6369daf0d cxl/test: Remove ret_limit race condition in mock_get_event()
Commit 364ee9f326 ("cxl/test: Enhance event testing") changed the
loop iterator in mock_get_event() from a static constant,
CXL_TEST_EVENT_CNT, to a dynamic global variable, ret_limit. The
intent was to vary the number of events returned per call to simulate
events occurring while logs are being read.

However, ret_limit is modified without synchronization. When multiple
threads call mock_get_event() concurrently, one thread may read
ret_limit, another thread may increment it, and the first thread's
loop condition and size calculation see and use the updated value.

This is visible during cxl_test module load when all memdevs are
initializing simultaneously, which includes getting event records. It
is not tied to the cxl-events.sh unit test specifically, as that
operates on a single memdev.

While no actual harm results (the buffer is always large enough and
the record count fields correctly reflect what was written), this is
a correctness issue. The race creates an inconsistent state within
mock_get_event() and adding variability based on a race appears
unintended.

Make ret_limit a local variable populated from an atomic counter. Each
call gets a stable value that won't change during execution. That
preserves the intended behavior of varying the return counts across
calls while eliminating the race condition.

This implementation uses "+ 1" to produce the full range of 1 to
CXL_TEST_EVENT_RET_MAX (4) records. Previously only 1, 2, 3 were
produced.
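
For illustration, the general pattern described above might look like
this (names are illustrative, not the exact cxl_test code):

  static atomic_t event_calls = ATOMIC_INIT(0);

  static int mock_ret_limit(void)
  {
          /* stable per-call limit in the range 1..CXL_TEST_EVENT_RET_MAX */
          return (atomic_inc_return(&event_calls) % CXL_TEST_EVENT_RET_MAX) + 1;
  }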

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251116013819.1713780-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-18 16:19:48 -07:00
Dave Jiang
b5bea8cee5 Merge branch 'for-6.19/cxl-misc' into cxl-for-next
- remove unused mock function for cxl_rcd_component_reg_phys()
2025-11-18 15:41:53 -07:00
Alejandro Lucero
26c5b0d9c0 cxl/test: remove unused mock function for cxl_rcd_component_reg_phys()
Since commit 733b57f262 ("cxl/pci: Early setup RCH dport component registers from RCRB"),
the mock for cxl_rcd_component_reg_phys() is no longer necessary under mocking tests.

[ dj: Fixup commit representation flagged by checkpatch. ]
[ dj: Amend subject line to indicate which function. ]

Signed-off-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: https://patch.msgid.link/20251118182202.2083244-1-alejandro.lucero-palau@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-18 15:33:50 -07:00
René Rebe
1d779fa996 ata: pata_pcmcia: Add Iomega Clik! PCMCIA ATA/ATAPI Adapter
Add Iomega Clik! "PCMCIA ATA/ATAPI Adapter" ID to pata_pcmcia.

Signed-off-by: René Rebe <rene@exactco.de>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
2025-11-18 21:58:41 +01:00
Dave Jiang
7ec9db66cc Merge branch 'for-6.19/cxl-elc-test' into cxl-for-next
Extended linear cache unit testing support
  - Standardize CXL auto region size
  - Add cxl_test CFMWS support for extended linear cache
  - Add support for acpi extended linear cache
2025-11-17 11:05:47 -07:00
Dave Jiang
68f4a852e1 cxl/test: Add support for acpi extended linear cache
Add the mock wrappers for hmat_get_extended_linear_cache_size() in order
to emulate the ACPI helper function for the regions that are mock'd by
cxl_test.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Link: https://patch.msgid.link/20251117144611.903692-4-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-17 09:46:44 -07:00
Dave Jiang
4b1c0466c8 cxl/test: Add cxl_test CFMWS support for extended linear cache
Add a module parameter to allow activation of extended linear cache
on the auto region for cxl_test. The current platform implementation
for extended linear cache is 1:1 of DRAM and CXL memory. A CFMWS is
created with the size of both memory together where DRAM takes the
first part of the memory range and CXL covers the second part. The
current CXL auto region on cxl_test consists of two 256M devices that
create a 512M region. The new extended linear cache setup will have
512M DRAM and 512M CXL memory for a total of 1G CFMWS. The hardware
decoders must have their starting offset moved to after the DRAM region
to handle the CXL regions.

[ dj: Fixup commenting style. (Jonathan) ]

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Link: https://patch.msgid.link/20251117144611.903692-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-17 09:46:42 -07:00
Dave Jiang
fa59c35167 cxl/test: Standardize CXL auto region size
Create a global define for the size of the mock CXL auto region used
in cxl_test. Remove the declared size in mock_init_hdm_decoder()
function.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
Link: https://patch.msgid.link/20251117144611.903692-2-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-17 09:44:38 -07:00
Johan Hovold
c08934a612 iommu/tegra: fix device leak on probe_device()
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during probe_device().

Note that commit 9826e393e4 ("iommu/tegra-smmu: Fix missing
put_device() call in tegra_smmu_find") fixed the leak in an error path,
but the reference is still leaking on success.
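
The general shape shared by this series of fixes, as a sketch
(driver-specific types are hypothetical):

  static struct example_smmu *example_smmu_get(struct device_node *np)
  {
          struct platform_device *pdev = of_find_device_by_node(np);
          struct example_smmu *smmu;

          if (!pdev)
                  return ERR_PTR(-EPROBE_DEFER);

          smmu = platform_get_drvdata(pdev);
          /* drop the reference taken by of_find_device_by_node() */
          put_device(&pdev->dev);
          if (!smmu)
                  return ERR_PTR(-EPROBE_DEFER);

          return smmu;
  }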

Fixes: 8918465163 ("memory: Add NVIDIA Tegra memory controller support")
Cc: stable@vger.kernel.org	# 3.19: 9826e393e4
Cc: Miaoqian Lin <linmq006@gmail.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:45 +01:00
Johan Hovold
f916109bf5 iommu/sun50i: fix device leak on of_xlate()
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during of_xlate().

Fixes: 4100b8c229 ("iommu: Add Allwinner H6 IOMMU driver")
Cc: stable@vger.kernel.org	# 5.8
Cc: Maxime Ripard <mripard@kernel.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:45 +01:00
Johan Hovold
13e1d629d8 iommu/omap: simplify probe_device() error handling
Simplify the probe_device() error handling by dropping the iommu OF node
reference sooner.

Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:45 +01:00
Johan Hovold
b587069106 iommu/omap: fix device leaks on probe_device()
Make sure to drop the references taken to the iommu platform devices
when looking up their driver data during probe_device().

Note that the arch data device pointer added by commit 604629bcb5
("iommu/omap: add support for late attachment of iommu devices") has
never been used. Remove it to underline that the references are not
needed.

Fixes: 9d5018deec ("iommu/omap: Add support to program multiple iommus")
Fixes: 7d6827748d ("iommu/omap: Fix iommu archdata name for DT-based devices")
Cc: stable@vger.kernel.org	# 3.18
Cc: Suman Anna <s-anna@ti.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:44 +01:00
Johan Hovold
ab31cf041e iommu/mediatek-v1: add missing larb count sanity check
Add the missing larb count sanity check to avoid writing beyond a fixed
sized array in case of a malformed devicetree.

Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:44 +01:00
Johan Hovold
46207625c9 iommu/mediatek-v1: fix device leaks on probe()
Make sure to drop the references taken to the larb devices during
probe on probe failure (e.g. probe deferral) and on driver unbind.

Fixes: b17336c55d ("iommu/mediatek: add support for mtk iommu generation one HW")
Cc: stable@vger.kernel.org	# 4.8
Cc: Honghui Zhang <honghui.zhang@mediatek.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:44 +01:00
Johan Hovold
c77ad28bfe iommu/mediatek-v1: fix device leak on probe_device()
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during probe_device().

Fixes: b17336c55d ("iommu/mediatek: add support for mtk iommu generation one HW")
Cc: stable@vger.kernel.org	# 4.8
Cc: Honghui Zhang <honghui.zhang@mediatek.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:44 +01:00
Johan Hovold
4f2a4aec1c iommu/mediatek: simplify dt parsing error handling
As previously documented by commit 2659392856 ("iommu/mediatek: Add
error path for loop of mm_dts_parse"), the id mapping may not be linear
so the whole larb array needs to be iterated on devicetree parsing
errors.

Simplify the loop by iterating from index zero while dropping the
redundant NULL check for consistency with later cleanups.

Also add back the comment which was removed by commit 462e768b55
("iommu/mediatek: Fix forever loop in error handling") to prevent anyone
from trying to optimise the loop by iterating backwards from 'i'.

Cc: Yong Wu <yong.wu@mediatek.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:43 +01:00
Johan Hovold
de83d4617f iommu/mediatek: fix use-after-free on probe deferral
The driver is dropping the references taken to the larb devices during
probe after successful lookup as well as on errors. This can
potentially lead to a use-after-free in case a larb device has not yet
been bound to its driver so that the iommu driver probe defers.

Fix this by keeping the references as expected while the iommu driver is
bound.

Fixes: 2659392856 ("iommu/mediatek: Add error path for loop of mm_dts_parse")
Cc: stable@vger.kernel.org
Cc: Yong Wu <yong.wu@mediatek.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:43 +01:00
Johan Hovold
b3f1ee1828 iommu/mediatek: fix device leak on of_xlate()
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during of_xlate().

Fixes: 0df4fabe20 ("iommu/mediatek: Add mt8173 IOMMU driver")
Cc: stable@vger.kernel.org	# 4.6
Acked-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:43 +01:00
Johan Hovold
80aa518452 iommu/ipmmu-vmsa: fix device leak on of_xlate()
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during of_xlate().

Fixes: 7b2d59611f ("iommu/ipmmu-vmsa: Replace local utlb code with fwspec ids")
Cc: stable@vger.kernel.org	# 4.14
Cc: Magnus Damm <damm+renesas@opensource.se>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:43 +01:00
Johan Hovold
05913cc43c iommu/exynos: fix device leak on of_xlate()
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during of_xlate().

Note that commit 1a26044954 ("iommu/exynos: add missing put_device()
call in exynos_iommu_of_xlate()") fixed the leak in a couple of error
paths, but the reference is still leaking on success.

Fixes: aa759fd376 ("iommu/exynos: Add callback for initializing devices from device tree")
Cc: stable@vger.kernel.org	# 4.2: 1a26044954
Cc: Yu Kuai <yukuai3@huawei.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:42 +01:00
Johan Hovold
6a3908ce56 iommu/qcom: fix device leak on of_xlate()
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during of_xlate().

Note that commit e2eae09939 ("iommu/qcom: add missing put_device()
call in qcom_iommu_of_xlate()") fixed the leak in a couple of error
paths, but the reference is still leaking on success and late failures.

Fixes: 0ae349a0f3 ("iommu/qcom: Add qcom_iommu")
Cc: stable@vger.kernel.org	# 4.14: e2eae09939
Cc: Rob Clark <robin.clark@oss.qualcomm.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:42 +01:00
Johan Hovold
a6eaa872c5 iommu/apple-dart: fix device leak on of_xlate()
Make sure to drop the reference taken to the iommu platform device when
looking up its driver data during of_xlate().

Fixes: 46d1fb072e ("iommu/dart: Add DART iommu driver")
Cc: stable@vger.kernel.org	# 5.15
Cc: Sven Peter <sven@kernel.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:49:42 +01:00
Bagas Sanjaya
4fd2467686 iommupt: Actually correct pt_test_sw_bit_{acquire_release}() parameter description
In review comment for v1 of genpt documentation fixes [1], Randy
suggested that pt_test_sw_bit_acquire() parameters description
should be written using "to read". Commit e4dfaf25df ("iommupt:
Describe @bitnr parameter"), however, misunderstood the review and
instead used "to read" for the @bitnr parameter of both
pt_test_sw_bit_acquire() and pt_test_sw_bit_release().

Actually correct the description.

[1]: https://lore.kernel.org/linux-doc/9dba0eb7-6f32-41b7-b70b-12379364585f@infradead.org/

Fixes: e4dfaf25df ("iommupt: Describe @bitnr parameter")
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-17 09:46:22 +01:00
Dave Jiang
33bedb92d2 Merge branch 'for-6.19/cxl-prm' into cxl-for-next
- Simplify cxl_rd_ops allocation
- Group xor arithmetic setup code
- Remove local variable @inc in cxl_port_setup_targets()
2025-11-14 11:11:46 -07:00
Robert Richter
7e71fa6e01 cxl/region: Remove local variable @inc in cxl_port_setup_targets()
Simplify the code by removing local variable @inc. The variable is not
used elsewhere, remove it and directly increment the target number.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Link: https://patch.msgid.link/20251114075844.1315805-4-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-14 10:37:29 -07:00
Robert Richter
c42a4d2ee3 cxl/acpi: Group xor arithmetric setup code in a single block
Simplify the xor arithmetic setup code by grouping it in a single
block. No need to split the block for QoS setup.

It is safe to reorder the call of cxl_setup_extended_linear_cache()
because there are no dependencies.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Tested-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20251114075844.1315805-3-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-14 10:37:29 -07:00
Robert Richter
6123133ee9 cxl: Simplify cxl_rd_ops allocation and handling
A root decoder's callback handlers are collected in struct cxl_rd_ops.
The structure is dynamically allocated even though it contains only a
few pointers. This also requires checking two pointers to determine
whether a callback exists.

Simplify the allocation, release and handler check by embedding the
ops statically in struct cxl_root_decoder.

Implementation is equivalent to how struct cxl_root_ops handles the
callbacks.

[ dj: Fix spelling error in commit log. ]

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Robert Richter <rrichter@amd.com>
Link: https://patch.msgid.link/20251114075844.1315805-2-rrichter@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-14 10:37:13 -07:00
Dave Jiang
482dc84e91 Merge branch 'for-6.19/cxl-elc' into cxl-for-next
- Add extended linear cache size sysfs attribute.
- Adjust failure emission of extended linear cache detection in cxl_acpi.
2025-11-13 08:47:49 -07:00
Dave Jiang
87c69670da Merge branch 'for-6.19/cxl-addr-xlat' into cxl-for-next
Enable unit testing for XOR address translation of SPA to DPA and vice versa.
2025-11-13 08:44:00 -07:00
Dave Jiang
2be575434e Merge branch 'for-6.19/cxl-misc' into cxl-for-next
Misc patches for CXL 6.19
- Remove incorrect page-allocator quirk section in documentation.
- Remove unused devm_cxl_port_enumerate_dports() function.
- Fix typo in cdat.c code comment.
- Replace use of system_wq with system_percpu_wq
- Add locked decoder support
- Return when generic target updated
- Rename region_res_match_cxl_range() to spa_maps_hpa()
- Clarify comment in spa_maps_hpa()
2025-11-13 08:42:46 -07:00
Mostafa Saleh
7e06063a43 iommu/io-pgtable-arm-selftests: Use KUnit
Integrate the selftests as part of kunit.

Now instead of the test only being run at boot, it can run:

- With CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST=y
  It will automatically run at boot as before.

- Otherwise with CONFIG_IOMMU_IO_PGTABLE_LPAE_KUNIT_TEST=m:
  1) on module load:
     Once the module loads, the selftest will run
     # modprobe io-pgtable-arm-selftests

  2) debugfs
     With CONFIG_KUNIT_DEBUGFS=y you can run the test with
     # echo 1 > /sys/kernel/debug/kunit/io-pgtable-arm-test/run

  3) Using kunit.py
     You can also use the helper script which uses Qemu in the background

     # ./tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 \
       --make_options LLVM=1 --kunitconfig ./kunit/kunitconfig
      [18:01:09] ============= io-pgtable-arm-test (1 subtest) ==============
      [18:01:09] [PASSED] arm_lpae_do_selftests
      [18:01:09] =============== [PASSED] io-pgtable-arm-test ===============
      [18:01:09] ============================================================
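
For reference, the KUnit registration boilerplate this amounts to looks
roughly like the following (names are illustrative; the wrapped selftest
entry point is hypothetical):

  #include <kunit/test.h>

  static void arm_lpae_do_selftests(struct kunit *test)
  {
          /* hypothetical wrapper around the existing selftest entry point */
          KUNIT_EXPECT_EQ(test, 0, arm_lpae_run_tests());
  }

  static struct kunit_case arm_lpae_test_cases[] = {
          KUNIT_CASE(arm_lpae_do_selftests),
          {}
  };

  static struct kunit_suite arm_lpae_test_suite = {
          .name = "io-pgtable-arm-test",
          .test_cases = arm_lpae_test_cases,
  };
  kunit_test_suite(arm_lpae_test_suite);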

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-13 16:25:32 +01:00
Mostafa Saleh
a3c24b6d7c iommu/io-pgtable-arm-selftests: Modularize the test
Remove the __init constraint; as the test will be converted to KUnit,
it can run on demand later.

Also, as KUnit can be a module, make this test modular.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-13 16:25:32 +01:00
Mostafa Saleh
699b059962 iommu/io-pgtable-arm: Move selftests to a separate file
Clean up the io-pgtable-arm library by moving the selftests out.
Next the tests will be registered with kunit.

This is also useful to factor out kernel-specific code, so
it can be compiled as part of the hypervisor object.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-13 16:25:31 +01:00
Mostafa Saleh
d8546833cf iommu/io-pgtable-arm: Remove arm_lpae_dump_ops()
At the moment, if the selftest fails it prints a lot of information
about the page table (size, levels, ...). This requires access to many
internals, which would have to be exposed in the next patch moving the
tests out.

Instead, we can simplify the output to print only the fmt; ias, oas and
pgsize_bitmap are already printed for each test. That is enough to
identify the failing case, and the rest can be deduced from the code.

Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-13 16:25:31 +01:00
Jinhui Guo
75ba146c26 iommu/amd: Fix pci_segment memleak in alloc_pci_segment()
Fix a memory leak of struct amd_iommu_pci_segment in alloc_pci_segment()
when system memory (or contiguous memory) is insufficient.

Fixes: 04230c1199 ("iommu/amd: Introduce per PCI segment device table")
Fixes: eda797a277 ("iommu/amd: Introduce per PCI segment rlookup table")
Fixes: 99fc4ac3d2 ("iommu/amd: Introduce per PCI segment alias_table")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-13 16:15:56 +01:00
Dheeraj Kumar Srivastava
d1e281f832 iommu/amd: Enhance "Completion-wait Time-out" error message
The current IOMMU driver prints the "Completion-wait Time-out" error
message with insufficient information to further debug the issue.

Enhance the error message as follows:
1. Log IOMMU PCI device ID in the error message.
2. With "amd_iommu_dump=1" kernel command line option, dump entire
   command buffer entries including Head and Tail offset.

Dump the entire command buffer only on the first 'Completion-wait Time-out'
to avoid dmesg spam.
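
As a sketch of the behavior described above (helper and parameter names
are illustrative, not the actual driver code):

  static void report_comwait_timeout(struct pci_dev *pdev, bool dump_enabled)
  {
          static bool dumped;

          pci_err(pdev, "Completion-wait Time-out\n");

          /* dump the command buffer only on the first timeout */
          if (dump_enabled && !dumped) {
                  dumped = true;
                  dump_command_buffer(pdev);      /* hypothetical helper */
          }
  }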

Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Reviewed-by: Ankit Soni <Ankit.Soni@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-13 16:13:26 +01:00
Li RongQing
d056bc45b6 RDMA/core: Prevent soft lockup during large user memory region cleanup
When a process exits with numerous large, pinned memory regions consisting
of 4KB pages, the cleanup of the memory region through __ib_umem_release()
may cause soft lockups. This is because unpin_user_page_range_dirty_lock()
is called in a tight loop for unpin and releasing page without yielding the
CPU.

 watchdog: BUG: soft lockup - CPU#44 stuck for 26s! [python3:73464]
 Kernel panic - not syncing: softlockup: hung tasks
 CPU: 44 PID: 73464 Comm: python3 Tainted: G           OEL

 asm_sysvec_apic_timer_interrupt+0x1b/0x20
 RIP: 0010:free_unref_page+0xff/0x190

  ? free_unref_page+0xe3/0x190
  __put_page+0x77/0xe0
  put_compound_head+0xed/0x100
  unpin_user_page_range_dirty_lock+0xb2/0x180
  __ib_umem_release+0x57/0xb0 [ib_core]
  ib_umem_release+0x3f/0xd0 [ib_core]
  mlx5_ib_dereg_mr+0x2e9/0x440 [mlx5_ib]
  ib_dereg_mr_user+0x43/0xb0 [ib_core]
  uverbs_free_mr+0x15/0x20 [ib_uverbs]
  destroy_hw_idr_uobject+0x21/0x60 [ib_uverbs]
  uverbs_destroy_uobject+0x38/0x1b0 [ib_uverbs]
  __uverbs_cleanup_ufile+0xd1/0x150 [ib_uverbs]
  uverbs_destroy_ufile_hw+0x3f/0x100 [ib_uverbs]
  ib_uverbs_close+0x1f/0xb0 [ib_uverbs]
  __fput+0x9c/0x280
  ____fput+0xe/0x20
  task_work_run+0x6a/0xb0
  do_exit+0x217/0x3c0
  do_group_exit+0x3b/0xb0
  get_signal+0x150/0x900
  arch_do_signal_or_restart+0xde/0x100
  exit_to_user_mode_loop+0xc4/0x160
  exit_to_user_mode_prepare+0xa0/0xb0
  syscall_exit_to_user_mode+0x27/0x50
  do_syscall_64+0x63/0xb0

Fix the soft lockup by incorporating cond_resched() calls within
__ib_umem_release(). Since SG entries are typically grouped in 2MB
chunks on x86_64, adding cond_resched() should have minimal performance
impact.
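
The shape of the fix, as a sketch (the real __ib_umem_release() differs
in detail):

  static void umem_unpin_pages(struct sg_table *sgt, bool make_dirty)
  {
          struct scatterlist *sg;
          unsigned int i;

          for_each_sgtable_sg(sgt, sg, i) {
                  unpin_user_page_range_dirty_lock(sg_page(sg),
                                  DIV_ROUND_UP(sg->length, PAGE_SIZE),
                                  make_dirty);
                  /* yield periodically to avoid soft lockups */
                  cond_resched();
          }
  }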

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20251113095317.2628-1-lirongqing@baidu.com
Acked-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-13 08:35:16 -05:00
Kalesh AP
d43358cda7 RDMA/restrack: Fix typos in the comments
Fix couple of occurrences of the misspelled word "reource"
in the comments with the correct spelling "resource".

Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251113105457.879903-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-13 06:27:39 -05:00
Jason Gunthorpe
56c069307d vfio: Remove the get_region_info op
No driver uses it now, all are using get_region_info_caps().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/22-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:06:41 -07:00
Jason Gunthorpe
dc10734610 vfio: Move the remaining drivers to get_region_info_caps
Remove the duplicate code and change info to a pointer. caps are not used.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:06:41 -07:00
Jason Gunthorpe
182c62861b vfio/platform: Convert to get_region_info_caps
Remove the duplicate code and change info to a pointer. caps are not used.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:06:41 -07:00
Jason Gunthorpe
1b0ecb5baf vfio/pci: Convert all PCI drivers to get_region_info_caps
Since the core function signature changes it has to flow up to all
drivers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:06:40 -07:00
Jason Gunthorpe
973af0c40e vfio/ccw: Convert to get_region_info_caps
Remove the duplicate code and flatten the call chain.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/18-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:03 -07:00
Jason Gunthorpe
93165757c0 vfio/gvt: Convert to get_region_info_caps
Remove the duplicate code and change info to a pointer.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/17-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:03 -07:00
Jason Gunthorpe
45f9fa1810 vfio/mbochs: Convert mbochs to use vfio_info_add_capability()
This driver open codes the cap chain manipulations. Instead use
vfio_info_add_capability() and the get_region_info_caps() op.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/16-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
775f726a74 vfio: Add get_region_info_caps op
This op does the copy to/from user for the info and can return back
a cap chain through a vfio_info_cap * result.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/15-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
f978595038 vfio: Require drivers to implement get_region_info
Remove the fallback through the ioctl callback, no drivers use this now.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
e664067b60 vfio/gvt: Provide a get_region_info op
Move it out of intel_vgpu_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/13-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
61b3f7b5a7 vfio/ccw: Provide a get_region_info op
Move it out of vfio_ccw_mdev_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/12-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
b9827eff6b vfio/cdx: Provide a get_region_info op
Change the signature of vfio_cdx_ioctl_get_region_info() and hook it to
the op.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
6cdae5d0c3 vfio/fsl: Provide a get_region_info op
Move it out of vfio_fsl_mc_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/10-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
d4635df279 vfio/platform: Provide a get_region_info op
Move it out of vfio_platform_ioctl() and re-indent it. Add it to all
platform drivers.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
8339fccda8 vfio/mbochs: Provide a get_region_info op
Move it out of mbochs_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/8-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
cf16acc0af vfio/mdpy: Provide a get_region_info op
Move it out of mdpy_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
0787755271 vfio/mtty: Provide a get_region_info op
Move it out of mtty_ioctl() and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
f3fddb71dd vfio/pci: Fill in the missing get_region_info ops
Now that every variant driver provides a get_region_info op remove the
ioctl based dispatch from vfio_pci_core_ioctl().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
5ac7206474 vfio/nvgrace: Convert to the get_region_info op
Change the signature of nvgrace_gpu_ioctl_get_region_info()

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
c044eefa47 vfio/virtio: Convert to the get_region_info op
Remove virtiovf_vfio_pci_core_ioctl() and change the signature of
virtiovf_pci_ioctl_get_region_info().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:02 -07:00
Jason Gunthorpe
e238f147d5 vfio/hisi: Convert to the get_region_info op
Change the function signature of hisi_acc_vfio_pci_ioctl()
and re-indent it.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 15:05:01 -07:00
Dave Jiang
8d27dd0b21 cxl: Clarify comment in spa_maps_hpa()
Update the comment in spa_maps_hpa() to clearly convey the construction
of extended linear cache.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/linux-cxl/68eea19c7e67e_2f899100a8@dwillia2-mobl4.notmuch/
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20251106170108.1468304-3-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-12 15:04:10 -07:00
Dave Jiang
c43521b9db cxl: Rename region_res_match_cxl_range() to spa_maps_hpa()
The function name region_res_match_cxl_range() does not accurately
convey the operation of address comparison with cache size. Rename
to spa_maps_hpa() to provide a better function name.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/linux-cxl/68eea19c7e67e_2f899100a8@dwillia2-mobl4.notmuch/
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20251106170108.1468304-2-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-12 15:04:10 -07:00
Jason Gunthorpe
113557b040 vfio: Provide a get_region_info op
Instead of hooking the general ioctl op, have the core code directly
decode VFIO_DEVICE_GET_REGION_INFO and call an op just for it.

This is intended to allow mechanical changes to the drivers to pull their
VFIO_DEVICE_GET_REGION_INFO into a function. Later patches will improve
the function signature to consolidate more code.
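
A rough sketch of the dispatch this introduces (illustrative names and
signatures; not the exact vfio core code):

  struct example_vfio_ops {
          long (*ioctl)(struct vfio_device *vdev, unsigned int cmd,
                        unsigned long arg);
          int (*get_region_info)(struct vfio_device *vdev,
                                 struct vfio_region_info __user *arg);
  };

  static long example_device_ioctl(struct vfio_device *vdev,
                                   const struct example_vfio_ops *ops,
                                   unsigned int cmd, unsigned long arg)
  {
          if (cmd == VFIO_DEVICE_GET_REGION_INFO && ops->get_region_info)
                  return ops->get_region_info(vdev,
                                  (struct vfio_region_info __user *)arg);
          /* fall back to the legacy per-driver ioctl handler */
          return ops->ioctl(vdev, cmd, arg);
  }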

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-12 14:58:23 -07:00
Dave Jiang
15e1426788 acpi/hmat: Return when generic target is updated
With the current code flow, once the generic target is updated
target->registered is set and the remaining code is skipped.
So return immediately instead of going through the checks and
then skipping.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251105235115.85062-2-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-12 14:15:33 -07:00
Dave Jiang
2230c4bdc4 cxl: Add handling of locked CXL decoder
When a decoder is locked, it means that its configuration cannot be
changed. CXL spec r3.2 8.2.4.20.13 discusses the details regarding
locked decoders. Locking happens when bit 8 of the decoder control
register is set and then the decoder is committed afterwards (CXL
spec r3.2 8.2.4.20.7).

Given that the driver creates a virtual decoder for each CFMWS, the
Fixed Device Configuration (bit 4) of the Window Restriction field is
considered as locking for the virtual decoder by the driver.

The current driver code disregards the locked status and a region can
be destroyed regardless of the locking state.

Add a region flag to indicate the region is in a locked configuration.
The driver will consider a region locked if the CFMWS or any decoder
is configured as locked; the determination is all or nothing regarding
the locked state. It is reasonable to determine the region's "locked"
status from the decoders while the region is being assembled.

Add a check in region commit_store() to intercept when a 0 is written
to the commit sysfs attribute in order to prevent the destruction of a
region when in locked state. This should be the only entry point from user
space to destroy a region.

Add a check to cxl_decoder_reset() to prevent resetting a locked
decoder within the kernel driver.
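
As an illustration of the commit_store() guard (flag and helper names
below are illustrative, not necessarily the real driver's):

  static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
                              const char *buf, size_t len)
  {
          struct cxl_region *cxlr = to_cxl_region(dev);
          bool commit;
          int rc;

          rc = kstrtobool(buf, &commit);
          if (rc)
                  return rc;

          /* refuse to tear down a region assembled from locked decoders */
          if (!commit && test_bit(CXL_REGION_F_LOCKED, &cxlr->flags))
                  return -EPERM;

          /* ... existing commit/decommit handling ... */
          return len;
  }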

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251105201826.2901915-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-12 13:17:19 -07:00
Tuo Li
a49a9f4555 RDMA/irdma: Remove redundant NULL check of udata in irdma_create_user_ah()
The variable udata cannot be NULL because irdma_create_user_ah() always
receives it. Therefore, the if() check can be safely removed.

Signed-off-by: Tuo Li <islituo@gmail.com>
Link: https://patch.msgid.link/20251112120253.68945-1-islituo@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-12 07:10:25 -05:00
Randy Dunlap
4e5cba5bb6 RDMA/cm: Correct typedef and bad line warnings
In include/rdma/ib_cm.h:

Correct a typedef's kernel-doc notation by adding the 'typedef' keyword
to it to avoid a warning.
Add a leading " *" to a kernel-doc line to avoid a warning.

Warning: ib_cm.h:289 function parameter 'ib_cm_handler' not described
 in 'int'
Warning: ib_cm.h:289 expecting prototype for ib_cm_handler().  Prototype
 was for int() instead
Warning: ib_cm.h:484 bad line: connection message in case duplicates
 are received.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251112062908.2711007-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-12 04:39:12 -05:00
Leon Romanovsky
736c561950 Expose definition for 1600Gbps link mode
Single patch to expose new link mode for 1600Gbps, utilizing 8 lanes at
200Gbps per lane.

Signed-off-by: Leon Romanovsky <leon@kernel.org>

* mlx5-next:
  net/mlx5: Expose definition for 1600Gbps link mode
2025-11-12 03:52:18 -05:00
Ma Ke
a338d6e849 RDMA/rtrs: server: Fix error handling in get_or_create_srv
After device_initialize() is called, use put_device() to release the
device according to kernel device management rules. While direct
kfree() works in this case, using put_device() is more correct.

Found by code review.
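
The rule in question, sketched (names are made up for illustration):

  static struct example_srv *example_srv_alloc(void)
  {
          struct example_srv *srv = kzalloc(sizeof(*srv), GFP_KERNEL);

          if (!srv)
                  return ERR_PTR(-ENOMEM);

          device_initialize(&srv->dev);
          srv->dev.release = example_srv_release;   /* frees srv */

          if (example_srv_setup(srv)) {
                  /* after device_initialize(), release via put_device() */
                  put_device(&srv->dev);
                  return ERR_PTR(-ENOMEM);
          }
          return srv;
  }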

Fixes: 9cb8374804 ("RDMA/rtrs: server: main functionality")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Link: https://patch.msgid.link/20251110005158.13394-1-make24@iscas.ac.cn
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-10 03:09:41 -05:00
Marco Crivellari
5c467151f6 IB/isert: add WQ_PERCPU to alloc_workqueue users
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
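
As a minimal illustration of the conversion (the queue name is made up;
WQ_PERCPU comes from the workqueue commits cited above):

  #include <linux/workqueue.h>

  static struct workqueue_struct *example_wq;

  static int __init example_init(void)
  {
          /* explicitly per-CPU; previously this relied on the implicit default */
          example_wq = alloc_workqueue("example_wq",
                                       WQ_PERCPU | WQ_MEM_RECLAIM, 0);
          return example_wq ? 0 : -ENOMEM;
  }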

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107133626.190952-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 06:30:58 -05:00
Marco Crivellari
65d21dee53 IB/iser: add WQ_PERCPU to alloc_workqueue users
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251107133306.187939-1-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 06:30:48 -05:00
Jacob Moroni
5dd68a5914 RDMA/irdma: Remove unused CQ registry
The CQ registry was never actually used (ceq->reg_cq was always NULL),
so remove the dead code.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20251105162841.31786-1-jmoroni@google.com
Acked-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-11-09 06:13:57 -05:00
Patrisious Haddad
6e79e21005 RDMA/mlx5: Add other eswitch support to userspace tables
Allows the creation of RDMA TRANSPORT tables over VFs/SFs that
belong to another eswitch manager, which is only possible for PFs that
were connected via a create_lag PRM command.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-7-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:21:46 -05:00
Patrisious Haddad
f277662b73 RDMA/mlx5: Refactor _get_prio() function
Refactor the _get_prio() function to remove redundant arguments by
reusing the existing flow table attributes struct instead of passing
attributes separately. This improves code clarity and maintainability.

In addition, this allows a downstream patch to add a new parameter
without needing to change __get_prio() arguments.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-6-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:21:42 -05:00
Patrisious Haddad
5939decc64 RDMA/mlx5: Add other_eswitch support for devx destruction
When building a devx object destruction command for steering objects,
take the other_eswitch argument into account to allow proper destruction
of objects that were created with it.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-5-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:21:38 -05:00
Patrisious Haddad
3506242da0 RDMA/mlx5: Change default device for LAG slaves in RDMA TRANSPORT namespaces
In case of a LAG configuration, change the root namespace core device for
all of the LAG slaves to be the core device of the master device for
RDMA_TRANSPORT namespaces, in order to ensure all tables are created
through the master device.
Once the LAG is disabled, revert to the native core device.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20251029-support-other-eswitch-v1-4-98bb707b5d57@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:21:33 -05:00
Leon Romanovsky
d06ccdc952 Add other eswitch support
When the device is in switchdev mode, the RDMA device manages all the
vports which belong to its representors, which can lead to a situation
where the PF that is used to manage the RDMA device isn't the native PF
of some of the vports it manages.

Add infrastructure to allow the master PF to manage all the hardware
resources for the vports under its management.
Currently the only such resource is RDMA TRANSPORT steering
domains.

That is done by adding a new FW argument, other_eswitch, which is passed
by the driver to the FW to allow the master PF to properly manage vports
belonging to another native PF.

Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 05:17:37 -05:00
Kalesh AP
cf27490790 RDMA/bnxt_re: Add a debugfs entry for CQE coalescing tuning
This patch adds debugfs interfaces that allow the user to
enable/disable RoCE CQ coalescing and fine-tune certain
CQ coalescing parameters, which is helpful during debug.

Signed-off-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/20251103043425.234846-1-kalesh-anakkur.purayil@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-09 04:02:27 -05:00
Joerg Roedel
9ad648017b iommu/iommupt: Fix build error in genericpt unit-tests
Fix the include of iommu-pages.h in the KUnit tests for the IOMMU
generic page-table code to make the tests compile again.

Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Closes: https://lore.kernel.org/all/9844d4cb-f517-478b-9911-b6dc1a963b8e@leemhuis.info/
Reported-by: "Longia, Amandeep Kaur" <amandeepkaur.longia@amd.com>
Closes: https://lore.kernel.org/all/e641c955-25ad-4eae-b3fe-4392966cf768@amd.com/
Fixes: 1dd4187f53 ("iommupt: Add a kunit test for Generic Page Table")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-07 16:06:19 +01:00
Jason Gunthorpe
5cb637d942 iommupt: Documentation fixes
Some adjustments pointed out by Randy:

 "decodes an full 64-bit" -> "decodes the full 64 bit"

 Correct the function parameter name for iova_to_phys()

 Use the recommended section heading style.

Suggested-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: ab0b572847 ("genpt: Add Documentation/ files")
Fixes: 879ced2bab ("iommupt: Add the AMD IOMMU v1 page table format")
Fixes: 9d4c274cd7 ("iommupt: Add iova_to_phys op")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-07 11:12:03 +01:00
Bagas Sanjaya
e4dfaf25df iommupt: Describe @bitnr parameter
Sphinx reports kernel-doc warnings when making htmldocs:

WARNING: ./drivers/iommu/generic_pt/pt_common.h:361 function parameter 'bitnr' not described in 'pt_test_sw_bit_acquire'
WARNING: ./drivers/iommu/generic_pt/pt_common.h:371 function parameter 'bitnr' not described in 'pt_set_sw_bit_release'

Describe @bitnr to squash them.

Fixes: bcc64b57b4 ("iommupt: Add basic support for SW bits in the page table")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-07 11:07:04 +01:00
Bagas Sanjaya
6573d552e2 Documentation: genpt: Don't use code block marker before iommu_amdv1.c include listing
Stephen Rothwell reports htmldocs warning when merging iommu tree:

Documentation/driver-api/generic_pt.rst:32: WARNING: Literal block expected; none found. [docutils]

This is because of duplicate double colon code block markers: one after
generic_pt/fmt/iommu_amdv1.c and the one in its preceding paragraph. The
resulting htmldocs, however, only marks the include listing (after the
former) up as it should be.

Drop the latter to fix the warning.

Fixes: ab0b572847 ("genpt: Add Documentation/ files")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20251106143925.578e411b@canb.auug.org.au/
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-07 11:07:04 +01:00
Marco Crivellari
13f4d99582 ata: libata-sff: add WQ_PERCPU to alloc_workqueue users
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:

commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
2025-11-07 09:42:36 +01:00
Raghavendra Rao Ananta
0ed3a30fd9 hisi_acc_vfio_pci: Add .match_token_uuid callback in hisi_acc_vfio_pci_migrn_ops
Commit 86624ba3b522 ("vfio/pci: Do vf_token checks for
VFIO_DEVICE_BIND_IOMMUFD") accidentally omitted the
.match_token_uuid callback from the hisi_acc_vfio_pci_migrn_ops struct.
Introduce the missing callback here.

Fixes: 86624ba3b5 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD")
Cc: stable@vger.kernel.org
Suggested-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251031170603.2260022-3-rananta@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-06 14:42:04 -07:00
Raghavendra Rao Ananta
2f03f21fe7 vfio: Fix ksize arg while copying user struct in vfio_df_ioctl_bind_iommufd()
For the cases where user includes a non-zero value in 'token_uuid_ptr'
field of 'struct vfio_device_bind_iommufd', the copy_struct_from_user()
in vfio_df_ioctl_bind_iommufd() fails with -E2BIG. For the 'minsz' passed,
copy_struct_from_user() expects the newly introduced field to be zero-ed,
which would be incorrect in this case.

Fix this by passing the actual size of the kernel struct. If working
with a newer userspace, copy_struct_from_user() would copy the
'token_uuid_ptr' field, and if working with an old userspace, it would
zero out this field, thus still retaining backward compatibility.
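
A sketch of the corrected call (illustrative wrapper; the real ioctl
code differs):

  static int bind_copy_from_user(struct vfio_device_bind_iommufd *bind,
                                 void __user *arg, size_t usize)
  {
          /*
           * ksize must be the full kernel struct size, not minsz: a newer
           * userspace gets token_uuid_ptr copied, an older one gets it
           * zero-filled, and neither case trips the -E2BIG check.
           */
          return copy_struct_from_user(bind, sizeof(*bind), arg, usize);
  }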

Fixes: 86624ba3b5 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD")
Cc: stable@vger.kernel.org
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251031170603.2260022-2-rananta@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-06 14:42:04 -07:00
Randy Dunlap
512c832657 IB/rdmavt: rdmavt_qp.h: clean up kernel-doc comments
Correct the kernel-doc comments format to avoid around 35 kernel-doc
warnings:

- use struct keyword to introduce struct kernel-doc comments
- use correct variable name for some struct members
- use correct function name in comments for some functions
- fix spelling in a few comments
- use a ':' instead of '-' to separate struct members from their
  descriptions
- add a function name heading for rvt_div_mtu()

This leaves one struct member that is not described:
rdmavt_qp.h:206: warning: Function parameter or struct member 'wq'
 not described in 'rvt_krwq'
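
For reference, the kernel-doc shape these fixes converge on (a made-up
struct for illustration):

  /**
   * struct rvt_example - example of the corrected kernel-doc style
   * @lock: protects @head and @tail
   * @head: producer index
   * @tail: consumer index
   *
   * The comment starts with the 'struct' keyword and members use
   * '@name:' so kernel-doc can parse them without warnings.
   */
  struct rvt_example {
          spinlock_t lock;
          u32 head;
          u32 tail;
  };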

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251105045127.106822-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
7196156b0c IB/rdmavt: WQ_PERCPU added to alloc_workqueue users
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
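
A minimal sketch of the intended call-site change, assuming the WQ_PERCPU
flag described above (the driver and workqueue names are illustrative):

   #include <linux/workqueue.h>
   #include <linux/errno.h>

   static struct workqueue_struct *drv_wq;

   static int drv_init_wq(void)
   {
           /* explicitly per-CPU; previously this was the implicit default */
           drv_wq = alloc_workqueue("drv_wq", WQ_MEM_RECLAIM | WQ_PERCPU, 0);
           if (!drv_wq)
                   return -ENOMEM;
           return 0;
   }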

CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-6-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
5267feda50 RDMA/mlx4: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

CC: Yishai Hadas <yishaih@nvidia.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-5-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
5f93287fa9 hfi1: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

CC: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-4-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
e60c5583b6 RDMA/core: WQ_PERCPU added to alloc_workqueue users
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-3-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Marco Crivellari
f673fb3449 RDMA/core: RDMA/mlx5: replace use of system_unbound_wq with system_dfl_wq
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

The old system_unbound_wq will be kept for a few release cycles.
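
A minimal sketch of the replacement, assuming the system_dfl_wq symbol
introduced by the workqueue rework (the work item and handler are
illustrative):

   #include <linux/workqueue.h>

   static void drv_work_fn(struct work_struct *work)
   {
           /* no locality requirement, so let the scheduler pick a CPU */
   }

   static DECLARE_WORK(drv_work, drv_work_fn);

   static void drv_kick(void)
   {
           queue_work(system_dfl_wq, &drv_work);  /* was system_unbound_wq */
   }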

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://patch.msgid.link/20251101163121.78400-2-marco.crivellari@suse.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Jay Bhat
da58d4223b RDMA/irdma: Take a lock before moving SRQ tail in poll_cq
Take the SRQ lock in poll_cq before moving the SRQ tail.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Reviewed-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-7-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-06 02:23:23 -05:00
Longfang Liu
2131c1517f hisi_acc_vfio_pci: adapt to new migration configuration
On new platforms greater than QM_HW_V3, the migration region has been
relocated from the VF to the PF. The VF's own configuration space is
restored to the complete 64KB, and there is no need to divide the
size of the BAR configuration space equally. The driver should be
modified accordingly to adapt to the new hardware device.

On the older hardware platform QM_HW_V3, the live migration configuration
region is placed in the latter 32K portion of the VF's BAR2 configuration
space. On the new hardware platform QM_HW_V4, the live migration
configuration region also exists in the same 32K area immediately following
the VF's BAR2, just like on QM_HW_V3.

However, access to this region is now controlled by hardware. Additionally,
a copy of the live migration configuration region is present in the PF's
BAR2 configuration space. On the new hardware platform QM_HW_V4, when an
older version of the driver is loaded, it behaves like QM_HW_V3 and uses
the configuration region in the VF, ensuring that the live migration
function continues to work normally. When the new version of the driver is
loaded, it directly uses the configuration region in the PF. Meanwhile,
hardware configuration disables the live migration configuration region
in the VF's BAR2: reads return all 0xF values, and writes are silently
ignored.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Link: https://lore.kernel.org/r/20251030015744.131771-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-05 15:20:20 -07:00
Longfang Liu
4868d2d52d crypto: hisilicon - qm updates BAR configuration
On new platforms greater than QM_HW_V3, the configuration region for the
live migration function of the accelerator device is no longer
placed in the VF, but is instead placed in the PF.

Therefore, the configuration region of the live migration function
needs to be opened when the QM driver is loaded. When the QM driver
is uninstalled, the driver needs to clear this configuration.

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20251030015744.131771-2-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-05 14:56:16 -07:00
Chu Guangqing
abe7d63640 vfio/mtty: Fix spelling typo in samples/vfio-mdev
The comment incorrectly used "atleast" instead of "at least".

Signed-off-by: Chu Guangqing <chuguangqing@inspur.com>
Link: https://lore.kernel.org/r/20251015015954.2363-2-chuguangqing@inspur.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-05 12:10:34 -07:00
David Matlack
a52b1a7112 vfio: selftests: Store libvfio build outputs in $(OUTPUT)/libvfio
Store the tools/testing/selftests/vfio/lib outputs (e.g. object files)
in $(OUTPUT)/libvfio rather than in $(OUTPUT)/lib. This is in
preparation for building the VFIO selftests library into the KVM
selftests (see Link below).

Specifically, this will avoid name conflicts between
tools/testing/selftests/{vfio,kvm}/lib and also avoid leaving behind
empty directories under tools/testing/selftests/kvm after a make clean.

Link: https://lore.kernel.org/kvm/20250912222525.2515416-2-dmatlack@google.com/
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20250922224857.2528737-1-dmatlack@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
2025-11-05 12:02:14 -07:00
Jason Gunthorpe
6303c0187f iommupt: Add a kunit test for the SW bits
Add some basic checks that the sw_bit APIs work as expected.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:50:21 +01:00
Jason Gunthorpe
101a285411 iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
Currently an incoherent walk domain cannot be attached to a
coherent-capable iommu. Kevin says HW probably doesn't exist with such a
mixture, but making the driver support it makes logical sense anyhow.

When building the PASID entry the PWSNP (Page Walk Snoop) bit tells the HW
if it should issue snoops. If the page table is cache flushed because of
PT_FEAT_DMA_INCOHERENT then it is fine to set this bit to 0 even if the HW
supports 1.

Weaken the compatible check to permit a coherent instance to accept an
incoherent table and fix the PASID table construction to set PWSNP from
PT_FEAT_DMA_INCOHERENT.

SVA always sets PWSNP.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:50:20 +01:00
Jason Gunthorpe
d373449d8e iommu/vt-d: Use the generic iommu page table
Replace the VT-d iommu_domain implementation of the VT-d second stage and
first stage page tables with the iommupt VTDSS and x86_64
pagetables. x86_64 is shared with the AMD driver.

There are a couple of notable things in VT-d:
- Like AMD, the second stage format is not sign extended; unlike AMD, it
  cannot decode a full 64 bits. The first stage format is a normal
  sign-extended x86 page table
- The HW caps can indicate how many levels, how many address bits and what
  leaf page sizes are supported in HW. As before the highest number of
  levels that can translate the entire supported address width is used.
  The supported page sizes are adjusted directly from the dedicated
  first/second stage cap bits.
- VTD requires flushing 'write buffers'. This logic is left unchanged,
  the write buffer flushes on any gather flush or through iotlb_sync_map.
- Like ARM, VTD has an optional non-coherent page table walker that
  requires cache flushing. This is supported through PT_FEAT_DMA_INCOHERENT
  the same as ARM; however, x86 can't use the DMA API for flushing and
  must call the arch function clflush_cache_range()
- The PT_FEAT_DYNAMIC_TOP can probably be supported on VT-d someday for the
  second stage when it uses 128 bit atomic stores for the HW context
  structures.
- PT_FEAT_VTDSS_FORCE_WRITEABLE is used to work around ERRATA_772415_SPR17
- A kernel command line parameter "sp_off" disables all page sizes except
  4k

Remove all the unused iommu_domain page table code. The debugfs paths have
their own independent page table walker that is left alone for now.

This corrects a race with the non-coherent walker that the ARM
implementations have fixed:

     CPU 0                               CPU 1
  pfn_to_dma_pte()                    pfn_to_dma_pte()
   pte = &parent[offset];
   if (!dma_pte_present(pte)) {
     try_cmpxchg64(&pte->val)
					pte = &parent[offset];
					.. dma_pte_present(pte) ..
				        [...]
					// iommu_map() completes
					// Device does DMA
     domain_flush_cache(pte)

The CPU 1 mapping operation shares a page table level with the CPU 0
mapping operation. CPU 0 installed a new page table level but has not
flushed it yet. CPU1 returns from iommu_map() and the device does DMA. The
non coherent walker fails to see the new table level installed by CPU 0
and fails the DMA with non-present.

The iommupt PT_FEAT_DMA_INCOHERENT implementation uses the ARM design of
storing a flag when CPU 0 completes the flush. If the flag is not set CPU
1 will also flush to ensure the HW can fully walk to the PTE being
installed.

Cc: Tina Zhang <tina.zhang@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:50:19 +01:00
Jason Gunthorpe
ef7bfe5bbf iommupt/x86: Support SW bits and permit PT_FEAT_DMA_INCOHERENT
VT-d requires PT_FEAT_DMA_INCOHERENT for the x86 page table as well,
implement the required SW bits and enable the feature.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:50:19 +01:00
Jason Gunthorpe
1978fac281 iommupt/x86: Set the dirty bit only for writable PTEs
AMD and VTD are historically different here, adopt the VTD version of
setting the D bit only on writable PTEs as it makes more sense.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:50:19 +01:00
Jason Gunthorpe
5448c1558f iommupt: Add the Intel VT-d second stage page table format
The VT-d second stage format is almost the same as the x86 PAE format,
except the bit encodings in the PTE are different and a few new PTE
features, like force coherency, are present.

Among all the formats it is unique in not having a designated present bit.

Comparing the performance of several operations to the existing version:

iommu_map()
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     53,66    ,      50,64      ,  21.21
     2^21,     59,70    ,      56,67      ,  16.16
     2^30,     54,66    ,      52,63      ,  17.17
 256*2^12,    384,524   ,     337,516     ,  34.34
 256*2^21,    387,632   ,     336,626     ,  46.46
 256*2^30,    376,629   ,     323,623     ,  48.48

iommu_unmap()
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     67,86    ,      63,84      ,  25.25
     2^21,     64,84    ,      59,80      ,  26.26
     2^30,     59,78    ,      56,74      ,  24.24
 256*2^12,    216,335   ,     198,317     ,  37.37
 256*2^21,    245,350   ,     232,344     ,  32.32
 256*2^30,    248,345   ,     226,339     ,  33.33

Cc: Tina Zhang <tina.zhang@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:50:17 +01:00
Jason Gunthorpe
efa03dab7c iommupt: Flush the CPU cache after any writes to the page table
Flush the CPU cache for the page table memory after each set of writes to
the page table. The iommu should have visibility to the updated entries as
soon as the map/unmap/etc operations return, like normal coherent hardware
does.

The caches also have to be flushed before any gather can be submitted to
the driver.

Implement the same solution to the race as io-pgtable-arm by using a
software PTE bit to track if a table entry has been flushed or not. If
another thread is still flushing then another concurrent map operation
could return without IOMMU visibility to a required table entry. The SW
bit will tell the second thread to also flush the cache.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:47:45 +01:00
Jason Gunthorpe
aefd967dab iommupt: Use the incoherent start/stop functions for PT_FEAT_DMA_INCOHERENT
This is the first step to supporting an incoherent walker, start and stop
the incoherence around the allocation and frees of the page table memory.

The iommu_pages API maps this to dma_map/unmap_single(), or arch cache
flushing calls.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:47:44 +01:00
Jason Gunthorpe
bcc64b57b4 iommupt: Add basic support for SW bits in the page table
SW bits can be placed on items, including table entries, single OAs and
individual items within a contiguous OA. They are guaranteed to be ignored
by the HW. The API is very basic since the only use case so far is a
single bit.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:47:44 +01:00
Jason Gunthorpe
36ae67b139 iommu/pages: Add support for incoherent IOMMU page table walkers
Some IOMMU HW cannot snoop the CPU cache when it walks the IO page tables.
The CPU is required to flush the cache to make changes visible to the HW.

Provide some helpers from iommu-pages to manage this. The helpers combine
both the ARM and x86 (used in Intel VT-d) versions of the cache flushing
under a single API.

The ARM version uses the DMA API to perform the cache flush on the
assumption that the iommu is using a direct mapping and is already marked
incoherent. The helpers will do the DMA API calls to set things up and
keep track of DMA mapped folios using a bit in the ioptdesc so that
unmapping on error paths is cleaner.

The Intel version just calls the arch cache flush call directly and has no
need to cleanup prior to destruction.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:47:43 +01:00
Jason Gunthorpe
bc5233c090 iommupt: Add a kunit test for the IOMMU implementation
This intends to have high coverage of the page table format functions and
the IOMMU implementation itself, exercising the various corner cases.

The kunit tests can be run in the kunit framework, using commands like:

tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y

There are several interesting corner cases on the 32 bit platforms that
need checking.

Like the generic tests, these are run on the format's configuration list
using kunit "params". This also checks the core iommu parts of the page
table code as it enters the logic through a mock iommu_domain.

The following are checked:
 - PT_FEAT_DYNAMIC_TOP properly adds levels one by one
 - Every page size can be iommu_map()'d, and mapping creates that size
 - iommu_iova_to_phys() works with every page size
 - Test converting OA -> non present -> OA when the two OAs overlap and
   free table levels
 - Test that unmap stops at holes, unmap doesn't split, and unmap returns
   the right values for partial unmap requests
 - Randomly map/unmap. Checks map with random sizes, that map fails when
   hitting collisions doing nothing, unmap/map with random intersections and
   full unmap of random sizes. Also checks iommu_iova_to_phys() with random
   sizes
 - Check for memory leaks by monitoring NR_SECONDARY_PAGETABLE

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:08:58 +01:00
Jason Gunthorpe
2fdf6db436 iommu/amd: Remove AMD io_pgtable support
None of this is used anymore, delete it.

Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:08:57 +01:00
Alejandro Jimenez
789a5913b2 iommu/amd: Use the generic iommu page table
Replace the io_pgtable versions with pt_iommu versions. The v2 page table
uses the x86 implementation that will be eventually shared with VT-d.

This supports the same special features as the original code:
 - increase_top for the v1 format to allow scaling from 3 to 6 levels
 - non-present flushing
 - Dirty tracking for v1 only
 - __sme_set() to adjust the PTEs for CC
 - Optimization for flushing with virtualization to minimize the range
 - amd_iommu_pgsize_bitmap override of the native page sizes
 - page tables are allocated from the device's NUMA node

Rework the domain ops so that v1/v2 get their own ops. Make dedicated
allocation functions for v1 and v2. Hook up invalidation for a top change
to struct pt_iommu_flush_ops. Delete some of the iopgtable related code
that becomes unused in this patch. The next patch will delete the rest of
it.

This fixes a race bug in AMD's increase_address_space() implementation. It
stores the top level and top pointer in different memory, which prevents
other threads from reading a coherent version:

   increase_address_space()   alloc_pte()
                                level = pgtable->mode - 1;
	pgtable->root  = pte;
	pgtable->mode += 1;
                                pte = &pgtable->root[PM_LEVEL_INDEX(level, address)];

The iommupt version is careful to put mode and root under a single
READ_ONCE and then is careful to only READ_ONCE a single time per
walk.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:08:56 +01:00
Jason Gunthorpe
aef5de756e iommupt: Add the x86 64 bit page table format
This is used by x86 CPUs and can be used in AMD/VT-d x86 IOMMUs. When a
x86 IOMMU is running SVA the MM will be using this format.

This implementation follows the AMD v2 io-pgtable version.

There is nothing remarkable here: the format can have 4 or 5 levels and
limited support for different page sizes. There is no contiguous page
support.

x86 uses a sign extension mechanism where the top bits of the VA must
match the sign bit. The core code supports this through
PT_FEAT_SIGN_EXTEND, which creates an upper and lower VA range. All the
new operations will work correctly in both spaces; however, currently there
is no way to report the upper space to other layers. Future patches can
improve that.

In principle this can support 3 page tables levels matching the 32 bit PAE
table format, but no iommu driver needs this. The focus is on the modern
64 bit 4 and 5 level formats.

Comparing the performance of several operations to the existing version:

iommu_map()
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     71,61    ,      66,58      , -13.13
     2^21,     66,60    ,      61,55      , -10.10
     2^30,     59,56    ,      56,54      ,  -3.03
 256*2^12,    392,1360  ,     345,1289    ,  73.73
 256*2^21,    383,1159  ,     335,1145    ,  70.70
 256*2^30,    378,965   ,     331,892     ,  62.62

iommu_unmap()
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     77,71    ,      73,68      ,  -7.07
     2^21,     76,70    ,      70,66      ,  -6.06
     2^30,     69,66    ,      66,63      ,  -4.04
 256*2^12,    225,899   ,     210,870     ,  75.75
 256*2^21,    262,722   ,     248,710     ,  65.65
 256*2^30,    251,643   ,     244,634     ,  61.61

The small -ve values in iommu_unmap() are due to the core code calling
iommu_pgsize() before invoking the domain op. This is unnecessary with this
implementation. Future work optimizes this and gets to 2%, 4%, 3%.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:14 +01:00
Jason Gunthorpe
e93d5945ed iommufd: Change the selftest to use iommupt instead of xarray
The iommufd self test uses an xarray to store the pfns and their orders to
emulate a page table. Make it act more like a real iommu driver by
replacing the xarray with an iommupt based page table. The new AMDv1 mock
format behaves similarly to the xarray.

Add set_dirty() as an iommu_pt operation to allow the test suite to
simulate HW dirty.

Userspace can select between several formats including the normal AMDv1
format and a special MOCK_IOMMUPT_HUGE variation for testing huge page
dirty tracking. To make the dirty tracking test work the page table must
only store exactly 2M huge pages otherwise the logic the test uses
fails. They cannot be broken up or combined.

Aside from aligning the selftest with a real page table implementation,
this helps test the iommupt code itself.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:13 +01:00
Jason Gunthorpe
e5359dcc61 iommupt: Add a mock pagetable format for iommufd selftest to use
The iommufd self test uses an xarray to store the pfns and their orders to
emulate a page table. Slightly modify the amdv1 page table to create a
real page table that has similar properties:

 - 2k base granule to simulate something like a 4k page table on a 64K
   PAGE_SIZE ARM system
 - Contiguous page support for every PFN order
 - Dirty tracking

AMDv1 is the closest format, as it is the only one that already supports
every page size. Tweak it to have only 5 levels and an 11 bit base granule
and compile it separately as a format variant.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:13 +01:00
Jason Gunthorpe
1dd4187f53 iommupt: Add a kunit test for Generic Page Table
This intends to have high coverage of the page table format functions; it
uses the IOMMU implementation to create a tree, which it then walks
through, directly calling the generic page table functions to test them.

It is a good starting point to test a new format header as it is often
able to find typos and inconsistencies much more directly, rather than
with an obscure failure in the iommu implementation.

The tests can be run with commands like:

tools/testing/kunit/kunit.py run --build_dir build_kunit_arm64 --arch arm64 --make_options LLVM=-19 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_uml --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_WERROR=n
tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386 --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig
tools/testing/kunit/kunit.py run --build_dir build_kunit_i386pae --arch i386 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig --kconfig_add CONFIG_X86_PAE=y

There are several interesting corner cases on the 32 bit platforms that
need checking.

The format can declare a list of configurations that initialize the page
table differently, for instance with different
top levels or other parameters. The kunit will turn these into "params"
which cause each test to run multiple times.

The tests are repeated to run at every table level to check that all the
item encoding formats work.

The following are checked:
 - Basic init works for each configuration
 - The various log2 functions have the expected behavior at the limits
 - pt_compute_best_pgsize() works
 - pt_table_pa() reads back what pt_install_table() writes
 - range.max_vasz_lg2 works properly
 - pt_table_oa_lg2sz() and pt_table_item_lg2sz() use a contiguous
   non-overlapping set of bits from the VA up to the defined max_va
 - pt_possible_sizes() and pt_can_have_leaf() produces a sensible layout
 - pt_item_oa(), pt_entry_oa(), and pt_entry_num_contig_lg2() read back
   what pt_install_leaf_entry() writes
 - pt_clear_entry() works
 - pt_attr_from_entry() reads back what pt_iommu_set_prot() &
   pt_install_leaf_entry() writes
 - pt_entry_set_write_clean(), pt_entry_make_write_dirty(), and
   pt_entry_write_is_dirty() work

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:11 +01:00
Jason Gunthorpe
4a00f94348 iommupt: Add read_and_clear_dirty op
IOMMU HW now supports updating a dirty bit in an entry when a DMA writes
to the entry's VA range. iommufd has a uAPI to read and clear the dirty
bits from the tables.

This is a trivial recursive descent algorithm to read and optionally clear
the dirty bits. The format needs a function to tell if a contiguous entry
is dirty, and a function to clear a contiguous entry back to clean.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:11 +01:00
Jason Gunthorpe
dcd6a011a8 iommupt: Add map_pages op
map is slightly complicated because it has to handle a number of special
edge cases:
 - Overmapping a previously shared, but now empty, table level with an OA.
   Requires validating and freeing the possibly empty tables
 - Doing the above across an entire to-be-created contiguous entry
 - Installing a new shared table level concurrently with another thread
 - Expanding the table by adding more top levels

Table expansion is a unique feature of AMDv1; this version is quite
similar, except we handle racing concurrent lockless map. The table top
pointer and starting level are encoded in a single uintptr_t which ensures
we can READ_ONCE() without tearing. Any op will do the READ_ONCE() and use
that fixed point as its starting point. Concurrent expansion is handled
with a table global spinlock.
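
A minimal sketch of the packing idea, not the iommupt code itself; it
assumes table allocations are at least 8-byte aligned so the low bits of
the pointer can carry the starting level:

   #include <linux/types.h>
   #include <linux/compiler.h>

   #define TOP_LEVEL_MASK  0x7UL   /* up to 6 levels fit in 3 bits */

   static inline uintptr_t pack_top(void *table, unsigned int level)
   {
           return (uintptr_t)table | (level & TOP_LEVEL_MASK);
   }

   static inline void *top_table(uintptr_t packed)
   {
           return (void *)(packed & ~TOP_LEVEL_MASK);
   }

   static inline unsigned int top_level(uintptr_t packed)
   {
           return packed & TOP_LEVEL_MASK;
   }

   /* one READ_ONCE() yields an untorn {table, level} pair for a walk */
   static inline uintptr_t read_top(const uintptr_t *topp)
   {
           return READ_ONCE(*topp);
   }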

When inserting a new table entry map checks that the entire portion of the
table is empty. This includes freeing any empty lower tables that will be
overwritten by an OA. A separate free list is used while checking and
collecting all the empty lower tables so that writing the new entry is
uninterrupted, either the new entry fully writes or nothing changes.

A special fast path for PAGE_SIZE is implemented that does a direct walk
to the leaf level and installs a single entry. This gives ~15% improvement
for iommu_map() when mapping lists of single pages.

This version sits under the iommu_domain_ops as map_pages() but does not
require the external page size calculation. The implementation is actually
map_range() and can do arbitrary ranges, internally handling all the
validation and supporting any arrangement of page sizes. A future series
can optimize iommu_map() to take advantage of this.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:10 +01:00
Jason Gunthorpe
7c53f4238a iommupt: Add unmap_pages op
unmap_pages removes mappings and any fully contained interior tables from
the given range. This follows the now-standard iommu_domain API definition
where it does not split up larger page sizes into smaller. The caller must
perform unmap only on ranges created by map or it must have somehow
otherwise determined safe cut points (eg iommufd/vfio use iova_to_phys to
scan for them)

A future work will provide 'cut' which explicitly does the page size split
if the HW can support it.

unmap is implemented with a recursive descent of the tree. If the caller
provides a VA range that spans an entire table item then the table memory
can be freed as well.

If an entire table item can be freed then this version will also check the
leaf-only level of the tree to ensure that all entries are present to
generate -EINVAL. Many of the existing drivers don't do this extra check.

This version sits under the iommu_domain_ops as unmap_pages() but does not
require the external page size calculation. The implementation is actually
unmap_range() and can do arbitrary ranges, internally handling all the
validation and supporting any arrangement of page sizes. A future series
can optimize __iommu_unmap() to take advantage of this.

Freed page table memory is batched up in the gather and will be freed in
the driver's iotlb_sync() callback after the IOTLB flush completes.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:10 +01:00
Jason Gunthorpe
9d4c274cd7 iommupt: Add iova_to_phys op
iova_to_phys is a performance path for the DMA API and iommufd, implement
it using an unrolled get_user_pages() like function waterfall scheme.

The implementation itself is fairly trivial.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:09 +01:00
Jason Gunthorpe
879ced2bab iommupt: Add the AMD IOMMU v1 page table format
AMD IOMMU v1 is unique in supporting contiguous pages with a variable size
and it can decode the full 64 bit VA space. Unlike other x86 page tables
this explicitly does not do sign extension as part of allowing the entire
64 bit VA space to be supported.

The general design is quite similar to the x86 PAE format, except with a
6th level and quite different PTE encoding.

This format is the only one that uses the PT_FEAT_DYNAMIC_TOP feature in
the existing code as the existing AMDv1 code starts out with a 3 level
table and adds levels on the fly if more IOVA is needed.

Comparing the performance of several operations to the existing version:

iommu_map()
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     65,64    ,      62,61      ,  -1.01
     2^13,     70,66    ,      67,62      ,  -8.08
     2^14,     73,69    ,      71,65      ,  -9.09
     2^15,     78,75    ,      75,71      ,  -5.05
     2^16,     89,89    ,      86,84      ,  -2.02
     2^17,    128,121   ,     124,112     , -10.10
     2^18,    175,175   ,     170,163     ,  -4.04
     2^19,    264,306   ,     261,279     ,   6.06
     2^20,    444,525   ,     438,489     ,  10.10
     2^21,     60,62    ,      58,59      ,   1.01
 256*2^12,    381,1833  ,     367,1795    ,  79.79
 256*2^21,    375,1623  ,     356,1555    ,  77.77
 256*2^30,    356,1338  ,     349,1277    ,  72.72

iommu_unmap()
   pgsz  ,avg new,old ns, min new,old ns  , min % (+ve is better)
     2^12,     76,89    ,      71,86      ,  17.17
     2^13,     79,89    ,      75,86      ,  12.12
     2^14,     78,90    ,      74,86      ,  13.13
     2^15,     82,89    ,      74,86      ,  13.13
     2^16,     79,89    ,      74,86      ,  13.13
     2^17,     81,89    ,      77,87      ,  11.11
     2^18,     90,92    ,      87,89      ,   2.02
     2^19,     91,93    ,      88,90      ,   2.02
     2^20,     96,95    ,      91,92      ,   1.01
     2^21,     72,88    ,      68,85      ,  20.20
 256*2^12,    372,6583  ,     364,6251    ,  94.94
 256*2^21,    398,6032  ,     392,5758    ,  93.93
 256*2^30,    396,5665  ,     389,5258    ,  92.92

The ~5-17x speedup when working with multi-PTE map/unmaps is because the
AMD implementation rewalks the entire table on every new PTE while this
version retains its position. The same speedup will be seen with dirty
tracking as well.

The old implementation triggers a compiler optimization that ends up
generating a "rep stos" memset for contiguous PTEs. Since AMD can have
contiguous PTEs that span 2Kbytes of table this is a huge win compared to
a normal movq loop. It is why the unmap side has a fairly flat runtime as
the contiguous PTE size increases. This version makes it explicit with a
memset64() call.
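
A minimal sketch of the explicit fill, with hypothetical names; the point
is one wide memset64() call instead of a per-entry write loop:

   #include <linux/string.h>
   #include <linux/types.h>

   static void fill_contig_ptes(u64 *ptep, u64 pte_val, size_t nr_contig)
   {
           /* write nr_contig identical PTEs in one call */
           memset64(ptep, pte_val, nr_contig);
   }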

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:08 +01:00
Jason Gunthorpe
cdb39d9185 iommupt: Add the basic structure of the iommu implementation
The existing IOMMU page table implementations duplicate all of the working
algorithms for each format. By using the generic page table API a single C
version of the IOMMU algorithms can be created and re-used for all of the
different formats used in the drivers. The implementation will provide a
single C version of the iommu domain operations: iova_to_phys, map, unmap,
and read_and_clear_dirty.

Further, adding new algorithms and techniques becomes easy to do across
the entire fleet of drivers and formats.

The C functions are drop in compatible with the existing iommu_domain_ops
using the IOMMU_PT_DOMAIN_OPS() macro. Each per-format implementation
compilation unit will produce exported symbols following the pattern
pt_iommu_FMT_map_pages() which the macro directly maps to the
iommu_domain_ops members. This avoids the additional function pointer
indirection like io-pgtable has.

The top level struct used by the drivers is pt_iommu_table_FMT. It
contains the other structs to allow container_of() to move between the
driver, iommu page table, generic page table, and generic format layers.

   struct pt_iommu_table_amdv1 {
       struct pt_iommu {
	      struct iommu_domain domain;
       } iommu;
       struct pt_amdv1 {
	      struct pt_common common;
       } amdpt;
   };

The driver is expected to union the pt_iommu_table_FMT with its own
existing domain struct:

   struct driver_domain {
       union {
	       struct iommu_domain domain;
	       struct pt_iommu_table_amdv1 amdv1;
       };
   };
   PT_IOMMU_CHECK_DOMAIN(struct driver_domain, amdv1, domain);

This creates an alias to avoid renaming 'domain' in a lot of driver code.

This allows all the layers to access all the necessary functions to
implement their different roles with no change to any of the existing
iommu core code.

Implement the basic starting point: pt_iommu_init(), get_info() and
deinit().

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:07 +01:00
Jason Gunthorpe
ab0b572847 genpt: Add Documentation/ files
Add some general description and pull in the kdoc comments from the source
file to index most of the useful functions.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:07 +01:00
Jason Gunthorpe
7c5b184db7 genpt: Generic Page Table base API
The generic API is intended to be separated from the implementation of
page table algorithms. It contains only accessors for walking and
manipulating the table and helpers that are useful for building an
implementation. Memory management is not in the generic API, but part of
the implementation.

Using a multi-compilation approach the implementation module would include
headers in this order:

  common.h
  defs_FMT.h
  pt_defs.h
  FMT.h
  pt_common.h
  IMPLEMENTATION.h

Where each compilation unit would have a combination of FMT and
IMPLEMENTATION to produce a per-format per-implementation module.

The API is designed so that the format headers have minimal logic, and
default implementations are provided if the format doesn't include one.

Generally formats provide their code via an inline function using the
pattern:

  static inline FMTpt_XX(..) {}
  #define pt_XX FMTpt_XX

The common code then enforces a function signature so that there is no
drift in function arguments, or accidental polymorphic functions (as has
been slightly troublesome in mm). Use of function-like #defines is
avoided in the format even though many of the functions are small enough.
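
A minimal concrete instance of the pattern, using a hypothetical format
prefix and accessor name; the generic layer is what fixes the real
signatures:

   #include <linux/types.h>
   #include <linux/bits.h>

   static inline bool examplept_entry_is_present(u64 entry)
   {
           return entry & BIT_ULL(0);      /* hypothetical present bit */
   }
   #define pt_entry_is_present examplept_entry_is_present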

Provide kdocs for the API surface.

This is enough to implement the 8 initial format variations with all of
their features:
 * Entries comprised of contiguous blocks of IO PTEs for larger page
   sizes (AMDv1, ARMv8)
 * Multi-level tables, up to 6 levels. Runtime selected top level
 * The size of the top table level can be selected at runtime (ARM's
   concatenated tables)
 * The number of levels in the table can optionally increase dynamically
   during map (AMDv1)
 * Optional leaf entries at any level
 * 32 bit/64 bit virtual and output addresses, using every bit
 * Sign extended addressing (x86)
 * Dirty tracking

A basic format takes about 200 lines to declare the required inline
functions.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Tested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-05 09:07:04 +01:00
Songtang Liu
a0c7005333 iommu/amd: Fix potential out-of-bounds read in iommu_mmio_show
iommu_mmio_write() validates the user-provided offset with the
check `iommu->dbg_mmio_offset > iommu->mmio_phys_end - 4`.
This assumes a 4-byte access. However, the corresponding
show handler, iommu_mmio_show(), uses readq() to perform an 8-byte
(64-bit) read.

If a user provides an offset equal to `mmio_phys_end - 4`, the check
passes, which leads to a 4-byte out-of-bounds read.

Fix this by adjusting the boundary check to use sizeof(u64), which
corresponds to the size of the readq() operation.
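
A minimal sketch of the corrected bound check, assuming the show handler
reads 8 bytes with readq() (the names are illustrative, not the exact
driver code):

   #include <linux/errno.h>
   #include <linux/types.h>

   static int validate_mmio_offset(u64 offset, u64 mmio_phys_end)
   {
           /* readq() touches offset..offset+7, so bound by sizeof(u64) */
           if (offset > mmio_phys_end - sizeof(u64))
                   return -EINVAL;
           return 0;
   }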

Fixes: 7a4ee419e8 ("iommu/amd: Add debugfs support to dump IOMMU MMIO registers")
Signed-off-by: Songtang Liu <liusongtang@bytedance.com>
Reviewed-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Tested-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-11-04 11:06:00 +01:00
Dave Jiang
d6602e2581 cxl/region: Add support to indicate region has extended linear cache
Add a region sysfs attribute to show the size of the extended linear
cache if there is any. The attribute is invisible when the cache
size is 0, which indicates it does not exist.

Moved the cxl_region_visible() location in order to pick up the
new sysfs attribute definition.
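
A minimal sketch of the sysfs visibility pattern described above, with
illustrative names rather than the actual cxl_region code:

   #include <linux/device.h>
   #include <linux/sysfs.h>
   #include <linux/types.h>

   /* hypothetical backing value, stands in for the region's cache size */
   static u64 example_cache_size;

   static ssize_t extended_linear_cache_size_show(struct device *dev,
                                                  struct device_attribute *attr,
                                                  char *buf)
   {
           return sysfs_emit(buf, "%llu\n", example_cache_size);
   }
   static DEVICE_ATTR_RO(extended_linear_cache_size);

   static umode_t example_region_visible(struct kobject *kobj,
                                         struct attribute *a, int n)
   {
           /* hide the attribute entirely when there is no extended cache */
           if (a == &dev_attr_extended_linear_cache_size.attr &&
               example_cache_size == 0)
                   return 0;
           return a->mode;
   }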

[ dj: Fixed spelling errors noted by Benjamin ]

Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251022203052.4078527-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:29:00 -07:00
Dave Jiang
f0c5d3bc28 cxl: Adjust extended linear cache failure emission in cxl_acpi
The cxl_acpi module spams "Extended linear cache calculation failed"
when the hmat memory target is not found for a node. This is normal
when the memory target does not contain extended linear cache
attributes. Adjust cxl_acpi_set_cache_size() to just return 0 if an error
is returned from hmat_get_extended_linear_cache_size(); -ENOENT is the
only error that function returns.

Also remove the check for -EOPNOTSUPP in cxl_setup_extended_linear_cache()
since that errno is never returned by cxl_acpi_set_cache_size().

[dj: Flipped minor return logic suggested by Jonathan ]
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251003185509.3215900-1-dave.jiang@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:28:59 -07:00
Alison Schofield
06377c54a1 cxl/test: Add cxl_translate module for address translation testing
Add a loadable test module that validates CXL address translation
calculations using parameterized test vectors. The module tests both
host-to-device and device-to-host address translations for Modulo and
XOR interleave arithmetic.

Two types of testing are provided:

1. Parameterized test vectors:
   Test vectors are passed as module parameters in the format:
	"dpa pos r_eiw r_eig hb_ways math expected_spa".
   Round-trip validation is performed:
   - Translate a DPA and position to a SPA
   - Verify the result matches expected SPA
   - Translate that SPA back to a DPA and position
   - Verify round-trip consistency

2. Internal validation testing:
   When no test vectors are provided, the module performs validation
   of the translation functions by checking parameter boundaries and
   running 10,000 iterations of randomly generated valid parameters
   to exercise the core calculation functions.

The module uses the CXL Driver translation functions through symbols
exported exclusively for cxl_translate.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:27:32 -07:00
Alison Schofield
4fe516d2ad cxl/acpi: Make the XOR calculations available for testing
In preparation for adding a test module that can exercise the address
translation functions performed by the CXL Driver, refactor the XOR
implementation like this:

- Extract the core calculation into a standalone helper function,
- Export the new function for use by test module cxl_translate only,
- Enhance the parameter validation since this new function will be
  called from a test module with no guarantee of valid parameters,
- Move the define of struct cxl_cxims_data to cxl.h so the test module
  can build xormaps.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:27:32 -07:00
Alison Schofield
b78b9e7b79 cxl/region: Refactor address translation funcs for testing
In preparation for adding a test module that exercises the address
translation calculations, extract the core calculations into stand-
alone functions that operate on base parameters without dependencies
on struct cxl_region.

Perform additional parameter validation to protect against a test
module sending bad parameters. Export the validation function, as
well as the three core translation functions for use by test module
cxl_translate only.

This refactoring enables unit testing of the address translation logic
with controlled inputs, while preserving identical functionality in
the existing code paths.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:27:32 -07:00
Marco Crivellari
952e9057e6 cxl/pci: replace use of system_wq with system_percpu_wq
Currently, if a user enqueues a work item using schedule_delayed_work(),
the wq used is "system_wq" (per-cpu wq), while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again uses
WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_wq should be the per-cpu workqueue, yet in this name nothing makes
that clear, so replace system_wq with system_percpu_wq.

The old wq (system_wq) will be kept for a few release cycles.

See 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
for cause of changes.

[ dj: Add reference to commit that initiated the change. ]

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20251030163839.307752-1-marco.crivellari@suse.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:16:03 -07:00
Alok Tiwari
040acb49bf cxl: fix typos in cdat.c comments
- Corrected spelling of "bandwdith" -> "bandwidth"
- Fixed "wht" -> "with"

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:16:03 -07:00
Li Ming
3f5b8f7f34 cxl/port: Remove devm_cxl_port_enumerate_dports()
devm_cxl_port_enumerate_dports() is no longer used after
commit 4f06d81e7c ("cxl: Defer dport allocation for switch ports").

Delete it and the relevant interface implemented in cxl_test.

Signed-off-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:16:02 -07:00
Gregory Price
82b5d7e30b Documentation/driver-api/cxl: remove page-allocator quirk section
The node/zone quirk section of the cxl documentation is incorrect.
The actual reason for fallback allocation misbehavior in the
described configuration is a kswapd/reclaim thrashing scenario
fixed by the linked patch. Remove this section.

Link: https://lore.kernel.org/linux-mm/20250919162134.1098208-1-hannes@cmpxchg.org/
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-11-03 09:16:02 -07:00
Abel Vesa
617937d4d5 iommu/arm-smmu-qcom: Add Glymur MDSS compatible
Add the Glymur DPU compatible to clients compatible list, as it needs
the workarounds.

Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
Reviewed-by: Bjorn Andersson <andersson@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 15:55:59 +00:00
Jingyi Wang
45859c059c dt-bindings: arm-smmu: Add compatible for Kaanapali and Glymur SoCs
Qualcomm Kaanapali and Glymur SoCs include an apps SMMU that implements
arm,mmu-500, which is used to translate device-visible virtual addresses
to physical addresses. Add compatible strings for these SoCs.

Co-developed-by: Pankaj Patil <pankaj.patil@oss.qualcomm.com>
Signed-off-by: Pankaj Patil <pankaj.patil@oss.qualcomm.com>
Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 15:55:07 +00:00
Jay Bhat
0a19274555 RDMA/irdma: CQ size and shadow update changes for GEN3
CQ shadow area should not be updated at the end of a page (once every
64th CQ entry), except when CQ has no more CQEs. SW must also increase
the requested CQ size by 1 and make sure the CQ is not exactly one page
in size. This is to address a quirk in the hardware.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-4-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02 06:52:58 -05:00
Jay Bhat
cd84d8001e RDMA/irdma: Silently consume unsignaled completions
In case we get an unsignaled error completion, we silently consume the CQE by
pretending the QP does not exist. Without this, bookkeeping for signaled
completions does not work correctly.

Signed-off-by: Jay Bhat <jay.bhat@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-5-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02 06:52:51 -05:00
Jay Bhat
153243086e RDMA/irdma: Initialize cqp_cmds_info to prevent resource leaks
Failure to initialize the info.create field to false in certain cases
resulted in an incorrect status code going to rdma-core when dereg_mr
failed during reset. To fix this, memset the entire cqp_request->info
in the irdma_alloc_and_get_cqp_request() function, so that this is not
spread all over the code.

Signed-off-by: Bhat, Jay <jay.bhat@intel.com>
Reviewed-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Krzysztof Czurylo <krzysztof.czurylo@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-2-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02 06:52:44 -05:00
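
A minimal sketch of the approach with stand-in types (the real irdma structs and allocator carry far more state):

#include <linux/string.h>
#include <linux/types.h>

struct example_cqp_cmds_info { bool create; /* ... */ };
struct example_cqp_request  { struct example_cqp_cmds_info info; };

/* Zero the whole info struct once in the allocator instead of relying on
 * every call site to initialize fields such as info.create. */
static void example_init_cqp_request(struct example_cqp_request *req)
{
	memset(&req->info, 0, sizeof(req->info));
}
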
Jacob Moroni
69e8e429bc RDMA/irdma: Enforce local fence for LOCAL_INV WRs
Enforce local fence for LOCAL_INV WRs to
avoid spurious FASTREG_VALID_MKEY async events
during heavy invalidation/registration activity.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Link: https://patch.msgid.link/20251031021726.1003-3-tatyana.e.nikolova@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-11-02 06:52:33 -05:00
Zhu Yanjun
503a5e4690 RDMA/rxe: Fix null deref on srq->rq.queue after resize failure
A NULL pointer dereference can occur in rxe_srq_chk_attr() when
ibv_modify_srq() is invoked twice in succession under certain error
conditions. The first call may fail in rxe_queue_resize(), which leads
rxe_srq_from_attr() to set srq->rq.queue = NULL. The second call then
triggers a crash (null deref) when accessing
srq->rq.queue->buf->index_mask.

Call Trace:
<TASK>
rxe_modify_srq+0x170/0x480 [rdma_rxe]
? __pfx_rxe_modify_srq+0x10/0x10 [rdma_rxe]
? uverbs_try_lock_object+0x4f/0xa0 [ib_uverbs]
? rdma_lookup_get_uobject+0x1f0/0x380 [ib_uverbs]
ib_uverbs_modify_srq+0x204/0x290 [ib_uverbs]
? __pfx_ib_uverbs_modify_srq+0x10/0x10 [ib_uverbs]
? tryinc_node_nr_active+0xe6/0x150
? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs]
ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x2c0/0x470 [ib_uverbs]
? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
? uverbs_fill_udata+0xed/0x4f0 [ib_uverbs]
ib_uverbs_run_method+0x55a/0x6e0 [ib_uverbs]
? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs]
ib_uverbs_cmd_verbs+0x54d/0x800 [ib_uverbs]
? __pfx_ib_uverbs_cmd_verbs+0x10/0x10 [ib_uverbs]
? __pfx___raw_spin_lock_irqsave+0x10/0x10
? __pfx_do_vfs_ioctl+0x10/0x10
? ioctl_has_perm.constprop.0.isra.0+0x2c7/0x4c0
? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10
ib_uverbs_ioctl+0x13e/0x220 [ib_uverbs]
? __pfx_ib_uverbs_ioctl+0x10/0x10 [ib_uverbs]
__x64_sys_ioctl+0x138/0x1c0
do_syscall_64+0x82/0x250
? fdget_pos+0x58/0x4c0
? ksys_write+0xf3/0x1c0
? __pfx_ksys_write+0x10/0x10
? do_syscall_64+0xc8/0x250
? __pfx_vm_mmap_pgoff+0x10/0x10
? fget+0x173/0x230
? fput+0x2a/0x80
? ksys_mmap_pgoff+0x224/0x4c0
? do_syscall_64+0xc8/0x250
? do_user_addr_fault+0x37b/0xfe0
? clear_bhb_loop+0x50/0xa0
? clear_bhb_loop+0x50/0xa0
? clear_bhb_loop+0x50/0xa0
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 8700e3e7c4 ("Soft RoCE driver")
Tested-by: Liu Yi <asatsuyu.liu@gmail.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Link: https://patch.msgid.link/20251027215203.1321-1-yanjun.zhu@linux.dev
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-28 04:04:07 -04:00
Johan Hovold
fbcf539840 iommu: tegra: enable compile testing
There seems to be nothing preventing this driver from being compile
tested so enable that to increase build coverage.

Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 15:09:55 +01:00
Johan Hovold
4c7ed7fef7 amba: tegra-ahb: enable compile testing
There seems to be nothing preventing this driver from being compile
tested so enable that to increase build coverage.

This also allows for compile testing of further Tegra drivers (e.g.
IOMMU driver).

Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 15:09:55 +01:00
Nicolin Chen
fd714986e4 iommu: Pass in old domain to attach_dev callback functions
The IOMMU core attaches each device to a default domain on probe(). Every
new "attach" operation is then fundamentally two-fold:
 - detach from the currently attached (old) domain
 - attach to the given new domain

Modern IOMMU drivers following this pattern usually want to clean up state
tied to the old domain, so they call iommu_get_domain_for_dev() to fetch it.

Pass the old domain pointer from the core to drivers, aligning with the
set_dev_pasid op that does so already.

Ensure all low-level attach functions in the core can forward the correct
old domain pointer, and rework those functions accordingly.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:35 +01:00
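
The shape of the change shows up in the powerpc hunk further down; here is a simplified sketch with an illustrative callback name (only the signature change is taken from this series):

#include <linux/iommu.h>

/* Before: the driver fetched the old domain itself. */
static int example_attach_dev(struct iommu_domain *domain, struct device *dev)
{
	struct iommu_domain *old = iommu_get_domain_for_dev(dev);

	if (!old)
		return -EINVAL;
	/* ... tear down state tied to 'old', then attach 'domain' ... */
	return 0;
}

/* After: the core passes the old domain in, as it already does for set_dev_pasid. */
static int example_attach_dev_new(struct iommu_domain *domain,
				  struct device *dev,
				  struct iommu_domain *old)
{
	/* ... tear down state tied to 'old', then attach 'domain' ... */
	return 0;
}
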
Nicolin Chen
2b33598e66 iommu: Do not revert set_domain for the last gdev
The last gdev is the device that failed __iommu_device_set_domain(), so it
doesn't need to be reverted: it is still attached to group->domain.

This is not a problem currently, since the revert is simply a re-attach.
However, the core will need to pass the old domain to
__iommu_device_set_domain(), and the old domain pointer would then be
inconsistent between the failed device and all its prior, succeeded devices,
as all the prior devices need to be reverted.

Avoid the re-attach for the last gdev by breaking out before the revert.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:35 +01:00
Nicolin Chen
c21b34762e iommu/amd: Set release_domain to blocked_domain
The set_dev_pasid op for a release domain never gets called anyway, so there
is no point in defining a release_domain separate from the blocked_domain.

Simply reuse the blocked_domain.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:35 +01:00
Nicolin Chen
680a6a60fc iommu/exynos-iommu: Set release_domain to exynos_identity_domain
Following an upcoming core change that passes the old domain pointer into the
attach_dev op and its callbacks, exynos_iommu_identity_attach() will need
this new argument too, which the release_device op doesn't provide.

Instead, the core provides a release_domain to attach to the device prior
to invoking the release_device callback. Thus, simply use that.

Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:35 +01:00
Nicolin Chen
52f77fb176 iommu/arm-smmu-v3: Set release_domain to arm_smmu_blocked_domain
Since the core now takes care of the require_direct case for the release
domain, simply use that via the release_domain op.

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:34 +01:00
Jason Gunthorpe
e94160488e iommu: Generic support for RMRs during device release
Generally an IOMMU driver should leave the translation as BLOCKED until the
translation entry is probed onto a struct device. When the struct device is
removed, the translation should be put back to BLOCKED.

Drivers that are able to work like this can set their release_domain to the
blocking domain, and the core code handles this work.

The exception is when the device has an IOMMU_RESV_DIRECT region, in which
case the OS should continuously allow translations for the given range, and
the core code generally prevents using a BLOCKED domain with such a device.

Continue this logic for the device release and hoist some open coding from
drivers. If the device has dev->iommu->require_direct and the driver uses a
BLOCKED release_domain, override it to IDENTITY to preserve the semantics.

The only remaining required driver code for IOMMU_RESV_DIRECT should preset
an IDENTITY translation during early IOMMU startup for those devices.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:55:34 +01:00
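
A hedged sketch of the override described above; the helper is illustrative, and the actual core implementation differs:

#include <linux/device.h>
#include <linux/iommu.h>

static struct iommu_domain *
example_pick_release_domain(struct device *dev, const struct iommu_ops *ops)
{
	struct iommu_domain *release = ops->release_domain;

	/* A BLOCKED release domain would cut off an IOMMU_RESV_DIRECT range,
	 * so fall back to IDENTITY for devices that require direct mappings. */
	if (dev->iommu && dev->iommu->require_direct &&
	    release == ops->blocked_domain)
		return ops->identity_domain;

	return release;
}
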
Zhengnan Chen
5a70826910 iommu/mediatek: mt8189: Add MM IOMMUs support
Add support for mt8189 MM IOMMUs.

Signed-off-by: Zhengnan Chen <zhengnan.chen@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:42:50 +01:00
Zhengnan Chen
f4b97d3469 iommu/mediatek: mt8189: Add INFRA IOMMUs support
Add support for mt8189 INFRA IOMMUs.

Signed-off-by: Zhengnan Chen <zhengnan.chen@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:42:49 +01:00
Zhengnan Chen
7e87226700 iommu/mediatek: mt8189: Add APU IOMMUs support
Add support for mt8189 APU IOMMUs.

Signed-off-by: Zhengnan Chen <zhengnan.chen@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:42:49 +01:00
Zhengnan Chen
97c7b66e36 iommu/mediatek: Add a flag DL_WITH_MULTI_LARB
Add a DL_WITH_MULTI_LARB flag to support HW that connects to multiple
larbs, in preparation for mt8189. In mt8189 the display connects to
larb1 and larb2 at the same time, so we should add device links between
the display device and both larbs.

Signed-off-by: Zhengnan Chen <zhengnan.chen@mediatek.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Reviewed-by: Yong Wu <yong.wu@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:42:49 +01:00
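
A rough sketch of the idea (not the driver's code; the device-link flags here are an assumption):

#include <linux/device.h>
#include <linux/errno.h>

static int example_link_consumer_to_larbs(struct device *consumer,
					  struct device **larbs, int nr_larbs)
{
	int i;

	/* One device link per larb the consumer sits behind. */
	for (i = 0; i < nr_larbs; i++)
		if (!device_link_add(consumer, larbs[i],
				     DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS))
			return -ENODEV;

	return 0;
}
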
Zhengnan Chen
812df545e3 dt-bindings: mediatek: mt8189: Add bindings for MM & APU & INFRA IOMMU
There are three IOMMUs in total, namely MM_IOMMU, APU_IOMMU and INFRA_IOMMU.
Add bindings for them.

Signed-off-by: Zhengnan Chen <zhengnan.chen@mediatek.com>
Reviewed-by: Matthias Brugger <matthias.bgg@gmail.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 13:42:48 +01:00
Pedro Demarchi Gomes
db340b02b2 iommu/pages: use folio_nr_pages() instead of shift operation
folio_nr_pages() is a faster helper function to get the number of pages when
NR_PAGES_IN_LARGE_FOLIO is enabled.

Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
2025-10-27 12:46:57 +01:00
Håkon Bugge
58aca1f3de RDMA/cm: Base cm_id destruction timeout on CMA values
When a GSI MAD packet is sent on the QP, it will potentially be
retried CMA_MAX_CM_RETRIES times with a timeout value of:

    4.096usec * 2 ^ CMA_CM_RESPONSE_TIMEOUT

The above equates to ~64 seconds using the default CMA values.

The cm_id_priv's refcount will be incremented for this period.
Therefore, the timeout used when waiting for cm_id destruction must be
based on the effective timeout of MAD packets.  To provide additional
leeway, add 25% to this timeout and use that instead of the constant
10-second timeout, which may result in false negatives.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://patch.msgid.link/20251021132738.4179604-1-haakon.bugge@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-27 07:36:39 -04:00
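
Worked out with assumed defaults of CMA_CM_RESPONSE_TIMEOUT == 20 and CMA_MAX_CM_RETRIES == 15 (check drivers/infiniband/core/cma.c for the current values), the numbers line up with the ~64 seconds quoted above:

#include <stdio.h>

int main(void)
{
	const unsigned int response_timeout = 20;	/* assumed CMA_CM_RESPONSE_TIMEOUT */
	const unsigned int max_retries = 15;		/* assumed CMA_MAX_CM_RETRIES */

	double per_try = 4.096e-6 * (double)(1u << response_timeout);	/* ~4.29 s */
	double total = per_try * max_retries;				/* ~64.4 s */
	double with_leeway = total * 1.25;				/* +25%, ~80.6 s */

	printf("per try %.1f s, total %.1f s, destroy timeout %.1f s\n",
	       per_try, total, with_leeway);
	return 0;
}
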
Rob Herring (Arm)
095d495cb8 dt-bindings: ata: snps,dwc-ahci: Allow 'iommus' property
The AMD Seattle DWC AHCI is behind an IOMMU and has 1-3 entries, so add
the 'iommus' property. There's not a specific compatible, so we can't
limit it to Seattle.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
2025-10-23 14:29:41 +02:00
Thorsten Blum
4ea303d9e9 ata: pata_it821x: Replace deprecated strcpy with strscpy in it821x_display_disk
strcpy() is deprecated; use strscpy() instead.

Replace the hard-coded buffer size 8 with sizeof(mbuf) when using
snprintf() while we're at it.

Link: https://github.com/KSPP/linux/issues/88
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
2025-10-23 14:23:40 +02:00
Randy Dunlap
be180c847a RDMA/uverbs: fix some kernel-doc warnings
Fix 49 kernel-doc warnings in ib_verbs.h:

- Add struct short description for rdma_stat_desc, rdma_hw_stats.
- Fix kernel-doc format for struct members (use ':' instead of '-') for
  several structs.
- Don't use "/**" kernel-doc notation for struct members in ib_device_ops
  (most members are not documented and most of the kernel-doc was
  not formatted correctly).
- Spell function parameters correctly in ib_dma_map_sgtable_attrs(),
  ib_device_try_get(), rdma_roce_rescan_device().
- Add kernel-doc for the function parameter in
  rdma_flow_label_to_udp_sport().

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://patch.msgid.link/20251020034320.3011094-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-20 11:45:46 -04:00
Colin Ian King
1511efaca0 RDMA/rxe: Remove redundant assignment to variable page_offset
The variable page_offset is assigned a value at the start of a loop and
redundantly zeroed at the end of the loop; no code reads the zeroed value.
The assignment is redundant and can be removed.

Signed-off-by: Colin Ian King <coking@nvidia.com>
Link: https://patch.msgid.link/20251014120343.2528608-1-coking@nvidia.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19 08:06:21 -04:00
Stefan Metzmacher
879424832d RDMA/core: let rdma_connect_locked() call lockdep_assert_held(&id_priv->handler_mutex)
rdma_accept() already has this assertion, so this makes things more
consistent and may prevent bugs in the future.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Link: https://patch.msgid.link/20251008165913.444276-1-metze@samba.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19 07:03:25 -04:00
Alok Tiwari
cfd51fcf11 RDMA/cxgb4: fix typo in write_pbl() debug message
Correct the debug log format string from "pdb_addr" -> "pbl_addr"
to match the actual variable name.

Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20250929091102.50384-1-alok.a.tiwari@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-10-19 06:51:46 -04:00
Yulin Lu
c9d869fb29 dt-bindings: ata: eswin: Document for EIC7700 SoC ahci
Document the SATA AHCI controller on the EIC7700 SoC platform,
including descriptions of its hardware configurations.

Signed-off-by: Yulin Lu <luyulin@eswincomputing.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
2025-10-13 09:34:33 +02:00
348 changed files with 20451 additions and 5878 deletions

View File

@@ -415,6 +415,7 @@ ForEachMacros:
- 'for_each_prop_dlc_cpus'
- 'for_each_prop_dlc_platforms'
- 'for_each_property_of_node'
- 'for_each_pt_level_entry'
- 'for_each_rdt_resource'
- 'for_each_reg'
- 'for_each_reg_filtered'

View File

@@ -345,7 +345,8 @@ Jayachandran C <c.jayachandran@gmail.com> <jayachandranc@netlogicmicro.com>
Jayachandran C <c.jayachandran@gmail.com> <jchandra@broadcom.com>
Jayachandran C <c.jayachandran@gmail.com> <jchandra@digeo.com>
Jayachandran C <c.jayachandran@gmail.com> <jnair@caviumnetworks.com>
<jean-philippe@linaro.org> <jean-philippe.brucker@arm.com>
Jean-Philippe Brucker <jpb@kernel.org> <jean-philippe.brucker@arm.com>
Jean-Philippe Brucker <jpb@kernel.org> <jean-philippe@linaro.org>
Jean-Michel Hautbois <jeanmichel.hautbois@yoseli.org> <jeanmichel.hautbois@ideasonboard.com>
Jean Tourrilhes <jt@hpl.hp.com>
Jeevan Shriram <quic_jshriram@quicinc.com> <jshriram@codeaurora.org>

View File

@@ -496,8 +496,17 @@ Description:
changed, only freed by writing 0. The kernel makes no guarantees
that data is maintained over an address space freeing event, and
there is no guarantee that a free followed by an allocate
results in the same address being allocated.
results in the same address being allocated. If extended linear
cache is present, the size indicates extended linear cache size
plus the CXL region size.
What: /sys/bus/cxl/devices/regionZ/extended_linear_cache_size
Date: October, 2025
KernelVersion: v6.19
Contact: linux-cxl@vger.kernel.org
Description:
(RO) The size of extended linear cache, if there is an extended
linear cache. Otherwise the attribute will not be visible.
What: /sys/bus/cxl/devices/regionZ/mode
Date: January, 2023

View File

@@ -0,0 +1,79 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/ata/eswin,eic7700-ahci.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Eswin EIC7700 SoC SATA Controller
maintainers:
- Yulin Lu <luyulin@eswincomputing.com>
- Huan He <hehuan1@eswincomputing.com>
description:
AHCI SATA controller embedded into the EIC7700 SoC is based on the DWC AHCI
SATA v5.00a IP core.
select:
properties:
compatible:
const: eswin,eic7700-ahci
required:
- compatible
allOf:
- $ref: snps,dwc-ahci-common.yaml#
properties:
compatible:
items:
- const: eswin,eic7700-ahci
- const: snps,dwc-ahci
clocks:
minItems: 2
maxItems: 2
clock-names:
items:
- const: pclk
- const: aclk
resets:
maxItems: 1
reset-names:
const: arst
ports-implemented:
const: 1
required:
- compatible
- reg
- interrupts
- clocks
- clock-names
- resets
- reset-names
- phys
- phy-names
- ports-implemented
unevaluatedProperties: false
examples:
- |
sata@50420000 {
compatible = "eswin,eic7700-ahci", "snps,dwc-ahci";
reg = <0x50420000 0x10000>;
interrupt-parent = <&plic>;
interrupts = <58>;
clocks = <&clock 171>, <&clock 186>;
clock-names = "pclk", "aclk";
phys = <&sata_phy>;
phy-names = "sata-phy";
ports-implemented = <0x1>;
resets = <&reset 96>;
reset-names = "arst";
};

View File

@@ -33,6 +33,10 @@ properties:
- description: SPEAr1340 AHCI SATA device
const: snps,spear-ahci
iommus:
minItems: 1
maxItems: 3
patternProperties:
"^sata-port@[0-9a-e]$":
$ref: /schemas/ata/snps,dwc-ahci-common.yaml#/$defs/dwc-ahci-port

View File

@@ -35,6 +35,8 @@ properties:
- description: Qcom SoCs implementing "qcom,smmu-500" and "arm,mmu-500"
items:
- enum:
- qcom,glymur-smmu-500
- qcom,kaanapali-smmu-500
- qcom,milos-smmu-500
- qcom,qcm2290-smmu-500
- qcom,qcs615-smmu-500

View File

@@ -82,6 +82,9 @@ properties:
- mediatek,mt8188-iommu-vdo # generation two
- mediatek,mt8188-iommu-vpp # generation two
- mediatek,mt8188-iommu-infra # generation two
- mediatek,mt8189-iommu-apu # generation two
- mediatek,mt8189-iommu-infra # generation two
- mediatek,mt8189-iommu-mm # generation two
- mediatek,mt8192-m4u # generation two
- mediatek,mt8195-iommu-vdo # generation two
- mediatek,mt8195-iommu-vpp # generation two
@@ -128,6 +131,7 @@ properties:
This is the mtk_m4u_id according to the HW. Specifies the mtk_m4u_id as
defined in
dt-binding/memory/mediatek,mt8188-memory-port.h for mt8188,
dt-binding/memory/mediatek,mt8189-memory-port.h for mt8189,
dt-binding/memory/mt2701-larb-port.h for mt2701 and mt7623,
dt-binding/memory/mt2712-larb-port.h for mt2712,
dt-binding/memory/mt6779-larb-port.h for mt6779,
@@ -164,6 +168,7 @@ allOf:
- mediatek,mt8186-iommu-mm
- mediatek,mt8188-iommu-vdo
- mediatek,mt8188-iommu-vpp
- mediatek,mt8189-iommu-mm
- mediatek,mt8192-m4u
- mediatek,mt8195-iommu-vdo
- mediatek,mt8195-iommu-vpp
@@ -180,6 +185,7 @@ allOf:
- mediatek,mt8186-iommu-mm
- mediatek,mt8188-iommu-vdo
- mediatek,mt8188-iommu-vpp
- mediatek,mt8189-iommu-mm
- mediatek,mt8192-m4u
- mediatek,mt8195-iommu-vdo
- mediatek,mt8195-iommu-vpp
@@ -208,6 +214,8 @@ allOf:
contains:
enum:
- mediatek,mt8188-iommu-infra
- mediatek,mt8189-iommu-apu
- mediatek,mt8189-iommu-infra
- mediatek,mt8195-iommu-infra
then:

View File

@@ -32,14 +32,18 @@ properties:
- const: qcom,msm-iommu-v2
clocks:
minItems: 2
items:
- description: Clock required for IOMMU register group access
- description: Clock required for underlying bus access
- description: Clock required for Translation Buffer Unit access
clock-names:
minItems: 2
items:
- const: iface
- const: bus
- const: tbu
power-domains:
maxItems: 1

View File

@@ -41,37 +41,6 @@ To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over
will fallback to allocate from :code:`ZONE_NORMAL`.
Zone and Node Quirks
====================
Let's consider a configuration where the local DRAM capacity is largely onlined
into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
CXL capacity has the opposite configuration - all onlined in
:code:`ZONE_MOVABLE`.
Under the default allocation policy, the page allocator will completely skip
:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of
Linux v6.15, the page allocator does (approximately) the following: ::
for (each zone in local_node):
for (each node in fallback_order):
attempt_allocation(gfp_flags);
Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
functionally unreachable for direct allocation. As a result, the only way
for CXL capacity to be used is via `demotion` in the reclaim path.
This configuration also means that if the DRAM ndoe has :code:`ZONE_MOVABLE`
capacity - when that capacity is depleted, the page allocator will actually
prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.
We may wish to invert this priority in future Linux versions.
If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes
when the DRAM nodes are depleted. See the reclaim section for more details.
CGroups and CPUSets
===================
Finally, assuming CXL memory is reachable via the page allocation (i.e. onlined

View File

@@ -0,0 +1,137 @@
.. SPDX-License-Identifier: GPL-2.0
========================
Generic Radix Page Table
========================
.. kernel-doc:: include/linux/generic_pt/common.h
:doc: Generic Radix Page Table
.. kernel-doc:: drivers/iommu/generic_pt/pt_defs.h
:doc: Generic Page Table Language
Usage
=====
Generic PT is structured as a multi-compilation system. Since each format
provides an API using a common set of names there can be only one format active
within a compilation unit. This design avoids function pointers around the low
level API.
Instead the function pointers can end up at the higher level API (i.e.
map/unmap, etc.) and the per-format code can be directly inlined into the
per-format compilation unit. For something like IOMMU each format will be
compiled into a per-format IOMMU operations kernel module.
For this to work the .c file for each compilation unit will include both the
format headers and the generic code for the implementation. For instance in an
implementation compilation unit the headers would normally be included as
follows:
generic_pt/fmt/iommu_amdv1.c::
#include <linux/generic_pt/common.h>
#include "defs_amdv1.h"
#include "../pt_defs.h"
#include "amdv1.h"
#include "../pt_common.h"
#include "../pt_iter.h"
#include "../iommu_pt.h" /* The IOMMU implementation */
iommu_pt.h includes definitions that will generate the operations functions for
map/unmap/etc. using the definitions provided by AMDv1. The resulting module
will have exported symbols named like pt_iommu_amdv1_init().
Refer to drivers/iommu/generic_pt/fmt/iommu_template.h for an example of how the
IOMMU implementation uses multi-compilation to generate per-format ops structs
pointers.
The format code is written so that the common names arise from #defines to
distinct format specific names. This is intended to aid debuggability by
avoiding symbol clashes across all the different formats.
Exported symbols and other global names are mangled using a per-format string
via the NS() helper macro.
The format uses struct pt_common as the top-level struct for the table,
and each format will have its own struct pt_xxx which embeds it to store
format-specific information.
The implementation will further wrap struct pt_common in its own top-level
struct, such as struct pt_iommu_amdv1.
Format functions at the struct pt_common level
----------------------------------------------
.. kernel-doc:: include/linux/generic_pt/common.h
:identifiers:
.. kernel-doc:: drivers/iommu/generic_pt/pt_common.h
Iteration Helpers
-----------------
.. kernel-doc:: drivers/iommu/generic_pt/pt_iter.h
Writing a Format
----------------
It is best to start from a simple format that is similar to the target. x86_64
is usually a good reference for something simple, and AMDv1 is something fairly
complete.
The required inline functions need to be implemented in the format header.
These should all follow the standard pattern of::
static inline pt_oaddr_t amdv1pt_entry_oa(const struct pt_state *pts)
{
[..]
}
#define pt_entry_oa amdv1pt_entry_oa
where a uniquely named per-format inline function provides the implementation
and a define maps it to the generic name. This is intended to make debug symbols
work better. inline functions should always be used as the prototypes in
pt_common.h will cause the compiler to validate the function signature to
prevent errors.
Review pt_fmt_defaults.h to understand some of the optional inlines.
Once the format compiles then it should be run through the generic page table
kunit test in kunit_generic_pt.h using kunit. For example::
$ tools/testing/kunit/kunit.py run --build_dir build_kunit_x86_64 --arch x86_64 --kunitconfig ./drivers/iommu/generic_pt/.kunitconfig amdv1_fmt_test.*
[...]
[11:15:08] Testing complete. Ran 9 tests: passed: 9
[11:15:09] Elapsed time: 3.137s total, 0.001s configuring, 2.368s building, 0.311s running
The generic tests are intended to prove out the format functions and give
clearer failures to speed up finding the problems. Once those pass then the
entire kunit suite should be run.
IOMMU Invalidation Features
---------------------------
Invalidation is how the page table algorithms synchronize with a HW cache of the
page table memory, typically called the TLB (or IOTLB for IOMMU cases).
The TLB can store present PTEs, non-present PTEs and table pointers, depending
on its design. Every HW has its own approach on how to describe what has changed
to have changed items removed from the TLB.
PT_FEAT_FLUSH_RANGE
~~~~~~~~~~~~~~~~~~~
PT_FEAT_FLUSH_RANGE is the easiest scheme to understand. It tries to generate a
single range invalidation for each operation, over-invalidating if there are
gaps of VA that don't need invalidation. This trades off impacted VA for number
of invalidation operations. It does not keep track of what is being invalidated;
however, if pages have to be freed then page table pointers have to be cleaned
from the walk cache. The range can start/end at any page boundary.
PT_FEAT_FLUSH_RANGE_NO_GAPS
~~~~~~~~~~~~~~~~~~~~~~~~~~~
PT_FEAT_FLUSH_RANGE_NO_GAPS is similar to PT_FEAT_FLUSH_RANGE; however, it tries
to minimize the amount of impacted VA by issuing extra flush operations. This is
useful if the cost of processing VA is very high, for instance because a
hypervisor is processing the page table with a shadowing algorithm.

View File

@@ -93,6 +93,7 @@ Subsystem-specific APIs
frame-buffer
aperture
generic-counter
generic_pt
gpio/index
hsi
hte/index

View File

@@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.
One of the biggest issues is that PCI doesn't require forwarding
transactions between hierarchy domains, and in PCIe, each Root Port
defines a separate hierarchy domain. To make things worse, there is no
simple way to determine if a given Root Complex supports this or not.
(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
only supports doing P2P when the endpoints involved are all behind the
same PCI bridge, as such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within the
hierarchy will be routable, but it does not require routing
between hierarchies.
For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
until they reach a host bridge or root port. If the path includes PCIe switches
then based on the ACS settings the transaction can route entirely within
the PCIe hierarchy and never reach the root port. The kernel will evaluate
the PCIe topology and always permit P2P in these well-defined cases.
The second issue is that to make use of existing interfaces in Linux,
memory that is used for P2P transactions needs to be backed by struct
pages. However, PCI BARs are not typically cache coherent so there are
a few corner case gotchas with these pages so developers need to
be careful about what they do with them.
However, if the P2P transaction reaches the host bridge then it might have to
hairpin back out the same root port, be routed inside the CPU SOC to another
PCIe root port, or routed internally to the SOC.
The PCIe specification doesn't define the forwarding of transactions between
hierarchy domains and kernel defaults to blocking such routing. There is an
allow list to allow detecting known-good HW, in which case P2P between any
two PCIe devices will be permitted.
Since P2P inherently is doing transactions between two devices it requires two
drivers to be co-operating inside the kernel. The providing driver has to convey
its MMIO to the consuming driver. To meet the driver model lifecycle rules the
MMIO must have all DMA mapping removed, all CPU accesses prevented, all page
table mappings undone before the providing driver completes remove().
This requires the providing and consuming driver to actively work together to
guarantee that the consuming driver has stopped using the MMIO during a removal
cycle. This is done by either a synchronous invalidation shutdown or waiting
for all usage refcounts to reach zero.
At the lowest level the P2P subsystem offers a naked struct p2p_provider that
delegates lifecycle management to the providing driver. It is expected that
drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
to provide an invalidation shutdown. These MMIO addresses have no struct page, and
if used with mmap() must create special PTEs. As such there are very few
kernel uAPIs that can accept pointers to them; in particular they cannot be used
with read()/write(), including O_DIRECT.
Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
pgmap ensures that when the pgmap is destroyed all other drivers have stopped
using the MMIO. This option works with O_DIRECT flows, in some cases, if the
underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
it also relies on architecture support along with alignment and minimum size
limitations.
Driver Writer's Guide
@@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory.
Struct Page Caveats
-------------------
Driver writers should be very careful about not passing these special
struct pages to code that isn't prepared for it. At this time, the kernel
interfaces do not have any checks for ensuring this. This obviously
precludes passing these pages to userspace.
While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
KVA is still MMIO and must still be accessed through the normal
readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
like any other MMIO mapping. While this will actually work on some
architectures, others will experience corruption or just crash in the kernel.
Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
access happens.
Usage With DMABUF
=================
DMABUF provides an alternative to the above struct page-based
client/provider/orchestrator system and should be used when struct page
doesn't exist. In this mode the exporting driver will wrap
some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
Userspace can then pass the FD to an importing driver which will ask the
exporting driver to map it to the importer.
In this case the initiator and target pci_devices are known and the P2P subsystem
is used to determine the mapping type. The phys_addr_t-based DMA API is used to
establish the dma_addr_t.
Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
to remove() it must deliver an invalidation shutdown to all DMABUF importing
drivers through move_notify() and synchronously DMA unmap all the MMIO.
No importing driver can continue to have a DMA map to the MMIO after the
exporting driver has destroyed its p2p_provider.
P2P DMA Support Library

View File

@@ -388,7 +388,7 @@ B: https://bugzilla.kernel.org
F: drivers/acpi/*thermal*
ACPI VIOT DRIVER
M: Jean-Philippe Brucker <jean-philippe@linaro.org>
M: Jean-Philippe Brucker <jpb@kernel.org>
L: linux-acpi@vger.kernel.org
L: iommu@lists.linux.dev
S: Maintained
@@ -2269,7 +2269,7 @@ F: drivers/iommu/arm/
F: drivers/iommu/io-pgtable-arm*
ARM SMMU SVA SUPPORT
R: Jean-Philippe Brucker <jean-philippe@linaro.org>
R: Jean-Philippe Brucker <jpb@kernel.org>
F: drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
ARM SUB-ARCHITECTURES
@@ -5243,6 +5243,13 @@ W: http://www.broadcom.com
F: drivers/infiniband/hw/bnxt_re/
F: include/uapi/rdma/bnxt_re-abi.h
BROADCOM 800 GIGABIT ROCE DRIVER
M: Siva Reddy Kallam <siva.kallam@broadcom.com>
L: linux-rdma@vger.kernel.org
S: Supported
W: http://www.broadcom.com
F: drivers/infiniband/hw/bng_re/
BROADCOM NVRAM DRIVER
M: Rafał Miłecki <zajec5@gmail.com>
L: linux-mips@vger.kernel.org
@@ -27214,6 +27221,13 @@ L: virtualization@lists.linux.dev
S: Maintained
F: drivers/vfio/pci/virtio
VFIO XE PCI DRIVER
M: Michał Winiarski <michal.winiarski@intel.com>
L: kvm@vger.kernel.org
L: intel-xe@lists.freedesktop.org
S: Supported
F: drivers/vfio/pci/xe
VGA_SWITCHEROO
R: Lukas Wunner <lukas@wunner.de>
S: Maintained
@@ -27455,7 +27469,7 @@ F: drivers/virtio/virtio_input.c
F: include/uapi/linux/virtio_input.h
VIRTIO IOMMU DRIVER
M: Jean-Philippe Brucker <jean-philippe@linaro.org>
M: Jean-Philippe Brucker <jpb@kernel.org>
L: virtualization@lists.linux.dev
S: Maintained
F: drivers/iommu/virtio-iommu.c

View File

@@ -9,6 +9,9 @@
#define _ASM_POWERPC_MEM_ENCRYPT_H
#include <asm/svm.h>
#include <linux/types.h>
struct device;
static inline bool force_dma_unencrypted(struct device *dev)
{

View File

@@ -1156,7 +1156,8 @@ EXPORT_SYMBOL_GPL(iommu_add_device);
*/
static int
spapr_tce_platform_iommu_attach_dev(struct iommu_domain *platform_domain,
struct device *dev)
struct device *dev,
struct iommu_domain *old)
{
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_table_group *table_group;
@@ -1189,7 +1190,7 @@ static struct iommu_domain spapr_tce_platform_domain = {
static int
spapr_tce_blocked_iommu_attach_dev(struct iommu_domain *platform_domain,
struct device *dev)
struct device *dev, struct iommu_domain *old)
{
struct iommu_group *grp = iommu_group_get(dev);
struct iommu_table_group *table_group;

View File

@@ -84,7 +84,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
{
iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
iter->len = vec->len;
return true;
}

View File

@@ -910,12 +910,13 @@ static void hmat_register_target(struct memory_target *target)
* Register generic port perf numbers. The nid may not be
* initialized and is still NUMA_NO_NODE.
*/
mutex_lock(&target_lock);
if (*(u16 *)target->gen_port_device_handle) {
hmat_update_generic_target(target);
target->registered = true;
scoped_guard(mutex, &target_lock) {
if (*(u16 *)target->gen_port_device_handle) {
hmat_update_generic_target(target);
target->registered = true;
return;
}
}
mutex_unlock(&target_lock);
hmat_hotplug_target(target);
}
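
For readers unfamiliar with the pattern in the hunk above: scoped_guard() from <linux/cleanup.h> holds the lock only for the braced block and drops it on every exit path, including the early return. A minimal sketch with illustrative names:

#include <linux/cleanup.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(example_lock);
static bool example_registered;

static void example_register(bool fast_path)
{
	scoped_guard(mutex, &example_lock) {
		if (fast_path) {
			example_registered = true;
			return;		/* the mutex is released automatically */
		}
	}
	/* the slow path runs without the lock held */
}
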

View File

@@ -5,7 +5,7 @@ config ARM_AMBA
if ARM_AMBA
config TEGRA_AHB
bool
bool "Enable AHB driver for NVIDIA Tegra SoCs" if COMPILE_TEST
default y if ARCH_TEGRA
help
Adds AHB configuration functionality for NVIDIA Tegra SoCs,

View File

@@ -4216,6 +4216,10 @@ static const struct ata_dev_quirks_entry __ata_dev_quirks[] = {
/* Apacer models with LPM issues */
{ "Apacer AS340*", NULL, ATA_QUIRK_NOLPM },
/* Silicon Motion models with LPM issues */
{ "MD619HXCLDE3TC", "TCVAID", ATA_QUIRK_NOLPM },
{ "MD619GXCLDE3TC", "TCV35D", ATA_QUIRK_NOLPM },
/* These specific Samsung models/firmware-revs do not handle LPM well */
{ "SAMSUNG MZMPC128HBFU-000MV", "CXM14M1Q", ATA_QUIRK_NOLPM },
{ "SAMSUNG SSD PM830 mSATA *", "CXM13D1Q", ATA_QUIRK_NOLPM },

View File

@@ -3191,7 +3191,8 @@ void ata_sff_port_init(struct ata_port *ap)
int __init ata_sff_init(void)
{
ata_sff_wq = alloc_workqueue("ata_sff", WQ_MEM_RECLAIM, WQ_MAX_ACTIVE);
ata_sff_wq = alloc_workqueue("ata_sff", WQ_MEM_RECLAIM | WQ_PERCPU,
WQ_MAX_ACTIVE);
if (!ata_sff_wq)
return -ENOMEM;

View File

@@ -75,6 +75,7 @@
#include <linux/blkdev.h>
#include <linux/delay.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <scsi/scsi_host.h>
#include <linux/libata.h>
@@ -632,9 +633,9 @@ static void it821x_display_disk(struct ata_port *ap, int n, u8 *buf)
cbl = "";
if (mode)
snprintf(mbuf, 8, "%5s%d", mtype, mode - 1);
snprintf(mbuf, sizeof(mbuf), "%5s%d", mtype, mode - 1);
else
strcpy(mbuf, "PIO");
strscpy(mbuf, "PIO");
if (buf[52] == 4)
ata_port_info(ap, "%d: %-6s %-8s %s %s\n",
n, mbuf, types[buf[52]], id, cbl);

View File

@@ -344,6 +344,7 @@ static const struct pcmcia_device_id pcmcia_devices[] = {
PCMCIA_DEVICE_PROD_ID2("NinjaATA-", 0xebe0bd79),
PCMCIA_DEVICE_PROD_ID12("PCMCIA", "CD-ROM", 0x281f1c5d, 0x66536591),
PCMCIA_DEVICE_PROD_ID12("PCMCIA", "PnPIDE", 0x281f1c5d, 0x0c694728),
PCMCIA_DEVICE_PROD_ID2("PCMCIA ATA/ATAPI Adapter", 0x888d7b73),
PCMCIA_DEVICE_PROD_ID12("SHUTTLE TECHNOLOGY LTD.", "PCCARD-IDE/ATAPI Adapter", 0x4a3f0ba0, 0x322560e1),
PCMCIA_DEVICE_PROD_ID12("SEAGATE", "ST1", 0x87c1b330, 0xe1f30883),
PCMCIA_DEVICE_PROD_ID12("SAMSUNG", "04/05/06", 0x43d74cb4, 0x6a22777d),

View File

@@ -230,42 +230,6 @@ struct tpm_chip *tpm_default_chip(void)
}
EXPORT_SYMBOL_GPL(tpm_default_chip);
/**
* tpm_find_get_ops() - find and reserve a TPM chip
* @chip: a &struct tpm_chip instance, %NULL for the default chip
*
* Finds a TPM chip and reserves its class device and operations. The chip must
* be released with tpm_put_ops() after use.
* This function is for internal use only. It supports existing TPM callers
* by accepting NULL, but those callers should be converted to pass in a chip
* directly.
*
* Return:
* A reserved &struct tpm_chip instance.
* %NULL if a chip is not found.
* %NULL if the chip is not available.
*/
struct tpm_chip *tpm_find_get_ops(struct tpm_chip *chip)
{
int rc;
if (chip) {
if (!tpm_try_get_ops(chip))
return chip;
return NULL;
}
chip = tpm_default_chip();
if (!chip)
return NULL;
rc = tpm_try_get_ops(chip);
/* release additional reference we got from tpm_default_chip() */
put_device(&chip->dev);
if (rc)
return NULL;
return chip;
}
/**
* tpm_dev_release() - free chip memory and the device number
* @dev: the character device for the TPM chip
@@ -282,7 +246,6 @@ static void tpm_dev_release(struct device *dev)
kfree(chip->work_space.context_buf);
kfree(chip->work_space.session_buf);
kfree(chip->allocated_banks);
#ifdef CONFIG_TCG_TPM2_HMAC
kfree(chip->auth);
#endif

View File

@@ -275,7 +275,8 @@ void tpm_common_release(struct file *file, struct file_priv *priv)
int __init tpm_dev_common_init(void)
{
tpm_dev_wq = alloc_workqueue("tpm_dev_wq", WQ_MEM_RECLAIM, 0);
tpm_dev_wq = alloc_workqueue("tpm_dev_wq", WQ_MEM_RECLAIM | WQ_PERCPU,
0);
return !tpm_dev_wq ? -ENOMEM : 0;
}

View File

@@ -313,10 +313,13 @@ int tpm_is_tpm2(struct tpm_chip *chip)
{
int rc;
chip = tpm_find_get_ops(chip);
if (!chip)
return -ENODEV;
rc = tpm_try_get_ops(chip);
if (rc)
return rc;
rc = (chip->flags & TPM_CHIP_FLAG_TPM2) != 0;
tpm_put_ops(chip);
@@ -338,10 +341,13 @@ int tpm_pcr_read(struct tpm_chip *chip, u32 pcr_idx,
{
int rc;
chip = tpm_find_get_ops(chip);
if (!chip)
return -ENODEV;
rc = tpm_try_get_ops(chip);
if (rc)
return rc;
if (chip->flags & TPM_CHIP_FLAG_TPM2)
rc = tpm2_pcr_read(chip, pcr_idx, digest, NULL);
else
@@ -369,10 +375,13 @@ int tpm_pcr_extend(struct tpm_chip *chip, u32 pcr_idx,
int rc;
int i;
chip = tpm_find_get_ops(chip);
if (!chip)
return -ENODEV;
rc = tpm_try_get_ops(chip);
if (rc)
return rc;
for (i = 0; i < chip->nr_allocated_banks; i++) {
if (digests[i].alg_id != chip->allocated_banks[i].alg_id) {
rc = -EINVAL;
@@ -492,10 +501,13 @@ int tpm_get_random(struct tpm_chip *chip, u8 *out, size_t max)
if (!out || max > TPM_MAX_RNG_DATA)
return -EINVAL;
chip = tpm_find_get_ops(chip);
if (!chip)
return -ENODEV;
rc = tpm_try_get_ops(chip);
if (rc)
return rc;
if (chip->flags & TPM_CHIP_FLAG_TPM2)
rc = tpm2_get_random(chip, out, max);
else

View File

@@ -267,7 +267,6 @@ static inline void tpm_msleep(unsigned int delay_msec)
int tpm_chip_bootstrap(struct tpm_chip *chip);
int tpm_chip_start(struct tpm_chip *chip);
void tpm_chip_stop(struct tpm_chip *chip);
struct tpm_chip *tpm_find_get_ops(struct tpm_chip *chip);
struct tpm_chip *tpm_chip_alloc(struct device *dev,
const struct tpm_class_ops *ops);

View File

@@ -799,11 +799,6 @@ int tpm1_pm_suspend(struct tpm_chip *chip, u32 tpm_suspend_pcr)
*/
int tpm1_get_pcr_allocation(struct tpm_chip *chip)
{
chip->allocated_banks = kcalloc(1, sizeof(*chip->allocated_banks),
GFP_KERNEL);
if (!chip->allocated_banks)
return -ENOMEM;
chip->allocated_banks[0].alg_id = TPM_ALG_SHA1;
chip->allocated_banks[0].digest_size = hash_digest_size[HASH_ALGO_SHA1];
chip->allocated_banks[0].crypto_id = HASH_ALGO_SHA1;

View File

@@ -550,11 +550,9 @@ ssize_t tpm2_get_pcr_allocation(struct tpm_chip *chip)
nr_possible_banks = be32_to_cpup(
(__be32 *)&buf.data[TPM_HEADER_SIZE + 5]);
chip->allocated_banks = kcalloc(nr_possible_banks,
sizeof(*chip->allocated_banks),
GFP_KERNEL);
if (!chip->allocated_banks) {
if (nr_possible_banks > TPM2_MAX_PCR_BANKS) {
pr_err("tpm: out of bank capacity: %u > %u\n",
nr_possible_banks, TPM2_MAX_PCR_BANKS);
rc = -ENOMEM;
goto out;
}

View File

@@ -179,6 +179,7 @@ static int crb_try_pluton_doorbell(struct crb_priv *priv, bool wait_for_complete
*
* @dev: crb device
* @priv: crb private data
* @loc: locality
*
* Write CRB_CTRL_REQ_GO_IDLE to TPM_CRB_CTRL_REQ
* The device should respond within TIMEOUT_C by clearing the bit.
@@ -233,6 +234,7 @@ static int crb_go_idle(struct tpm_chip *chip)
*
* @dev: crb device
* @priv: crb private data
* @loc: locality
*
* Write CRB_CTRL_REQ_CMD_READY to TPM_CRB_CTRL_REQ
* and poll till the device acknowledge it by clearing the bit.
@@ -412,7 +414,7 @@ static int crb_do_acpi_start(struct tpm_chip *chip)
#ifdef CONFIG_ARM64
/*
* This is a TPM Command Response Buffer start method that invokes a
* Secure Monitor Call to requrest the firmware to execute or cancel
* Secure Monitor Call to request the firmware to execute or cancel
* a TPM 2.0 command.
*/
static int tpm_crb_smc_start(struct device *dev, unsigned long func_id)

View File

@@ -265,8 +265,7 @@ static u8 tpm_tis_status(struct tpm_chip *chip)
/*
* Dump stack for forensics, as invalid TPM_STS.x could be
* potentially triggered by impaired tpm_try_get_ops() or
* tpm_find_get_ops().
* potentially triggered by impaired tpm_try_get_ops().
*/
dump_stack();
}

View File

@@ -3032,11 +3032,36 @@ static void qm_put_pci_res(struct hisi_qm *qm)
pci_release_mem_regions(pdev);
}
static void hisi_mig_region_clear(struct hisi_qm *qm)
{
u32 val;
/* Clear migration region set of PF */
if (qm->fun_type == QM_HW_PF && qm->ver > QM_HW_V3) {
val = readl(qm->io_base + QM_MIG_REGION_SEL);
val &= ~QM_MIG_REGION_EN;
writel(val, qm->io_base + QM_MIG_REGION_SEL);
}
}
static void hisi_mig_region_enable(struct hisi_qm *qm)
{
u32 val;
/* Select migration region of PF */
if (qm->fun_type == QM_HW_PF && qm->ver > QM_HW_V3) {
val = readl(qm->io_base + QM_MIG_REGION_SEL);
val |= QM_MIG_REGION_EN;
writel(val, qm->io_base + QM_MIG_REGION_SEL);
}
}
static void hisi_qm_pci_uninit(struct hisi_qm *qm)
{
struct pci_dev *pdev = qm->pdev;
pci_free_irq_vectors(pdev);
hisi_mig_region_clear(qm);
qm_put_pci_res(qm);
pci_disable_device(pdev);
}
@@ -5752,6 +5777,7 @@ int hisi_qm_init(struct hisi_qm *qm)
goto err_free_qm_memory;
qm_cmd_init(qm);
hisi_mig_region_enable(qm);
return 0;
@@ -5890,6 +5916,7 @@ static int qm_rebuild_for_resume(struct hisi_qm *qm)
}
qm_cmd_init(qm);
hisi_mig_region_enable(qm);
hisi_qm_dev_err_init(qm);
/* Set the doorbell timeout to QM_DB_TIMEOUT_CFG ns. */
writel(QM_DB_TIMEOUT_SET, qm->io_base + QM_DB_TIMEOUT_CFG);

View File

@@ -11,25 +11,36 @@
#include "cxlpci.h"
#include "cxl.h"
struct cxl_cxims_data {
int nr_maps;
u64 xormaps[] __counted_by(nr_maps);
};
static const guid_t acpi_cxl_qtg_id_guid =
GUID_INIT(0xF365F9A6, 0xA7DE, 0x4071,
0xA6, 0x6A, 0xB4, 0x0C, 0x0B, 0x4F, 0x8E, 0x52);
static u64 cxl_apply_xor_maps(struct cxl_root_decoder *cxlrd, u64 addr)
#define HBIW_TO_NR_MAPS_SIZE (CXL_DECODER_MAX_INTERLEAVE + 1)
static const int hbiw_to_nr_maps[HBIW_TO_NR_MAPS_SIZE] = {
[1] = 0, [2] = 1, [3] = 0, [4] = 2, [6] = 1, [8] = 3, [12] = 2, [16] = 4
};
static const int valid_hbiw[] = { 1, 2, 3, 4, 6, 8, 12, 16 };
u64 cxl_do_xormap_calc(struct cxl_cxims_data *cximsd, u64 addr, int hbiw)
{
struct cxl_cxims_data *cximsd = cxlrd->platform_data;
int hbiw = cxlrd->cxlsd.nr_targets;
int nr_maps_to_apply = -1;
u64 val;
int pos;
/* No xormaps for host bridge interleave ways of 1 or 3 */
if (hbiw == 1 || hbiw == 3)
return addr;
/*
* Strictly validate hbiw since this function is used for testing and
* that nullifies any expectation of trusted parameters from the CXL
* Region Driver.
*/
for (int i = 0; i < ARRAY_SIZE(valid_hbiw); i++) {
if (valid_hbiw[i] == hbiw) {
nr_maps_to_apply = hbiw_to_nr_maps[hbiw];
break;
}
}
if (nr_maps_to_apply == -1 || nr_maps_to_apply > cximsd->nr_maps)
return ULLONG_MAX;
/*
* In regions using XOR interleave arithmetic the CXL HPA may not
@@ -60,6 +71,14 @@ static u64 cxl_apply_xor_maps(struct cxl_root_decoder *cxlrd, u64 addr)
return addr;
}
EXPORT_SYMBOL_FOR_MODULES(cxl_do_xormap_calc, "cxl_translate");
static u64 cxl_apply_xor_maps(struct cxl_root_decoder *cxlrd, u64 addr)
{
struct cxl_cxims_data *cximsd = cxlrd->platform_data;
return cxl_do_xormap_calc(cximsd, addr, cxlrd->cxlsd.nr_targets);
}
struct cxl_cxims_context {
struct device *dev;
@@ -353,7 +372,7 @@ static int cxl_acpi_set_cache_size(struct cxl_root_decoder *cxlrd)
rc = hmat_get_extended_linear_cache_size(&res, nid, &cache_size);
if (rc)
return rc;
return 0;
/*
* The cache range is expected to be within the CFMWS.
@@ -378,21 +397,18 @@ static void cxl_setup_extended_linear_cache(struct cxl_root_decoder *cxlrd)
int rc;
rc = cxl_acpi_set_cache_size(cxlrd);
if (!rc)
return;
if (rc != -EOPNOTSUPP) {
if (rc) {
/*
* Failing to support extended linear cache region resize does not
* Failing to retrieve extended linear cache region resize does not
* prevent the region from functioning. Only causes cxl list showing
* incorrect region size.
*/
dev_warn(cxlrd->cxlsd.cxld.dev.parent,
"Extended linear cache calculation failed rc:%d\n", rc);
}
"Extended linear cache retrieval failed rc:%d\n", rc);
/* Ignoring return code */
cxlrd->cache_size = 0;
/* Ignoring return code */
cxlrd->cache_size = 0;
}
}
DEFINE_FREE(put_cxlrd, struct cxl_root_decoder *,
@@ -453,8 +469,6 @@ static int __cxl_parse_cfmws(struct acpi_cedt_cfmws *cfmws,
ig = CXL_DECODER_MIN_GRANULARITY;
cxld->interleave_granularity = ig;
cxl_setup_extended_linear_cache(cxlrd);
if (cfmws->interleave_arithmetic == ACPI_CEDT_CFMWS_ARITHMETIC_XOR) {
if (ways != 1 && ways != 3) {
cxims_ctx = (struct cxl_cxims_context) {
@@ -470,19 +484,14 @@ static int __cxl_parse_cfmws(struct acpi_cedt_cfmws *cfmws,
return -EINVAL;
}
}
cxlrd->ops.hpa_to_spa = cxl_apply_xor_maps;
cxlrd->ops.spa_to_hpa = cxl_apply_xor_maps;
}
cxl_setup_extended_linear_cache(cxlrd);
cxlrd->qos_class = cfmws->qtg_id;
if (cfmws->interleave_arithmetic == ACPI_CEDT_CFMWS_ARITHMETIC_XOR) {
cxlrd->ops = kzalloc(sizeof(*cxlrd->ops), GFP_KERNEL);
if (!cxlrd->ops)
return -ENOMEM;
cxlrd->ops->hpa_to_spa = cxl_apply_xor_maps;
cxlrd->ops->spa_to_hpa = cxl_apply_xor_maps;
}
rc = cxl_decoder_add(cxld);
if (rc)
return rc;

View File

@@ -826,7 +826,7 @@ static struct xarray *cxl_switch_gather_bandwidth(struct cxl_region *cxlr,
cxl_coordinates_combine(coords, coords, ctx->coord);
/*
* Take the min of the calculated bandwdith and the upstream
* Take the min of the calculated bandwidth and the upstream
* switch SSLBIS bandwidth if there's a parent switch
*/
if (!is_root)
@@ -949,7 +949,7 @@ static struct xarray *cxl_hb_gather_bandwidth(struct xarray *xa)
/**
* cxl_region_update_bandwidth - Update the bandwidth access coordinates of a region
* @cxlr: The region being operated on
* @input_xa: xarray holds cxl_perf_ctx wht calculated bandwidth per ACPI0017 instance
* @input_xa: xarray holds cxl_perf_ctx with calculated bandwidth per ACPI0017 instance
*/
static void cxl_region_update_bandwidth(struct cxl_region *cxlr,
struct xarray *input_xa)

View File

@@ -905,6 +905,9 @@ static void cxl_decoder_reset(struct cxl_decoder *cxld)
if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
return;
if (test_bit(CXL_DECODER_F_LOCK, &cxld->flags))
return;
if (port->commit_end == id)
cxl_port_commit_reap(cxld);
else

View File

@@ -71,85 +71,6 @@ struct cxl_dport *__devm_cxl_add_dport_by_dev(struct cxl_port *port,
}
EXPORT_SYMBOL_NS_GPL(__devm_cxl_add_dport_by_dev, "CXL");
struct cxl_walk_context {
struct pci_bus *bus;
struct cxl_port *port;
int type;
int error;
int count;
};
static int match_add_dports(struct pci_dev *pdev, void *data)
{
struct cxl_walk_context *ctx = data;
struct cxl_port *port = ctx->port;
int type = pci_pcie_type(pdev);
struct cxl_register_map map;
struct cxl_dport *dport;
u32 lnkcap, port_num;
int rc;
if (pdev->bus != ctx->bus)
return 0;
if (!pci_is_pcie(pdev))
return 0;
if (type != ctx->type)
return 0;
if (pci_read_config_dword(pdev, pci_pcie_cap(pdev) + PCI_EXP_LNKCAP,
&lnkcap))
return 0;
rc = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, &map);
if (rc)
dev_dbg(&port->dev, "failed to find component registers\n");
port_num = FIELD_GET(PCI_EXP_LNKCAP_PN, lnkcap);
dport = devm_cxl_add_dport(port, &pdev->dev, port_num, map.resource);
if (IS_ERR(dport)) {
ctx->error = PTR_ERR(dport);
return PTR_ERR(dport);
}
ctx->count++;
return 0;
}
/**
* devm_cxl_port_enumerate_dports - enumerate downstream ports of the upstream port
* @port: cxl_port whose ->uport_dev is the upstream of dports to be enumerated
*
* Returns a positive number of dports enumerated or a negative error
* code.
*/
int devm_cxl_port_enumerate_dports(struct cxl_port *port)
{
struct pci_bus *bus = cxl_port_to_pci_bus(port);
struct cxl_walk_context ctx;
int type;
if (!bus)
return -ENXIO;
if (pci_is_root_bus(bus))
type = PCI_EXP_TYPE_ROOT_PORT;
else
type = PCI_EXP_TYPE_DOWNSTREAM;
ctx = (struct cxl_walk_context) {
.port = port,
.bus = bus,
.type = type,
};
pci_walk_bus(bus, match_add_dports, &ctx);
if (ctx.count == 0)
return -ENODEV;
if (ctx.error)
return ctx.error;
return ctx.count;
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_enumerate_dports, "CXL");
static int cxl_dvsec_mem_range_valid(struct cxl_dev_state *cxlds, int id)
{
struct pci_dev *pdev = to_pci_dev(cxlds->dev);
@@ -1217,6 +1138,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport)
return 0;
}
struct cxl_walk_context {
struct pci_bus *bus;
struct cxl_port *port;
int type;
int error;
int count;
};
static int count_dports(struct pci_dev *pdev, void *data)
{
struct cxl_walk_context *ctx = data;

View File

@@ -459,7 +459,6 @@ static void cxl_root_decoder_release(struct device *dev)
if (atomic_read(&cxlrd->region_id) >= 0)
memregion_free(atomic_read(&cxlrd->region_id));
__cxl_decoder_release(&cxlrd->cxlsd.cxld);
kfree(cxlrd->ops);
kfree(cxlrd);
}

View File

@@ -245,6 +245,9 @@ static void cxl_region_decode_reset(struct cxl_region *cxlr, int count)
struct cxl_region_params *p = &cxlr->params;
int i;
if (test_bit(CXL_REGION_F_LOCK, &cxlr->flags))
return;
/*
* Before region teardown attempt to flush, evict any data cached for
* this region, or scream loudly about missing arch / platform support
@@ -419,6 +422,9 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
return len;
}
if (test_bit(CXL_REGION_F_LOCK, &cxlr->flags))
return -EPERM;
rc = queue_reset(cxlr);
if (rc)
return rc;
@@ -461,21 +467,6 @@ static ssize_t commit_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RW(commit);
static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
int n)
{
struct device *dev = kobj_to_dev(kobj);
struct cxl_region *cxlr = to_cxl_region(dev);
/*
* Support tooling that expects to find a 'uuid' attribute for all
* regions regardless of mode.
*/
if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_PARTMODE_PMEM)
return 0444;
return a->mode;
}
static ssize_t interleave_ways_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -754,6 +745,21 @@ static ssize_t size_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RW(size);
static ssize_t extended_linear_cache_size_show(struct device *dev,
struct device_attribute *attr,
char *buf)
{
struct cxl_region *cxlr = to_cxl_region(dev);
struct cxl_region_params *p = &cxlr->params;
ssize_t rc;
ACQUIRE(rwsem_read_intr, rwsem)(&cxl_rwsem.region);
if ((rc = ACQUIRE_ERR(rwsem_read_intr, &rwsem)))
return rc;
return sysfs_emit(buf, "%#llx\n", p->cache_size);
}
static DEVICE_ATTR_RO(extended_linear_cache_size);
static struct attribute *cxl_region_attrs[] = {
&dev_attr_uuid.attr,
&dev_attr_commit.attr,
@@ -762,9 +768,34 @@ static struct attribute *cxl_region_attrs[] = {
&dev_attr_resource.attr,
&dev_attr_size.attr,
&dev_attr_mode.attr,
&dev_attr_extended_linear_cache_size.attr,
NULL,
};
static umode_t cxl_region_visible(struct kobject *kobj, struct attribute *a,
int n)
{
struct device *dev = kobj_to_dev(kobj);
struct cxl_region *cxlr = to_cxl_region(dev);
/*
* Support tooling that expects to find a 'uuid' attribute for all
* regions regardless of mode.
*/
if (a == &dev_attr_uuid.attr && cxlr->mode != CXL_PARTMODE_PMEM)
return 0444;
/*
* Don't display extended linear cache attribute if there is no
* extended linear cache.
*/
if (a == &dev_attr_extended_linear_cache_size.attr &&
cxlr->params.cache_size == 0)
return 0;
return a->mode;
}
static const struct attribute_group cxl_region_group = {
.attrs = cxl_region_attrs,
.is_visible = cxl_region_visible,
@@ -838,16 +869,16 @@ static int match_free_decoder(struct device *dev, const void *data)
return 1;
}
static bool region_res_match_cxl_range(const struct cxl_region_params *p,
const struct range *range)
static bool spa_maps_hpa(const struct cxl_region_params *p,
const struct range *range)
{
if (!p->res)
return false;
/*
* If an extended linear cache region then the CXL range is assumed
* to be fronted by the DRAM range in current known implementation.
* This assumption will be made until a variant implementation exists.
* The extended linear cache region is constructed by a 1:1 ratio
* where the SPA maps equal amounts of DRAM and CXL HPA capacity with
* CXL decoders at the high end of the SPA range.
*/
return p->res->start + p->cache_size == range->start &&
p->res->end == range->end;
@@ -865,7 +896,7 @@ static int match_auto_decoder(struct device *dev, const void *data)
cxld = to_cxl_decoder(dev);
r = &cxld->hpa_range;
if (region_res_match_cxl_range(p, r))
if (spa_maps_hpa(p, r))
return 1;
return 0;
@@ -1059,6 +1090,16 @@ static int cxl_rr_assign_decoder(struct cxl_port *port, struct cxl_region *cxlr,
return 0;
}
static void cxl_region_set_lock(struct cxl_region *cxlr,
struct cxl_decoder *cxld)
{
if (!test_bit(CXL_DECODER_F_LOCK, &cxld->flags))
return;
set_bit(CXL_REGION_F_LOCK, &cxlr->flags);
clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
}
/**
* cxl_port_attach_region() - track a region's interest in a port by endpoint
* @port: port to add a new region reference 'struct cxl_region_ref'
@@ -1170,6 +1211,8 @@ static int cxl_port_attach_region(struct cxl_port *port,
}
}
cxl_region_set_lock(cxlr, cxld);
rc = cxl_rr_ep_add(cxl_rr, cxled);
if (rc) {
dev_dbg(&cxlr->dev,
@@ -1328,7 +1371,7 @@ static int cxl_port_setup_targets(struct cxl_port *port,
struct cxl_endpoint_decoder *cxled)
{
struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
int parent_iw, parent_ig, ig, iw, rc, inc = 0, pos = cxled->pos;
int parent_iw, parent_ig, ig, iw, rc, pos = cxled->pos;
struct cxl_port *parent_port = to_cxl_port(port->dev.parent);
struct cxl_region_ref *cxl_rr = cxl_rr_load(port, cxlr);
struct cxl_memdev *cxlmd = cxled_to_memdev(cxled);
@@ -1465,7 +1508,7 @@ static int cxl_port_setup_targets(struct cxl_port *port,
if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags)) {
if (cxld->interleave_ways != iw ||
(iw > 1 && cxld->interleave_granularity != ig) ||
!region_res_match_cxl_range(p, &cxld->hpa_range) ||
!spa_maps_hpa(p, &cxld->hpa_range) ||
((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
dev_err(&cxlr->dev,
"%s:%s %s expected iw: %d ig: %d %pr\n",
@@ -1520,9 +1563,8 @@ add_target:
cxlsd->target[cxl_rr->nr_targets_set] = ep->dport;
cxlsd->cxld.target_map[cxl_rr->nr_targets_set] = ep->dport->port_id;
}
inc = 1;
cxl_rr->nr_targets_set++;
out_target_set:
cxl_rr->nr_targets_set += inc;
dev_dbg(&cxlr->dev, "%s:%s target[%d] = %s for %s:%s @ %d\n",
dev_name(port->uport_dev), dev_name(&port->dev),
cxl_rr->nr_targets_set - 1, dev_name(ep->dport->dport_dev),
@@ -2439,6 +2481,7 @@ static struct cxl_region *cxl_region_alloc(struct cxl_root_decoder *cxlrd, int i
dev->bus = &cxl_bus_type;
dev->type = &cxl_region_type;
cxlr->id = id;
cxl_region_set_lock(cxlr, &cxlrd->cxlsd.cxld);
return cxlr;
}
@@ -2924,38 +2967,119 @@ static bool cxl_is_hpa_in_chunk(u64 hpa, struct cxl_region *cxlr, int pos)
return false;
}
static bool has_hpa_to_spa(struct cxl_root_decoder *cxlrd)
#define CXL_POS_ZERO 0
/**
* cxl_validate_translation_params() - validate interleave translation parameters
* @eiw: encoded interleave ways
* @eig: encoded interleave granularity
* @pos: position in interleave
*
* Callers pass CXL_POS_ZERO when no position parameter needs validating.
*
* Returns: 0 on success, -EINVAL on first invalid parameter
*/
int cxl_validate_translation_params(u8 eiw, u16 eig, int pos)
{
return cxlrd->ops && cxlrd->ops->hpa_to_spa;
}
int ways, gran;
static bool has_spa_to_hpa(struct cxl_root_decoder *cxlrd)
{
return cxlrd->ops && cxlrd->ops->spa_to_hpa;
}
u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa)
{
struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
u64 dpa_offset, hpa_offset, bits_upper, mask_upper, hpa;
struct cxl_region_params *p = &cxlr->params;
struct cxl_endpoint_decoder *cxled = NULL;
u16 eig = 0;
u8 eiw = 0;
int pos;
for (int i = 0; i < p->nr_targets; i++) {
cxled = p->targets[i];
if (cxlmd == cxled_to_memdev(cxled))
break;
if (eiw_to_ways(eiw, &ways)) {
pr_debug("%s: invalid eiw=%u\n", __func__, eiw);
return -EINVAL;
}
if (!cxled || cxlmd != cxled_to_memdev(cxled))
if (eig_to_granularity(eig, &gran)) {
pr_debug("%s: invalid eig=%u\n", __func__, eig);
return -EINVAL;
}
if (pos < 0 || pos >= ways) {
pr_debug("%s: invalid pos=%d for ways=%u\n", __func__, pos,
ways);
return -EINVAL;
}
return 0;
}
EXPORT_SYMBOL_FOR_MODULES(cxl_validate_translation_params, "cxl_translate");
u64 cxl_calculate_dpa_offset(u64 hpa_offset, u8 eiw, u16 eig)
{
u64 dpa_offset, bits_lower, bits_upper, temp;
int ret;
ret = cxl_validate_translation_params(eiw, eig, CXL_POS_ZERO);
if (ret)
return ULLONG_MAX;
pos = cxled->pos;
ways_to_eiw(p->interleave_ways, &eiw);
granularity_to_eig(p->interleave_granularity, &eig);
/*
* DPA offset: CXL Spec 3.2 Section 8.2.4.20.13
* Lower bits [IG+7:0] pass through unchanged
* (eiw < 8)
* Per spec: DPAOffset[51:IG+8] = (HPAOffset[51:IG+IW+8] >> IW)
* Clear the position bits to isolate upper section, then
* reverse the left shift by eiw that occurred during DPA->HPA
* (eiw >= 8)
* Per spec: DPAOffset[51:IG+8] = HPAOffset[51:IG+IW] / 3
* Extract upper bits from the correct bit range and divide by 3
* to recover the original DPA upper bits
*/
bits_lower = hpa_offset & GENMASK_ULL(eig + 7, 0);
if (eiw < 8) {
temp = hpa_offset &= ~GENMASK_ULL(eig + eiw + 8 - 1, 0);
dpa_offset = temp >> eiw;
} else {
bits_upper = div64_u64(hpa_offset >> (eig + eiw), 3);
dpa_offset = bits_upper << (eig + 8);
}
dpa_offset |= bits_lower;
return dpa_offset;
}
EXPORT_SYMBOL_FOR_MODULES(cxl_calculate_dpa_offset, "cxl_translate");
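As a quick worked example of the eiw < 8 branch (numbers are purely illustrative): with eig = 0 (256-byte granularity) and eiw = 1 (2-way), hpa_offset = 0x1234 splits into bits_lower = 0x34 and an upper part of 0x1200 once bits [8:0] are cleared; shifting the upper part right by eiw gives 0x900, so dpa_offset = 0x934.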
int cxl_calculate_position(u64 hpa_offset, u8 eiw, u16 eig)
{
unsigned int ways = 0;
u64 shifted, rem;
int pos, ret;
ret = cxl_validate_translation_params(eiw, eig, CXL_POS_ZERO);
if (ret)
return ret;
if (!eiw)
/* position is 0 if no interleaving */
return 0;
/*
* Interleave position: CXL Spec 3.2 Section 8.2.4.20.13
* eiw < 8
* Position is in the IW bits at HPA_OFFSET[IG+8+IW-1:IG+8].
* Per spec "remove IW bits starting with bit position IG+8"
* eiw >= 8
* Position is not explicitly stored in HPA_OFFSET bits. It is
* derived from the modulo operation of the upper bits using
* the total number of interleave ways.
*/
if (eiw < 8) {
pos = (hpa_offset >> (eig + 8)) & GENMASK(eiw - 1, 0);
} else {
shifted = hpa_offset >> (eig + 8);
eiw_to_ways(eiw, &ways);
div64_u64_rem(shifted, ways, &rem);
pos = rem;
}
return pos;
}
EXPORT_SYMBOL_FOR_MODULES(cxl_calculate_position, "cxl_translate");
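Continuing the same illustrative 2-way, 256-byte example: hpa_offset = 0x1334 gives pos = (0x1334 >> 8) & 0x1 = 1, i.e. the access decodes to the second device in the set. For the eiw >= 8 encodings (3-, 6- and 12-way), the position is instead the remainder of (hpa_offset >> (eig + 8)) divided by the number of ways.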
u64 cxl_calculate_hpa_offset(u64 dpa_offset, int pos, u8 eiw, u16 eig)
{
u64 mask_upper, hpa_offset, bits_upper;
int ret;
ret = cxl_validate_translation_params(eiw, eig, pos);
if (ret)
return ULLONG_MAX;
/*
* The device position in the region interleave set was removed
@@ -2967,9 +3091,6 @@ u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
* 8.2.4.19.13 Implementation Note: Device Decode Logic
*/
/* Remove the dpa base */
dpa_offset = dpa - cxl_dpa_resource_start(cxled);
mask_upper = GENMASK_ULL(51, eig + 8);
if (eiw < 8) {
@@ -2984,12 +3105,43 @@ u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
/* The lower bits remain unchanged */
hpa_offset |= dpa_offset & GENMASK_ULL(eig + 7, 0);
return hpa_offset;
}
EXPORT_SYMBOL_FOR_MODULES(cxl_calculate_hpa_offset, "cxl_translate");
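The three helpers above implement the CXL 3.2 Section 8.2.4.20.13 math and are exported only to the cxl_translate test module. The following is a stand-alone sketch of the eiw < 8 round trip in plain userspace C; the function names are invented for illustration and the kernel exports above remain the authoritative implementation.
/*
 * Model of the eiw < 8 translation math (CXL 3.2 Section 8.2.4.20.13),
 * for illustration only.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t model_hpa_offset(uint64_t dpa_offset, int pos, int eiw, int eig)
{
	uint64_t lower = dpa_offset & ((1ULL << (eig + 8)) - 1);
	uint64_t upper = dpa_offset & ~((1ULL << (eig + 8)) - 1);

	/* Upper DPA bits shift left by IW; the position fills the freed bits */
	return (upper << eiw) | ((uint64_t)pos << (eig + 8)) | lower;
}

static uint64_t model_dpa_offset(uint64_t hpa_offset, int eiw, int eig)
{
	uint64_t lower = hpa_offset & ((1ULL << (eig + 8)) - 1);
	uint64_t upper = hpa_offset & ~((1ULL << (eig + eiw + 8)) - 1);

	return (upper >> eiw) | lower;
}

static int model_position(uint64_t hpa_offset, int eiw, int eig)
{
	return (hpa_offset >> (eig + 8)) & ((1u << eiw) - 1);
}

int main(void)
{
	/* 2-way (eiw = 1), 256-byte granularity (eig = 0), device position 1 */
	uint64_t hpa = model_hpa_offset(0x900, 1, 1, 0);

	printf("hpa_offset = %#llx\n", (unsigned long long)hpa);	/* 0x1300 */
	assert(model_position(hpa, 1, 0) == 1);
	assert(model_dpa_offset(hpa, 1, 0) == 0x900);
	return 0;
}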
u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
u64 dpa)
{
struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
struct cxl_region_params *p = &cxlr->params;
struct cxl_endpoint_decoder *cxled = NULL;
u64 dpa_offset, hpa_offset, hpa;
u16 eig = 0;
u8 eiw = 0;
int pos;
for (int i = 0; i < p->nr_targets; i++) {
if (cxlmd == cxled_to_memdev(p->targets[i])) {
cxled = p->targets[i];
break;
}
}
if (!cxled)
return ULLONG_MAX;
pos = cxled->pos;
ways_to_eiw(p->interleave_ways, &eiw);
granularity_to_eig(p->interleave_granularity, &eig);
dpa_offset = dpa - cxl_dpa_resource_start(cxled);
hpa_offset = cxl_calculate_hpa_offset(dpa_offset, pos, eiw, eig);
/* Apply the hpa_offset to the region base address */
hpa = hpa_offset + p->res->start + p->cache_size;
/* Root decoder translation overrides typical modulo decode */
if (has_hpa_to_spa(cxlrd))
hpa = cxlrd->ops->hpa_to_spa(cxlrd, hpa);
if (cxlrd->ops.hpa_to_spa)
hpa = cxlrd->ops.hpa_to_spa(cxlrd, hpa);
if (!cxl_resource_contains_addr(p->res, hpa)) {
dev_dbg(&cxlr->dev,
@@ -2998,7 +3150,7 @@ u64 cxl_dpa_to_hpa(struct cxl_region *cxlr, const struct cxl_memdev *cxlmd,
}
/* Simple chunk check, by pos & gran, only applies to modulo decodes */
if (!has_hpa_to_spa(cxlrd) && (!cxl_is_hpa_in_chunk(hpa, cxlr, pos)))
if (!cxlrd->ops.hpa_to_spa && !cxl_is_hpa_in_chunk(hpa, cxlr, pos))
return ULLONG_MAX;
return hpa;
@@ -3016,8 +3168,6 @@ static int region_offset_to_dpa_result(struct cxl_region *cxlr, u64 offset,
struct cxl_root_decoder *cxlrd = to_cxl_root_decoder(cxlr->dev.parent);
struct cxl_endpoint_decoder *cxled;
u64 hpa, hpa_offset, dpa_offset;
u64 bits_upper, bits_lower;
u64 shifted, rem, temp;
u16 eig = 0;
u8 eiw = 0;
int pos;
@@ -3033,56 +3183,21 @@ static int region_offset_to_dpa_result(struct cxl_region *cxlr, u64 offset,
* If the root decoder has SPA to CXL HPA callback, use it. Otherwise
* CXL HPA is assumed to equal SPA.
*/
if (has_spa_to_hpa(cxlrd)) {
hpa = cxlrd->ops->spa_to_hpa(cxlrd, p->res->start + offset);
if (cxlrd->ops.spa_to_hpa) {
hpa = cxlrd->ops.spa_to_hpa(cxlrd, p->res->start + offset);
hpa_offset = hpa - p->res->start;
} else {
hpa_offset = offset;
}
/*
* Interleave position: CXL Spec 3.2 Section 8.2.4.20.13
* eiw < 8
* Position is in the IW bits at HPA_OFFSET[IG+8+IW-1:IG+8].
* Per spec "remove IW bits starting with bit position IG+8"
* eiw >= 8
* Position is not explicitly stored in HPA_OFFSET bits. It is
* derived from the modulo operation of the upper bits using
* the total number of interleave ways.
*/
if (eiw < 8) {
pos = (hpa_offset >> (eig + 8)) & GENMASK(eiw - 1, 0);
} else {
shifted = hpa_offset >> (eig + 8);
div64_u64_rem(shifted, p->interleave_ways, &rem);
pos = rem;
}
pos = cxl_calculate_position(hpa_offset, eiw, eig);
if (pos < 0 || pos >= p->nr_targets) {
dev_dbg(&cxlr->dev, "Invalid position %d for %d targets\n",
pos, p->nr_targets);
return -ENXIO;
}
/*
* DPA offset: CXL Spec 3.2 Section 8.2.4.20.13
* Lower bits [IG+7:0] pass through unchanged
* (eiw < 8)
* Per spec: DPAOffset[51:IG+8] = (HPAOffset[51:IG+IW+8] >> IW)
* Clear the position bits to isolate upper section, then
* reverse the left shift by eiw that occurred during DPA->HPA
* (eiw >= 8)
* Per spec: DPAOffset[51:IG+8] = HPAOffset[51:IG+IW] / 3
* Extract upper bits from the correct bit range and divide by 3
* to recover the original DPA upper bits
*/
bits_lower = hpa_offset & GENMASK_ULL(eig + 7, 0);
if (eiw < 8) {
temp = hpa_offset &= ~((u64)GENMASK(eig + eiw + 8 - 1, 0));
dpa_offset = temp >> eiw;
} else {
bits_upper = div64_u64(hpa_offset >> (eig + eiw), 3);
dpa_offset = bits_upper << (eig + 8);
}
dpa_offset |= bits_lower;
dpa_offset = cxl_calculate_dpa_offset(hpa_offset, eiw, eig);
/* Look-up and return the result: a memdev and a DPA */
for (int i = 0; i < p->nr_targets; i++) {
@@ -3398,7 +3513,7 @@ static int match_region_by_range(struct device *dev, const void *data)
p = &cxlr->params;
guard(rwsem_read)(&cxl_rwsem.region);
return region_res_match_cxl_range(p, r);
return spa_maps_hpa(p, r);
}
static int cxl_extended_linear_cache_resize(struct cxl_region *cxlr,
@@ -3479,6 +3594,10 @@ static int __construct_region(struct cxl_region *cxlr,
"Extended linear cache calculation failed rc:%d\n", rc);
}
rc = sysfs_update_group(&cxlr->dev.kobj, &cxl_region_group);
if (rc)
return rc;
rc = insert_resource(cxlrd->res, res);
if (rc) {
/*

View File

@@ -451,7 +451,7 @@ struct cxl_root_decoder {
void *platform_data;
struct mutex range_lock;
int qos_class;
struct cxl_rd_ops *ops;
struct cxl_rd_ops ops;
struct cxl_switch_decoder cxlsd;
};
@@ -517,6 +517,14 @@ enum cxl_partition_mode {
*/
#define CXL_REGION_F_NEEDS_RESET 1
/*
* Indicates whether this region is locked because one or more of its decoders
* are locked. Locking is treated as all or nothing for this attribute.
* CXL_REGION_F_NEEDS_RESET must not be set while this flag is set.
*/
#define CXL_REGION_F_LOCK 2
/**
* struct cxl_region - CXL region
* @dev: This region's device
@@ -738,6 +746,25 @@ static inline bool is_cxl_root(struct cxl_port *port)
return port->uport_dev == port->dev.parent;
}
/* Address translation functions exported to cxl_translate test module only */
int cxl_validate_translation_params(u8 eiw, u16 eig, int pos);
u64 cxl_calculate_hpa_offset(u64 dpa_offset, int pos, u8 eiw, u16 eig);
u64 cxl_calculate_dpa_offset(u64 hpa_offset, u8 eiw, u16 eig);
int cxl_calculate_position(u64 hpa_offset, u8 eiw, u16 eig);
struct cxl_cxims_data {
int nr_maps;
u64 xormaps[] __counted_by(nr_maps);
};
#if IS_ENABLED(CONFIG_CXL_ACPI)
u64 cxl_do_xormap_calc(struct cxl_cxims_data *cximsd, u64 addr, int hbiw);
#else
static inline u64 cxl_do_xormap_calc(struct cxl_cxims_data *cximsd, u64 addr, int hbiw)
{
return ULLONG_MAX;
}
#endif
int cxl_num_decoders_committed(struct cxl_port *port);
bool is_cxl_port(const struct device *dev);
struct cxl_port *to_cxl_port(const struct device *dev);

View File

@@ -127,7 +127,6 @@ static inline bool cxl_pci_flit_256(struct pci_dev *pdev)
return lnksta2 & PCI_EXP_LNKSTA2_FLIT;
}
int devm_cxl_port_enumerate_dports(struct cxl_port *port);
struct cxl_dev_state;
void read_cdat_data(struct cxl_port *port);
void cxl_cor_error_detected(struct pci_dev *pdev);

View File

@@ -136,7 +136,7 @@ static irqreturn_t cxl_pci_mbox_irq(int irq, void *id)
if (opcode == CXL_MBOX_OP_SANITIZE) {
mutex_lock(&cxl_mbox->mbox_mutex);
if (mds->security.sanitize_node)
mod_delayed_work(system_wq, &mds->security.poll_dwork, 0);
mod_delayed_work(system_percpu_wq, &mds->security.poll_dwork, 0);
mutex_unlock(&cxl_mbox->mbox_mutex);
} else {
/* short-circuit the wait in __cxl_pci_mbox_send_cmd() */

View File

@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
dma-fence-unwrap.o dma-resv.o
dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o
obj-$(CONFIG_DMABUF_HEAPS) += heaps/
obj-$(CONFIG_SYNC_FILE) += sync_file.o

View File

@@ -0,0 +1,248 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
* DMA BUF Mapping Helpers
*
*/
#include <linux/dma-buf-mapping.h>
#include <linux/dma-resv.h>
static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
dma_addr_t addr)
{
unsigned int len, nents;
int i;
nents = DIV_ROUND_UP(length, UINT_MAX);
for (i = 0; i < nents; i++) {
len = min_t(size_t, length, UINT_MAX);
length -= len;
/*
* DMABUF repurposes the scatterlist to carry only a DMA list,
* with no CPU list. Always set the page-related fields to NULL
* so importers cannot use them; the phys_addr based DMA API
* does not need the CPU list for mapping or unmapping.
*/
sg_set_page(sgl, NULL, 0, 0);
sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
sg_dma_len(sgl) = len;
sgl = sg_next(sgl);
}
return sgl;
}
static unsigned int calc_sg_nents(struct dma_iova_state *state,
struct dma_buf_phys_vec *phys_vec,
size_t nr_ranges, size_t size)
{
unsigned int nents = 0;
size_t i;
if (!state || !dma_use_iova(state)) {
for (i = 0; i < nr_ranges; i++)
nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
} else {
/*
* In the IOVA case a single SG entry spans the whole IOVA
* range, but each entry's length must fit in sg->length
* (an unsigned int), so more entries may be needed.
*/
nents = DIV_ROUND_UP(size, UINT_MAX);
}
return nents;
}
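For illustration (invented size): a single 6 GiB contiguous range in the non-IOVA path needs DIV_ROUND_UP(6 GiB, UINT_MAX) = 2 entries; fill_sg_entry() then puts UINT_MAX bytes (one byte short of 4 GiB) in the first entry at addr and the remaining bytes in the second entry at addr + UINT_MAX. In the IOVA path the same split is applied once to the total mapping size.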
/**
* struct dma_buf_dma - holds DMA mapping information
* @sgt: Scatter-gather table
* @state: DMA IOVA state relevant in IOMMU-based DMA
* @size: Total size of DMA transfer
*/
struct dma_buf_dma {
struct sg_table sgt;
struct dma_iova_state *state;
size_t size;
};
/**
* dma_buf_phys_vec_to_sgt - Returns the scatter-gather table of the attachment
* built from an array of physical vectors. This function is intended for MMIO
* memory only.
* @attach: [in] attachment whose scatterlist is to be returned
* @provider: [in] p2pdma provider
* @phys_vec: [in] array of physical vectors
* @nr_ranges: [in] number of entries in phys_vec array
* @size: [in] total size of phys_vec
* @dir: [in] direction of DMA transfer
*
* Returns an sg_table containing the DMA scatterlist on success, or an ERR_PTR
* on error. May return -EINTR if interrupted by a signal.
*
* On success, the DMA addresses and lengths in the returned scatterlist are
* PAGE_SIZE aligned.
*
* A mapping must be unmapped by using dma_buf_free_sgt().
*
* NOTE: This function is intended for exporters. If direct traffic routing is
* mandatory, the exporter should check the routing with pci_p2pdma_map_type()
* before calling this function.
*/
struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
struct p2pdma_provider *provider,
struct dma_buf_phys_vec *phys_vec,
size_t nr_ranges, size_t size,
enum dma_data_direction dir)
{
unsigned int nents, mapped_len = 0;
struct dma_buf_dma *dma;
struct scatterlist *sgl;
dma_addr_t addr;
size_t i;
int ret;
/* This function is supposed to work on MMIO memory only */
if (WARN_ON(!attach || !attach->dmabuf || !provider))
return ERR_PTR(-EINVAL);
dma_resv_assert_held(attach->dmabuf->resv);
dma = kzalloc(sizeof(*dma), GFP_KERNEL);
if (!dma)
return ERR_PTR(-ENOMEM);
switch (pci_p2pdma_map_type(provider, attach->dev)) {
case PCI_P2PDMA_MAP_BUS_ADDR:
/*
* No IOVA is needed at all for this flow.
*/
break;
case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
if (!dma->state) {
ret = -ENOMEM;
goto err_free_dma;
}
dma_iova_try_alloc(attach->dev, dma->state, 0, size);
break;
default:
ret = -EINVAL;
goto err_free_dma;
}
nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
if (ret)
goto err_free_state;
sgl = dma->sgt.sgl;
for (i = 0; i < nr_ranges; i++) {
if (!dma->state) {
addr = pci_p2pdma_bus_addr_map(provider,
phys_vec[i].paddr);
} else if (dma_use_iova(dma->state)) {
ret = dma_iova_link(attach->dev, dma->state,
phys_vec[i].paddr, 0,
phys_vec[i].len, dir,
DMA_ATTR_MMIO);
if (ret)
goto err_unmap_dma;
mapped_len += phys_vec[i].len;
} else {
addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
phys_vec[i].len, dir,
DMA_ATTR_MMIO);
ret = dma_mapping_error(attach->dev, addr);
if (ret)
goto err_unmap_dma;
}
if (!dma->state || !dma_use_iova(dma->state))
sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
}
if (dma->state && dma_use_iova(dma->state)) {
WARN_ON_ONCE(mapped_len != size);
ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
if (ret)
goto err_unmap_dma;
sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
}
dma->size = size;
/*
* No CPU list is included; set orig_nents = 0 so importers can
* detect this via the SG table (use nents only).
*/
dma->sgt.orig_nents = 0;
/*
* sgl must be NULL here: the last entry has been filled and
* sg_alloc_table() allocated the correct number of entries.
*/
WARN_ON_ONCE(sgl);
return &dma->sgt;
err_unmap_dma:
if (!i || !dma->state) {
; /* Do nothing */
} else if (dma_use_iova(dma->state)) {
dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
DMA_ATTR_MMIO);
} else {
for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
dma_unmap_phys(attach->dev, sg_dma_address(sgl),
sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
}
sg_free_table(&dma->sgt);
err_free_state:
kfree(dma->state);
err_free_dma:
kfree(dma);
return ERR_PTR(ret);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF");
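A hypothetical exporter-side pairing of the two helpers, for orientation only: the my_* names, provider pointer and BAR address below are invented, and the dma-buf reservation lock is assumed to be held when the callbacks run, as dma_resv_assert_held() above requires.
/* Placeholders standing in for real exporter state */
static struct p2pdma_provider *my_provider;
static phys_addr_t my_bar_phys;

static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
				       enum dma_data_direction dir)
{
	/* Describe one 2 MiB MMIO chunk as a single physical vector */
	struct dma_buf_phys_vec vec = {
		.paddr = my_bar_phys,
		.len = SZ_2M,
	};

	return dma_buf_phys_vec_to_sgt(attach, my_provider, &vec, 1, SZ_2M, dir);
}

static void my_unmap_dma_buf(struct dma_buf_attachment *attach,
			     struct sg_table *sgt,
			     enum dma_data_direction dir)
{
	dma_buf_free_sgt(attach, sgt, dir);
}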
/**
* dma_buf_free_sgt - unmaps the buffer
* @attach: [in] attachment to unmap buffer from
* @sgt: [in] scatterlist info of the buffer to unmap
* @dir: [in] direction of DMA transfer
*
* This unmaps the DMA mapping for @attach that was obtained
* by dma_buf_phys_vec_to_sgt().
*/
void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
enum dma_data_direction dir)
{
struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
int i;
dma_resv_assert_held(attach->dmabuf->resv);
if (!dma->state) {
; /* Do nothing */
} else if (dma_use_iova(dma->state)) {
dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
DMA_ATTR_MMIO);
} else {
struct scatterlist *sgl;
for_each_sgtable_dma_sg(sgt, sgl, i)
dma_unmap_phys(attach->dev, sg_dma_address(sgl),
sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
}
sg_free_table(sgt);
kfree(dma->state);
kfree(dma);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");

View File

@@ -239,6 +239,8 @@ i915-y += \
display/intel_cdclk.o \
display/intel_cmtg.o \
display/intel_color.o \
display/intel_colorop.o \
display/intel_color_pipeline.o \
display/intel_combo_phy.o \
display/intel_connector.o \
display/intel_crtc.o \

View File

@@ -32,6 +32,8 @@
#include "intel_display_utils.h"
#include "intel_dsb.h"
#include "intel_vrr.h"
#include "skl_universal_plane.h"
#include "skl_universal_plane_regs.h"
struct intel_color_funcs {
int (*color_check)(struct intel_atomic_state *state,
@@ -87,6 +89,14 @@ struct intel_color_funcs {
* Read config other than LUTs and CSCs, before them. Optional.
*/
void (*get_config)(struct intel_crtc_state *crtc_state);
/* Plane CSC */
void (*load_plane_csc_matrix)(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state);
/* Plane Pre/Post CSC */
void (*load_plane_luts)(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state);
};
#define CTM_COEFF_SIGN (1ULL << 63)
@@ -609,6 +619,8 @@ static u16 ctm_to_twos_complement(u64 coeff, int int_bits, int frac_bits)
if (CTM_COEFF_NEGATIVE(coeff))
c = -c;
int_bits = max(int_bits, 1);
c = clamp(c, -(s64)BIT(int_bits + frac_bits - 1),
(s64)(BIT(int_bits + frac_bits - 1) - 1));
@@ -3836,6 +3848,266 @@ static void icl_read_luts(struct intel_crtc_state *crtc_state)
}
}
static void
xelpd_load_plane_csc_matrix(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state)
{
struct intel_display *display = to_intel_display(plane_state);
const struct drm_plane_state *state = &plane_state->uapi;
enum pipe pipe = to_intel_plane(state->plane)->pipe;
enum plane_id plane = to_intel_plane(state->plane)->id;
const struct drm_property_blob *blob = plane_state->hw.ctm;
struct drm_color_ctm_3x4 *ctm;
const u64 *input;
u16 coeffs[9] = {};
int i, j;
if (!icl_is_hdr_plane(display, plane) || !blob)
return;
ctm = blob->data;
input = ctm->matrix;
/*
* Convert the S31.32 fixed-point input to the format supported by
* the hardware.
*/
for (i = 0, j = 0; i < ARRAY_SIZE(coeffs); i++) {
u64 abs_coeff = ((1ULL << 63) - 1) & input[j];
/*
* Clamp input value to min/max supported by
* hardware.
*/
abs_coeff = clamp_val(abs_coeff, 0, CTM_COEFF_4_0 - 1);
/* sign bit */
if (CTM_COEFF_NEGATIVE(input[j]))
coeffs[i] |= 1 << 15;
if (abs_coeff < CTM_COEFF_0_125)
coeffs[i] |= (3 << 12) |
ILK_CSC_COEFF_FP(abs_coeff, 12);
else if (abs_coeff < CTM_COEFF_0_25)
coeffs[i] |= (2 << 12) |
ILK_CSC_COEFF_FP(abs_coeff, 11);
else if (abs_coeff < CTM_COEFF_0_5)
coeffs[i] |= (1 << 12) |
ILK_CSC_COEFF_FP(abs_coeff, 10);
else if (abs_coeff < CTM_COEFF_1_0)
coeffs[i] |= ILK_CSC_COEFF_FP(abs_coeff, 9);
else if (abs_coeff < CTM_COEFF_2_0)
coeffs[i] |= (7 << 12) |
ILK_CSC_COEFF_FP(abs_coeff, 8);
else
coeffs[i] |= (6 << 12) |
ILK_CSC_COEFF_FP(abs_coeff, 7);
/* Skip postoffs */
if (!((j + 2) % 4))
j += 2;
else
j++;
}
intel_de_write_dsb(display, dsb, PLANE_CSC_COEFF(pipe, plane, 0),
coeffs[0] << 16 | coeffs[1]);
intel_de_write_dsb(display, dsb, PLANE_CSC_COEFF(pipe, plane, 1),
coeffs[2] << 16);
intel_de_write_dsb(display, dsb, PLANE_CSC_COEFF(pipe, plane, 2),
coeffs[3] << 16 | coeffs[4]);
intel_de_write_dsb(display, dsb, PLANE_CSC_COEFF(pipe, plane, 3),
coeffs[5] << 16);
intel_de_write_dsb(display, dsb, PLANE_CSC_COEFF(pipe, plane, 4),
coeffs[6] << 16 | coeffs[7]);
intel_de_write_dsb(display, dsb, PLANE_CSC_COEFF(pipe, plane, 5),
coeffs[8] << 16);
intel_de_write_dsb(display, dsb, PLANE_CSC_PREOFF(pipe, plane, 0), 0);
intel_de_write_dsb(display, dsb, PLANE_CSC_PREOFF(pipe, plane, 1), 0);
intel_de_write_dsb(display, dsb, PLANE_CSC_PREOFF(pipe, plane, 2), 0);
/*
* Conversion from S31.32 to S0.12. BIT[12] is the sign bit
*/
intel_de_write_dsb(display, dsb,
PLANE_CSC_POSTOFF(pipe, plane, 0),
ctm_to_twos_complement(input[3], 0, 12));
intel_de_write_dsb(display, dsb,
PLANE_CSC_POSTOFF(pipe, plane, 1),
ctm_to_twos_complement(input[7], 0, 12));
intel_de_write_dsb(display, dsb,
PLANE_CSC_POSTOFF(pipe, plane, 2),
ctm_to_twos_complement(input[11], 0, 12));
}
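To make the exponent/mantissa packing above concrete (illustrative values, not from the patch): a coefficient magnitude of 0.5 is 0x0_8000_0000 in S31.32, which is not below CTM_COEFF_0_5, so it lands in the [0.5, 1.0) branch with no exponent bits set and ILK_CSC_COEFF_FP() rounding it at 9 fractional bits. A magnitude of 0.2 (about 0x0_3333_3333) falls in the [0.125, 0.25) branch, so the exponent field is 2 and 11 fractional bits are kept. A negative coefficient additionally sets bit 15 as the sign.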
static void
xelpd_program_plane_pre_csc_lut(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state)
{
struct intel_display *display = to_intel_display(plane_state);
const struct drm_plane_state *state = &plane_state->uapi;
enum pipe pipe = to_intel_plane(state->plane)->pipe;
enum plane_id plane = to_intel_plane(state->plane)->id;
const struct drm_color_lut32 *pre_csc_lut = plane_state->hw.degamma_lut->data;
u32 i, lut_size;
if (icl_is_hdr_plane(display, plane)) {
lut_size = 128;
intel_de_write_dsb(display, dsb,
PLANE_PRE_CSC_GAMC_INDEX_ENH(pipe, plane, 0),
PLANE_PAL_PREC_AUTO_INCREMENT);
if (pre_csc_lut) {
for (i = 0; i < lut_size; i++) {
u32 lut_val = drm_color_lut32_extract(pre_csc_lut[i].green, 24);
intel_de_write_dsb(display, dsb,
PLANE_PRE_CSC_GAMC_DATA_ENH(pipe, plane, 0),
lut_val);
}
/* Program the max register to clamp values > 1.0. */
/* TODO: Restrict to 0x7ffffff */
do {
intel_de_write_dsb(display, dsb,
PLANE_PRE_CSC_GAMC_DATA_ENH(pipe, plane, 0),
(1 << 24));
} while (i++ < 130);
} else {
for (i = 0; i < lut_size; i++) {
u32 v = (i * ((1 << 24) - 1)) / (lut_size - 1);
intel_de_write_dsb(display, dsb,
PLANE_PRE_CSC_GAMC_DATA_ENH(pipe, plane, 0), v);
}
do {
intel_de_write_dsb(display, dsb,
PLANE_PRE_CSC_GAMC_DATA_ENH(pipe, plane, 0),
1 << 24);
} while (i++ < 130);
}
intel_de_write_dsb(display, dsb, PLANE_PRE_CSC_GAMC_INDEX_ENH(pipe, plane, 0), 0);
}
}
static void
xelpd_program_plane_post_csc_lut(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state)
{
struct intel_display *display = to_intel_display(plane_state);
const struct drm_plane_state *state = &plane_state->uapi;
enum pipe pipe = to_intel_plane(state->plane)->pipe;
enum plane_id plane = to_intel_plane(state->plane)->id;
const struct drm_color_lut32 *post_csc_lut = plane_state->hw.gamma_lut->data;
u32 i, lut_size, lut_val;
if (icl_is_hdr_plane(display, plane)) {
intel_de_write_dsb(display, dsb, PLANE_POST_CSC_GAMC_INDEX_ENH(pipe, plane, 0),
PLANE_PAL_PREC_AUTO_INCREMENT);
/* TODO: Add macro */
intel_de_write_dsb(display, dsb, PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH(pipe, plane, 0),
PLANE_PAL_PREC_AUTO_INCREMENT);
if (post_csc_lut) {
lut_size = 32;
for (i = 0; i < lut_size; i++) {
lut_val = drm_color_lut32_extract(post_csc_lut[i].green, 24);
intel_de_write_dsb(display, dsb,
PLANE_POST_CSC_GAMC_DATA_ENH(pipe, plane, 0),
lut_val);
}
/* Segment 2 */
do {
intel_de_write_dsb(display, dsb,
PLANE_POST_CSC_GAMC_DATA_ENH(pipe, plane, 0),
(1 << 24));
} while (i++ < 34);
} else {
/* TODO: Add for segment 0 */
lut_size = 32;
for (i = 0; i < lut_size; i++) {
u32 v = (i * ((1 << 24) - 1)) / (lut_size - 1);
intel_de_write_dsb(display, dsb,
PLANE_POST_CSC_GAMC_DATA_ENH(pipe, plane, 0), v);
}
do {
intel_de_write_dsb(display, dsb,
PLANE_POST_CSC_GAMC_DATA_ENH(pipe, plane, 0),
1 << 24);
} while (i++ < 34);
}
intel_de_write_dsb(display, dsb, PLANE_POST_CSC_GAMC_INDEX_ENH(pipe, plane, 0), 0);
intel_de_write_dsb(display, dsb,
PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH(pipe, plane, 0), 0);
}
}
static void
xelpd_plane_load_luts(struct intel_dsb *dsb, const struct intel_plane_state *plane_state)
{
if (plane_state->hw.degamma_lut)
xelpd_program_plane_pre_csc_lut(dsb, plane_state);
if (plane_state->hw.gamma_lut)
xelpd_program_plane_post_csc_lut(dsb, plane_state);
}
static u32 glk_3dlut_10(const struct drm_color_lut32 *color)
{
return REG_FIELD_PREP(LUT_3D_DATA_RED_MASK, drm_color_lut32_extract(color->red, 10)) |
REG_FIELD_PREP(LUT_3D_DATA_GREEN_MASK, drm_color_lut32_extract(color->green, 10)) |
REG_FIELD_PREP(LUT_3D_DATA_BLUE_MASK, drm_color_lut32_extract(color->blue, 10));
}
static void glk_load_lut_3d(struct intel_dsb *dsb,
struct intel_crtc *crtc,
const struct drm_property_blob *blob)
{
struct intel_display *display = to_intel_display(crtc->base.dev);
const struct drm_color_lut32 *lut = blob->data;
int i, lut_size = drm_color_lut32_size(blob);
enum pipe pipe = crtc->pipe;
if (!dsb && intel_de_read(display, LUT_3D_CTL(pipe)) & LUT_3D_READY) {
drm_err(display->drm, "[CRTC:%d:%s] 3D LUT not ready, not loading LUTs\n",
crtc->base.base.id, crtc->base.name);
return;
}
intel_de_write_dsb(display, dsb, LUT_3D_INDEX(pipe), LUT_3D_AUTO_INCREMENT);
for (i = 0; i < lut_size; i++)
intel_de_write_dsb(display, dsb, LUT_3D_DATA(pipe), glk_3dlut_10(&lut[i]));
intel_de_write_dsb(display, dsb, LUT_3D_INDEX(pipe), 0);
}
static void glk_lut_3d_commit(struct intel_dsb *dsb, struct intel_crtc *crtc, bool enable)
{
struct intel_display *display = to_intel_display(crtc);
enum pipe pipe = crtc->pipe;
u32 val = 0;
if (!dsb && intel_de_read(display, LUT_3D_CTL(pipe)) & LUT_3D_READY) {
drm_err(display->drm, "[CRTC:%d:%s] 3D LUT not ready, not committing change\n",
crtc->base.base.id, crtc->base.name);
return;
}
if (enable)
val = LUT_3D_ENABLE | LUT_3D_READY | LUT_3D_BIND_PLANE_1;
intel_de_write_dsb(display, dsb, LUT_3D_CTL(pipe), val);
}
static const struct intel_color_funcs chv_color_funcs = {
.color_check = chv_color_check,
.color_commit_arm = i9xx_color_commit_arm,
@@ -3883,6 +4155,8 @@ static const struct intel_color_funcs tgl_color_funcs = {
.lut_equal = icl_lut_equal,
.read_csc = icl_read_csc,
.get_config = skl_get_config,
.load_plane_csc_matrix = xelpd_load_plane_csc_matrix,
.load_plane_luts = xelpd_plane_load_luts,
};
static const struct intel_color_funcs icl_color_funcs = {
@@ -3963,6 +4237,67 @@ static const struct intel_color_funcs ilk_color_funcs = {
.get_config = ilk_get_config,
};
void intel_color_plane_commit_arm(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state)
{
struct intel_display *display = to_intel_display(plane_state);
struct intel_crtc *crtc = to_intel_crtc(plane_state->uapi.crtc);
if (crtc && intel_color_crtc_has_3dlut(display, crtc->pipe))
glk_lut_3d_commit(dsb, crtc, !!plane_state->hw.lut_3d);
}
static void
intel_color_load_plane_csc_matrix(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state)
{
struct intel_display *display = to_intel_display(plane_state);
if (display->funcs.color->load_plane_csc_matrix)
display->funcs.color->load_plane_csc_matrix(dsb, plane_state);
}
static void
intel_color_load_plane_luts(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state)
{
struct intel_display *display = to_intel_display(plane_state);
if (display->funcs.color->load_plane_luts)
display->funcs.color->load_plane_luts(dsb, plane_state);
}
bool
intel_color_crtc_has_3dlut(struct intel_display *display, enum pipe pipe)
{
if (DISPLAY_VER(display) >= 12)
return pipe == PIPE_A || pipe == PIPE_B;
else
return false;
}
static void
intel_color_load_3dlut(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state)
{
struct intel_display *display = to_intel_display(plane_state);
struct intel_crtc *crtc = to_intel_crtc(plane_state->uapi.crtc);
if (crtc && intel_color_crtc_has_3dlut(display, crtc->pipe))
glk_load_lut_3d(dsb, crtc, plane_state->hw.lut_3d);
}
void intel_color_plane_program_pipeline(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state)
{
if (plane_state->hw.ctm)
intel_color_load_plane_csc_matrix(dsb, plane_state);
if (plane_state->hw.degamma_lut || plane_state->hw.gamma_lut)
intel_color_load_plane_luts(dsb, plane_state);
if (plane_state->hw.lut_3d)
intel_color_load_3dlut(dsb, plane_state);
}
void intel_color_crtc_init(struct intel_crtc *crtc)
{
struct intel_display *display = to_intel_display(crtc);

View File

@@ -13,7 +13,9 @@ struct intel_crtc_state;
struct intel_crtc;
struct intel_display;
struct intel_dsb;
struct intel_plane_state;
struct drm_property_blob;
enum pipe;
void intel_color_init_hooks(struct intel_display *display);
int intel_color_init(struct intel_display *display);
@@ -40,5 +42,9 @@ bool intel_color_lut_equal(const struct intel_crtc_state *crtc_state,
const struct drm_property_blob *blob2,
bool is_pre_csc_lut);
void intel_color_assert_luts(const struct intel_crtc_state *crtc_state);
void intel_color_plane_program_pipeline(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state);
void intel_color_plane_commit_arm(struct intel_dsb *dsb,
const struct intel_plane_state *plane_state);
bool intel_color_crtc_has_3dlut(struct intel_display *display, enum pipe pipe);
#endif /* __INTEL_COLOR_H__ */

View File

@@ -0,0 +1,99 @@
// SPDX-License-Identifier: MIT
/*
* Copyright © 2025 Intel Corporation
*/
#include "intel_color.h"
#include "intel_colorop.h"
#include "intel_color_pipeline.h"
#include "intel_de.h"
#include "intel_display_types.h"
#include "skl_universal_plane.h"
#define MAX_COLOR_PIPELINES 1
#define PLANE_DEGAMMA_SIZE 128
#define PLANE_GAMMA_SIZE 32
static
int _intel_color_pipeline_plane_init(struct drm_plane *plane, struct drm_prop_enum_list *list,
enum pipe pipe)
{
struct drm_device *dev = plane->dev;
struct intel_display *display = to_intel_display(dev);
struct drm_colorop *prev_op;
struct intel_colorop *colorop;
int ret;
colorop = intel_colorop_create(INTEL_PLANE_CB_PRE_CSC_LUT);
ret = drm_plane_colorop_curve_1d_lut_init(dev, &colorop->base, plane,
PLANE_DEGAMMA_SIZE,
DRM_COLOROP_LUT1D_INTERPOLATION_LINEAR,
DRM_COLOROP_FLAG_ALLOW_BYPASS);
if (ret)
return ret;
list->type = colorop->base.base.id;
list->name = kasprintf(GFP_KERNEL, "Color Pipeline %d", colorop->base.base.id);
/* TODO: handle failures and clean up */
prev_op = &colorop->base;
if (DISPLAY_VER(display) >= 35 &&
intel_color_crtc_has_3dlut(display, pipe) &&
plane->type == DRM_PLANE_TYPE_PRIMARY) {
colorop = intel_colorop_create(INTEL_PLANE_CB_3DLUT);
ret = drm_plane_colorop_3dlut_init(dev, &colorop->base, plane, 17,
DRM_COLOROP_LUT3D_INTERPOLATION_TETRAHEDRAL,
true);
if (ret)
return ret;
drm_colorop_set_next_property(prev_op, &colorop->base);
prev_op = &colorop->base;
}
colorop = intel_colorop_create(INTEL_PLANE_CB_CSC);
ret = drm_plane_colorop_ctm_3x4_init(dev, &colorop->base, plane,
DRM_COLOROP_FLAG_ALLOW_BYPASS);
if (ret)
return ret;
drm_colorop_set_next_property(prev_op, &colorop->base);
prev_op = &colorop->base;
colorop = intel_colorop_create(INTEL_PLANE_CB_POST_CSC_LUT);
ret = drm_plane_colorop_curve_1d_lut_init(dev, &colorop->base, plane,
PLANE_GAMMA_SIZE,
DRM_COLOROP_LUT1D_INTERPOLATION_LINEAR,
DRM_COLOROP_FLAG_ALLOW_BYPASS);
if (ret)
return ret;
drm_colorop_set_next_property(prev_op, &colorop->base);
return 0;
}
int intel_color_pipeline_plane_init(struct drm_plane *plane, enum pipe pipe)
{
struct drm_device *dev = plane->dev;
struct intel_display *display = to_intel_display(dev);
struct drm_prop_enum_list pipelines[MAX_COLOR_PIPELINES];
int len = 0;
int ret;
/* Currently expose pipeline only for HDR planes */
if (!icl_is_hdr_plane(display, to_intel_plane(plane)->id))
return 0;
/* Add pipeline consisting of transfer functions */
ret = _intel_color_pipeline_plane_init(plane, &pipelines[len], pipe);
if (ret)
return ret;
len++;
return drm_plane_create_color_pipeline_property(plane, pipelines, len);
}

View File

@@ -0,0 +1,14 @@
/* SPDX-License-Identifier: MIT */
/*
* Copyright © 2025 Intel Corporation
*/
#ifndef __INTEL_COLOR_PIPELINE_H__
#define __INTEL_COLOR_PIPELINE_H__
struct drm_plane;
enum pipe;
int intel_color_pipeline_plane_init(struct drm_plane *plane, enum pipe pipe);
#endif /* __INTEL_COLOR_PIPELINE_H__ */

View File

@@ -316,4 +316,33 @@
#define SKL_BOTTOM_COLOR_CSC_ENABLE REG_BIT(30)
#define SKL_BOTTOM_COLOR(pipe) _MMIO_PIPE(pipe, _SKL_BOTTOM_COLOR_A, _SKL_BOTTOM_COLOR_B)
/* 3D LUT */
#define _LUT_3D_CTL_A 0x490A4
#define _LUT_3D_CTL_B 0x491A4
#define LUT_3D_CTL(pipe) _MMIO_PIPE(pipe, _LUT_3D_CTL_A, _LUT_3D_CTL_B)
#define LUT_3D_ENABLE REG_BIT(31)
#define LUT_3D_READY REG_BIT(30)
#define LUT_3D_BINDING_MASK REG_GENMASK(23, 22)
#define LUT_3D_BIND_PIPE REG_FIELD_PREP(LUT_3D_BINDING_MASK, 0)
#define LUT_3D_BIND_PLANE_1 REG_FIELD_PREP(LUT_3D_BINDING_MASK, 1)
#define LUT_3D_BIND_PLANE_2 REG_FIELD_PREP(LUT_3D_BINDING_MASK, 2)
#define LUT_3D_BIND_PLANE_3 REG_FIELD_PREP(LUT_3D_BINDING_MASK, 3)
#define _LUT_3D_INDEX_A 0x490A8
#define _LUT_3D_INDEX_B 0x491A8
#define LUT_3D_INDEX(pipe) _MMIO_PIPE(pipe, _LUT_3D_INDEX_A, _LUT_3D_INDEX_B)
#define LUT_3D_AUTO_INCREMENT REG_BIT(13)
#define LUT_3D_INDEX_VALUE_MASK REG_GENMASK(12, 0)
#define LUT_3D_INDEX_VALUE(x) REG_FIELD_PREP(LUT_3D_INDEX_VALUE_MASK, (x))
#define _LUT_3D_DATA_A 0x490AC
#define _LUT_3D_DATA_B 0x491AC
#define LUT_3D_DATA(pipe) _MMIO_PIPE(pipe, _LUT_3D_DATA_A, _LUT_3D_DATA_B)
#define LUT_3D_DATA_RED_MASK REG_GENMASK(29, 20)
#define LUT_3D_DATA_GREEN_MASK REG_GENMASK(19, 10)
#define LUT_3D_DATA_BLUE_MASK REG_GENMASK(9, 0)
#define LUT_3D_DATA_RED(x) REG_FIELD_PREP(LUT_3D_DATA_RED_MASK, (x))
#define LUT_3D_DATA_GREEN(x) REG_FIELD_PREP(LUT_3D_DATA_GREEN_MASK, (x))
#define LUT_3D_DATA_BLUE(x) REG_FIELD_PREP(LUT_3D_DATA_BLUE_MASK, (x))
#endif /* __INTEL_COLOR_REGS_H__ */

View File

@@ -0,0 +1,35 @@
// SPDX-License-Identifier: MIT
/*
* Copyright © 2025 Intel Corporation
*/
#include "intel_colorop.h"
struct intel_colorop *to_intel_colorop(struct drm_colorop *colorop)
{
return container_of(colorop, struct intel_colorop, base);
}
struct intel_colorop *intel_colorop_alloc(void)
{
struct intel_colorop *colorop;
colorop = kzalloc(sizeof(*colorop), GFP_KERNEL);
if (!colorop)
return ERR_PTR(-ENOMEM);
return colorop;
}
struct intel_colorop *intel_colorop_create(enum intel_color_block id)
{
struct intel_colorop *colorop;
colorop = intel_colorop_alloc();
if (IS_ERR(colorop))
return colorop;
colorop->id = id;
return colorop;
}

View File

@@ -0,0 +1,15 @@
/* SPDX-License-Identifier: MIT */
/*
* Copyright © 2025 Intel Corporation
*/
#ifndef __INTEL_COLOROP_H__
#define __INTEL_COLOROP_H__
#include "intel_display_types.h"
struct intel_colorop *to_intel_colorop(struct drm_colorop *colorop);
struct intel_colorop *intel_colorop_alloc(void);
struct intel_colorop *intel_colorop_create(enum intel_color_block id);
#endif /* __INTEL_COLOROP_H__ */

View File

@@ -7304,6 +7304,7 @@ static void intel_atomic_dsb_finish(struct intel_atomic_state *state,
struct intel_display *display = to_intel_display(state);
struct intel_crtc_state *new_crtc_state =
intel_atomic_get_new_crtc_state(state, crtc);
unsigned int size = new_crtc_state->plane_color_changed ? 8192 : 1024;
if (!new_crtc_state->use_flipq &&
!new_crtc_state->use_dsb &&
@@ -7314,10 +7315,12 @@ static void intel_atomic_dsb_finish(struct intel_atomic_state *state,
* Rough estimate:
* ~64 registers per each plane * 8 planes = 512
* Double that for pipe stuff and other overhead.
* ~4913 registers for 3DLUT
* ~200 color registers * 3 HDR planes
*/
new_crtc_state->dsb_commit = intel_dsb_prepare(state, crtc, INTEL_DSB_0,
new_crtc_state->use_dsb ||
new_crtc_state->use_flipq ? 1024 : 16);
new_crtc_state->use_flipq ? size : 16);
if (!new_crtc_state->dsb_commit) {
new_crtc_state->use_flipq = false;
new_crtc_state->use_dsb = false;
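For reference on the new budget: a 17x17x17 3D LUT alone is 17^3 = 4913 register writes, and ~200 plane color registers across 3 HDR planes adds roughly 600 more, so the previous 1024-entry estimate cannot hold a plane color update; 8192 covers 4913 + 600 plus the original ~1024 baseline with headroom.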

View File

@@ -138,4 +138,13 @@ enum hpd_pin {
HPD_NUM_PINS
};
enum intel_color_block {
INTEL_PLANE_CB_PRE_CSC_LUT,
INTEL_PLANE_CB_CSC,
INTEL_PLANE_CB_POST_CSC_LUT,
INTEL_PLANE_CB_3DLUT,
INTEL_CB_MAX
};
#endif /* __INTEL_DISPLAY_LIMITS_H__ */

View File

@@ -646,6 +646,7 @@ struct intel_plane_state {
enum drm_color_encoding color_encoding;
enum drm_color_range color_range;
enum drm_scaling_filter scaling_filter;
struct drm_property_blob *ctm, *degamma_lut, *gamma_lut, *lut_3d;
} hw;
struct i915_vma *ggtt_vma;
@@ -1391,6 +1392,9 @@ struct intel_crtc_state {
u8 silence_period_sym_clocks;
u8 lfps_half_cycle_num_of_syms;
} alpm_state;
/* to track changes in plane color blocks */
bool plane_color_changed;
};
enum intel_pipe_crc_source {
@@ -1985,6 +1989,11 @@ struct intel_dp_mst_encoder {
struct intel_connector *connector;
};
struct intel_colorop {
struct drm_colorop base;
enum intel_color_block id;
};
static inline struct intel_encoder *
intel_attached_encoder(struct intel_connector *connector)
{

View File

@@ -49,6 +49,7 @@
#include "i9xx_plane_regs.h"
#include "intel_cdclk.h"
#include "intel_cursor.h"
#include "intel_colorop.h"
#include "intel_display_rps.h"
#include "intel_display_trace.h"
#include "intel_display_types.h"
@@ -336,6 +337,58 @@ intel_plane_copy_uapi_plane_damage(struct intel_plane_state *new_plane_state,
*damage = drm_plane_state_src(&new_uapi_plane_state->uapi);
}
static bool
intel_plane_colorop_replace_blob(struct intel_plane_state *plane_state,
struct intel_colorop *intel_colorop,
struct drm_property_blob *blob)
{
if (intel_colorop->id == INTEL_PLANE_CB_CSC)
return drm_property_replace_blob(&plane_state->hw.ctm, blob);
else if (intel_colorop->id == INTEL_PLANE_CB_PRE_CSC_LUT)
return drm_property_replace_blob(&plane_state->hw.degamma_lut, blob);
else if (intel_colorop->id == INTEL_PLANE_CB_POST_CSC_LUT)
return drm_property_replace_blob(&plane_state->hw.gamma_lut, blob);
else if (intel_colorop->id == INTEL_PLANE_CB_3DLUT)
return drm_property_replace_blob(&plane_state->hw.lut_3d, blob);
return false;
}
static void
intel_plane_color_copy_uapi_to_hw_state(struct intel_plane_state *plane_state,
const struct intel_plane_state *from_plane_state,
struct intel_crtc *crtc)
{
struct drm_colorop *iter_colorop, *colorop;
struct drm_colorop_state *new_colorop_state;
struct drm_atomic_state *state = plane_state->uapi.state;
struct intel_colorop *intel_colorop;
struct drm_property_blob *blob;
struct intel_atomic_state *intel_atomic_state = to_intel_atomic_state(state);
struct intel_crtc_state *new_crtc_state = intel_atomic_state ?
intel_atomic_get_new_crtc_state(intel_atomic_state, crtc) : NULL;
bool changed = false;
int i = 0;
iter_colorop = plane_state->uapi.color_pipeline;
while (iter_colorop) {
for_each_new_colorop_in_state(state, colorop, new_colorop_state, i) {
if (new_colorop_state->colorop == iter_colorop) {
blob = new_colorop_state->bypass ? NULL : new_colorop_state->data;
intel_colorop = to_intel_colorop(colorop);
changed |= intel_plane_colorop_replace_blob(plane_state,
intel_colorop,
blob);
}
}
iter_colorop = iter_colorop->next;
}
if (new_crtc_state && changed)
new_crtc_state->plane_color_changed = true;
}
void intel_plane_copy_uapi_to_hw_state(struct intel_plane_state *plane_state,
const struct intel_plane_state *from_plane_state,
struct intel_crtc *crtc)
@@ -364,6 +417,8 @@ void intel_plane_copy_uapi_to_hw_state(struct intel_plane_state *plane_state,
plane_state->uapi.src = drm_plane_state_src(&from_plane_state->uapi);
plane_state->uapi.dst = drm_plane_state_dest(&from_plane_state->uapi);
intel_plane_color_copy_uapi_to_hw_state(plane_state, from_plane_state, crtc);
}
void intel_plane_copy_hw_state(struct intel_plane_state *plane_state,

View File

@@ -11,6 +11,8 @@
#include "pxp/intel_pxp.h"
#include "intel_bo.h"
#include "intel_color.h"
#include "intel_color_pipeline.h"
#include "intel_de.h"
#include "intel_display_irq.h"
#include "intel_display_regs.h"
@@ -1275,6 +1277,18 @@ static u32 glk_plane_color_ctl(const struct intel_plane_state *plane_state)
if (plane_state->force_black)
plane_color_ctl |= PLANE_COLOR_PLANE_CSC_ENABLE;
if (plane_state->hw.degamma_lut)
plane_color_ctl |= PLANE_COLOR_PRE_CSC_GAMMA_ENABLE;
if (plane_state->hw.ctm)
plane_color_ctl |= PLANE_COLOR_PLANE_CSC_ENABLE;
if (plane_state->hw.gamma_lut) {
plane_color_ctl &= ~PLANE_COLOR_PLANE_GAMMA_DISABLE;
if (drm_color_lut32_size(plane_state->hw.gamma_lut) != 32)
plane_color_ctl |= PLANE_COLOR_POST_CSC_GAMMA_MULTSEG_ENABLE;
}
return plane_color_ctl;
}
@@ -1556,6 +1570,8 @@ icl_plane_update_noarm(struct intel_dsb *dsb,
plane_color_ctl = plane_state->color_ctl |
glk_plane_color_ctl_crtc(crtc_state);
intel_color_plane_program_pipeline(dsb, plane_state);
/* The scaler will handle the output position */
if (plane_state->scaler_id >= 0) {
crtc_x = 0;
@@ -1657,6 +1673,8 @@ icl_plane_update_arm(struct intel_dsb *dsb,
icl_plane_update_sel_fetch_arm(dsb, plane, crtc_state, plane_state);
intel_color_plane_commit_arm(dsb, plane_state);
/*
* In order to have FBC for fp16 formats pixel normalizer block must be
* active. Check if pixel normalizer block need to be enabled for FBC.
@@ -3001,6 +3019,9 @@ skl_universal_plane_create(struct intel_display *display,
DRM_COLOR_YCBCR_BT709,
DRM_COLOR_YCBCR_LIMITED_RANGE);
if (DISPLAY_VER(display) >= 12)
intel_color_pipeline_plane_init(&plane->base, pipe);
drm_plane_create_alpha_property(&plane->base);
drm_plane_create_blend_mode_property(&plane->base,
BIT(DRM_MODE_BLEND_PIXEL_NONE) |

View File

@@ -254,6 +254,8 @@
#define PLANE_COLOR_PIPE_CSC_ENABLE REG_BIT(23) /* Pre-ICL */
#define PLANE_COLOR_PLANE_CSC_ENABLE REG_BIT(21) /* ICL+ */
#define PLANE_COLOR_INPUT_CSC_ENABLE REG_BIT(20) /* ICL+ */
#define PLANE_COLOR_POST_CSC_GAMMA_MULTSEG_ENABLE REG_BIT(15) /* TGL+ */
#define PLANE_COLOR_PRE_CSC_GAMMA_ENABLE REG_BIT(14)
#define PLANE_COLOR_CSC_MODE_MASK REG_GENMASK(19, 17)
#define PLANE_COLOR_CSC_MODE_BYPASS REG_FIELD_PREP(PLANE_COLOR_CSC_MODE_MASK, 0)
#define PLANE_COLOR_CSC_MODE_YUV601_TO_RGB601 REG_FIELD_PREP(PLANE_COLOR_CSC_MODE_MASK, 1)
@@ -290,6 +292,119 @@
_PLANE_INPUT_CSC_POSTOFF_HI_1_A, _PLANE_INPUT_CSC_POSTOFF_HI_1_B, \
_PLANE_INPUT_CSC_POSTOFF_HI_2_A, _PLANE_INPUT_CSC_POSTOFF_HI_2_B)
#define _MMIO_PLANE_GAMC(plane, i, a, b) _MMIO(_PIPE(plane, a, b) + (i) * 4)
#define _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_1_A 0x70160
#define _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_1_B 0x71160
#define _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_2_A 0x70260
#define _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_2_B 0x71260
#define _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_1(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_1_A, \
_PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_1_B)
#define _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_2(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_2_A, \
_PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_2_B)
#define PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_1(pipe), \
_PLANE_POST_CSC_GAMC_SEG0_INDEX_ENH_2(pipe))
#define _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_1_A 0x70164
#define _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_1_B 0x71164
#define _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_2_A 0x70264
#define _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_2_B 0x71264
#define _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_1(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_1_A, \
_PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_1_B)
#define _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_2(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_2_A, \
_PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_2_B)
#define PLANE_POST_CSC_GAMC_SEG0_DATA_ENH(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_1(pipe), \
_PLANE_POST_CSC_GAMC_SEG0_DATA_ENH_2(pipe))
#define _PLANE_POST_CSC_GAMC_INDEX_ENH_1_A 0x701d8
#define _PLANE_POST_CSC_GAMC_INDEX_ENH_1_B 0x711d8
#define _PLANE_POST_CSC_GAMC_INDEX_ENH_2_A 0x702d8
#define _PLANE_POST_CSC_GAMC_INDEX_ENH_2_B 0x712d8
#define _PLANE_POST_CSC_GAMC_INDEX_ENH_1(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_INDEX_ENH_1_A, \
_PLANE_POST_CSC_GAMC_INDEX_ENH_1_B)
#define _PLANE_POST_CSC_GAMC_INDEX_ENH_2(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_INDEX_ENH_2_A, \
_PLANE_POST_CSC_GAMC_INDEX_ENH_2_B)
#define PLANE_POST_CSC_GAMC_INDEX_ENH(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_POST_CSC_GAMC_INDEX_ENH_1(pipe), \
_PLANE_POST_CSC_GAMC_INDEX_ENH_2(pipe))
#define _PLANE_POST_CSC_GAMC_DATA_ENH_1_A 0x701dc
#define _PLANE_POST_CSC_GAMC_DATA_ENH_1_B 0x711dc
#define _PLANE_POST_CSC_GAMC_DATA_ENH_2_A 0x702dc
#define _PLANE_POST_CSC_GAMC_DATA_ENH_2_B 0x712dc
#define _PLANE_POST_CSC_GAMC_DATA_ENH_1(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_DATA_ENH_1_A, \
_PLANE_POST_CSC_GAMC_DATA_ENH_1_B)
#define _PLANE_POST_CSC_GAMC_DATA_ENH_2(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_DATA_ENH_2_A, \
_PLANE_POST_CSC_GAMC_DATA_ENH_2_B)
#define PLANE_POST_CSC_GAMC_DATA_ENH(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_POST_CSC_GAMC_DATA_ENH_1(pipe), \
_PLANE_POST_CSC_GAMC_DATA_ENH_2(pipe))
#define _PLANE_POST_CSC_GAMC_INDEX_1_A 0x704d8
#define _PLANE_POST_CSC_GAMC_INDEX_1_B 0x714d8
#define _PLANE_POST_CSC_GAMC_INDEX_2_A 0x705d8
#define _PLANE_POST_CSC_GAMC_INDEX_2_B 0x715d8
#define _PLANE_POST_CSC_GAMC_INDEX_1(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_INDEX_1_A, \
_PLANE_POST_CSC_GAMC_INDEX_1_B)
#define _PLANE_POST_CSC_GAMC_INDEX_2(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_INDEX_2_A, \
_PLANE_POST_CSC_GAMC_INDEX_2_B)
#define PLANE_POST_CSC_GAMC_INDEX(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_POST_CSC_GAMC_INDEX_1(pipe), \
_PLANE_POST_CSC_GAMC_INDEX_2(pipe))
#define _PLANE_POST_CSC_GAMC_DATA_1_A 0x704dc
#define _PLANE_POST_CSC_GAMC_DATA_1_B 0x714dc
#define _PLANE_POST_CSC_GAMC_DATA_2_A 0x705dc
#define _PLANE_POST_CSC_GAMC_DATA_2_B 0x715dc
#define _PLANE_POST_CSC_GAMC_DATA_1(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_DATA_1_A, \
_PLANE_POST_CSC_GAMC_DATA_1_B)
#define _PLANE_POST_CSC_GAMC_DATA_2(pipe) _PIPE(pipe, _PLANE_POST_CSC_GAMC_DATA_2_A, \
_PLANE_POST_CSC_GAMC_DATA_2_B)
#define PLANE_POST_CSC_GAMC_DATA(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_POST_CSC_GAMC_DATA_1(pipe), \
_PLANE_POST_CSC_GAMC_DATA_2(pipe))
#define _PLANE_PRE_CSC_GAMC_INDEX_ENH_1_A 0x701d0
#define _PLANE_PRE_CSC_GAMC_INDEX_ENH_1_B 0x711d0
#define _PLANE_PRE_CSC_GAMC_INDEX_ENH_2_A 0x702d0
#define _PLANE_PRE_CSC_GAMC_INDEX_ENH_2_B 0x712d0
#define _PLANE_PRE_CSC_GAMC_INDEX_ENH_1(pipe) _PIPE(pipe, _PLANE_PRE_CSC_GAMC_INDEX_ENH_1_A, \
_PLANE_PRE_CSC_GAMC_INDEX_ENH_1_B)
#define _PLANE_PRE_CSC_GAMC_INDEX_ENH_2(pipe) _PIPE(pipe, _PLANE_PRE_CSC_GAMC_INDEX_ENH_2_A, \
_PLANE_PRE_CSC_GAMC_INDEX_ENH_2_B)
#define PLANE_PRE_CSC_GAMC_INDEX_ENH(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_PRE_CSC_GAMC_INDEX_ENH_1(pipe), \
_PLANE_PRE_CSC_GAMC_INDEX_ENH_2(pipe))
#define PLANE_PAL_PREC_AUTO_INCREMENT REG_BIT(10)
#define _PLANE_PRE_CSC_GAMC_DATA_ENH_1_A 0x701d4
#define _PLANE_PRE_CSC_GAMC_DATA_ENH_1_B 0x711d4
#define _PLANE_PRE_CSC_GAMC_DATA_ENH_2_A 0x702d4
#define _PLANE_PRE_CSC_GAMC_DATA_ENH_2_B 0x712d4
#define _PLANE_PRE_CSC_GAMC_DATA_ENH_1(pipe) _PIPE(pipe, _PLANE_PRE_CSC_GAMC_DATA_ENH_1_A, \
_PLANE_PRE_CSC_GAMC_DATA_ENH_1_B)
#define _PLANE_PRE_CSC_GAMC_DATA_ENH_2(pipe) _PIPE(pipe, _PLANE_PRE_CSC_GAMC_DATA_ENH_2_A, \
_PLANE_PRE_CSC_GAMC_DATA_ENH_2_B)
#define PLANE_PRE_CSC_GAMC_DATA_ENH(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_PRE_CSC_GAMC_DATA_ENH_1(pipe), \
_PLANE_PRE_CSC_GAMC_DATA_ENH_2(pipe))
#define _PLANE_PRE_CSC_GAMC_INDEX_1_A 0x704d0
#define _PLANE_PRE_CSC_GAMC_INDEX_1_B 0x714d0
#define _PLANE_PRE_CSC_GAMC_INDEX_2_A 0x705d0
#define _PLANE_PRE_CSC_GAMC_INDEX_2_B 0x715d0
#define _PLANE_PRE_CSC_GAMC_INDEX_1(pipe) _PIPE(pipe, _PLANE_PRE_CSC_GAMC_INDEX_1_A, \
_PLANE_PRE_CSC_GAMC_INDEX_1_B)
#define _PLANE_PRE_CSC_GAMC_INDEX_2(pipe) _PIPE(pipe, _PLANE_PRE_CSC_GAMC_INDEX_2_A, \
_PLANE_PRE_CSC_GAMC_INDEX_2_B)
#define PLANE_PRE_CSC_GAMC_INDEX(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_PRE_CSC_GAMC_INDEX_1(pipe), \
_PLANE_PRE_CSC_GAMC_INDEX_2(pipe))
#define _PLANE_PRE_CSC_GAMC_DATA_1_A 0x704d4
#define _PLANE_PRE_CSC_GAMC_DATA_1_B 0x714d4
#define _PLANE_PRE_CSC_GAMC_DATA_2_A 0x705d4
#define _PLANE_PRE_CSC_GAMC_DATA_2_B 0x715d4
#define _PLANE_PRE_CSC_GAMC_DATA_1(pipe) _PIPE(pipe, _PLANE_PRE_CSC_GAMC_DATA_1_A, \
_PLANE_PRE_CSC_GAMC_DATA_1_B)
#define _PLANE_PRE_CSC_GAMC_DATA_2(pipe) _PIPE(pipe, _PLANE_PRE_CSC_GAMC_DATA_2_A, \
_PLANE_PRE_CSC_GAMC_DATA_2_B)
#define PLANE_PRE_CSC_GAMC_DATA(pipe, plane, i) _MMIO_PLANE_GAMC(plane, i, _PLANE_PRE_CSC_GAMC_DATA_1(pipe), \
_PLANE_PRE_CSC_GAMC_DATA_2(pipe))
#define _PLANE_CSC_RY_GY_1_A 0x70210
#define _PLANE_CSC_RY_GY_2_A 0x70310
#define _PLANE_CSC_RY_GY_1_B 0x71210

View File

@@ -1141,6 +1141,122 @@ static int intel_vgpu_set_irqs(struct intel_vgpu *vgpu, u32 flags,
return func(vgpu, index, start, count, flags, data);
}
static int intel_vgpu_ioctl_get_region_info(struct vfio_device *vfio_dev,
struct vfio_region_info *info,
struct vfio_info_cap *caps)
{
struct vfio_region_info_cap_sparse_mmap *sparse = NULL;
struct intel_vgpu *vgpu = vfio_dev_to_vgpu(vfio_dev);
int nr_areas = 1;
int cap_type_id;
unsigned int i;
int ret;
switch (info->index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
info->size = vgpu->gvt->device_info.cfg_space_size;
info->flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
break;
case VFIO_PCI_BAR0_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
info->size = vgpu->cfg_space.bar[info->index].size;
if (!info->size) {
info->flags = 0;
break;
}
info->flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
break;
case VFIO_PCI_BAR1_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
info->size = 0;
info->flags = 0;
break;
case VFIO_PCI_BAR2_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
info->flags = VFIO_REGION_INFO_FLAG_CAPS |
VFIO_REGION_INFO_FLAG_MMAP |
VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
info->size = gvt_aperture_sz(vgpu->gvt);
sparse = kzalloc(struct_size(sparse, areas, nr_areas),
GFP_KERNEL);
if (!sparse)
return -ENOMEM;
sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
sparse->header.version = 1;
sparse->nr_areas = nr_areas;
cap_type_id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
sparse->areas[0].offset =
PAGE_ALIGN(vgpu_aperture_offset(vgpu));
sparse->areas[0].size = vgpu_aperture_sz(vgpu);
break;
case VFIO_PCI_BAR3_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
info->size = 0;
info->flags = 0;
gvt_dbg_core("get region info bar:%d\n", info->index);
break;
case VFIO_PCI_ROM_REGION_INDEX:
case VFIO_PCI_VGA_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
info->size = 0;
info->flags = 0;
gvt_dbg_core("get region info index:%d\n", info->index);
break;
default: {
struct vfio_region_info_cap_type cap_type = {
.header.id = VFIO_REGION_INFO_CAP_TYPE,
.header.version = 1
};
if (info->index >= VFIO_PCI_NUM_REGIONS + vgpu->num_regions)
return -EINVAL;
info->index = array_index_nospec(
info->index, VFIO_PCI_NUM_REGIONS + vgpu->num_regions);
i = info->index - VFIO_PCI_NUM_REGIONS;
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
info->size = vgpu->region[i].size;
info->flags = vgpu->region[i].flags;
cap_type.type = vgpu->region[i].type;
cap_type.subtype = vgpu->region[i].subtype;
ret = vfio_info_add_capability(caps, &cap_type.header,
sizeof(cap_type));
if (ret)
return ret;
}
}
if ((info->flags & VFIO_REGION_INFO_FLAG_CAPS) && sparse) {
ret = -EINVAL;
if (cap_type_id == VFIO_REGION_INFO_CAP_SPARSE_MMAP) {
ret = vfio_info_add_capability(
caps, &sparse->header,
struct_size(sparse, areas, sparse->nr_areas));
}
if (ret) {
kfree(sparse);
return ret;
}
}
kfree(sparse);
return 0;
}
static long intel_vgpu_ioctl(struct vfio_device *vfio_dev, unsigned int cmd,
unsigned long arg)
{
@@ -1169,152 +1285,6 @@ static long intel_vgpu_ioctl(struct vfio_device *vfio_dev, unsigned int cmd,
return copy_to_user((void __user *)arg, &info, minsz) ?
-EFAULT : 0;
} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
struct vfio_region_info info;
struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
unsigned int i;
int ret;
struct vfio_region_info_cap_sparse_mmap *sparse = NULL;
int nr_areas = 1;
int cap_type_id;
minsz = offsetofend(struct vfio_region_info, offset);
if (copy_from_user(&info, (void __user *)arg, minsz))
return -EFAULT;
if (info.argsz < minsz)
return -EINVAL;
switch (info.index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = vgpu->gvt->device_info.cfg_space_size;
info.flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
break;
case VFIO_PCI_BAR0_REGION_INDEX:
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = vgpu->cfg_space.bar[info.index].size;
if (!info.size) {
info.flags = 0;
break;
}
info.flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
break;
case VFIO_PCI_BAR1_REGION_INDEX:
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = 0;
info.flags = 0;
break;
case VFIO_PCI_BAR2_REGION_INDEX:
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.flags = VFIO_REGION_INFO_FLAG_CAPS |
VFIO_REGION_INFO_FLAG_MMAP |
VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE;
info.size = gvt_aperture_sz(vgpu->gvt);
sparse = kzalloc(struct_size(sparse, areas, nr_areas),
GFP_KERNEL);
if (!sparse)
return -ENOMEM;
sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
sparse->header.version = 1;
sparse->nr_areas = nr_areas;
cap_type_id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
sparse->areas[0].offset =
PAGE_ALIGN(vgpu_aperture_offset(vgpu));
sparse->areas[0].size = vgpu_aperture_sz(vgpu);
break;
case VFIO_PCI_BAR3_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = 0;
info.flags = 0;
gvt_dbg_core("get region info bar:%d\n", info.index);
break;
case VFIO_PCI_ROM_REGION_INDEX:
case VFIO_PCI_VGA_REGION_INDEX:
info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = 0;
info.flags = 0;
gvt_dbg_core("get region info index:%d\n", info.index);
break;
default:
{
struct vfio_region_info_cap_type cap_type = {
.header.id = VFIO_REGION_INFO_CAP_TYPE,
.header.version = 1 };
if (info.index >= VFIO_PCI_NUM_REGIONS +
vgpu->num_regions)
return -EINVAL;
info.index =
array_index_nospec(info.index,
VFIO_PCI_NUM_REGIONS +
vgpu->num_regions);
i = info.index - VFIO_PCI_NUM_REGIONS;
info.offset =
VFIO_PCI_INDEX_TO_OFFSET(info.index);
info.size = vgpu->region[i].size;
info.flags = vgpu->region[i].flags;
cap_type.type = vgpu->region[i].type;
cap_type.subtype = vgpu->region[i].subtype;
ret = vfio_info_add_capability(&caps,
&cap_type.header,
sizeof(cap_type));
if (ret)
return ret;
}
}
if ((info.flags & VFIO_REGION_INFO_FLAG_CAPS) && sparse) {
ret = -EINVAL;
if (cap_type_id == VFIO_REGION_INFO_CAP_SPARSE_MMAP)
ret = vfio_info_add_capability(&caps,
&sparse->header,
struct_size(sparse, areas,
sparse->nr_areas));
if (ret) {
kfree(sparse);
return ret;
}
}
if (caps.size) {
info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
if (info.argsz < sizeof(info) + caps.size) {
info.argsz = sizeof(info) + caps.size;
info.cap_offset = 0;
} else {
vfio_info_cap_shift(&caps, sizeof(info));
if (copy_to_user((void __user *)arg +
sizeof(info), caps.buf,
caps.size)) {
kfree(caps.buf);
kfree(sparse);
return -EFAULT;
}
info.cap_offset = sizeof(info);
}
kfree(caps.buf);
}
kfree(sparse);
return copy_to_user((void __user *)arg, &info, minsz) ?
-EFAULT : 0;
} else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
struct vfio_irq_info info;
@@ -1477,6 +1447,7 @@ static const struct vfio_device_ops intel_vgpu_dev_ops = {
.write = intel_vgpu_write,
.mmap = intel_vgpu_mmap,
.ioctl = intel_vgpu_ioctl,
.get_region_info_caps = intel_vgpu_ioctl_get_region_info,
.dma_unmap = intel_vgpu_dma_unmap,
.bind_iommufd = vfio_iommufd_emulated_bind,
.unbind_iommufd = vfio_iommufd_emulated_unbind,

View File

@@ -184,6 +184,10 @@ xe-$(CONFIG_PCI_IOV) += \
xe_sriov_pf_sysfs.o \
xe_tile_sriov_pf_debugfs.o
ifdef CONFIG_XE_VFIO_PCI
xe-$(CONFIG_PCI_IOV) += xe_sriov_vfio.o
endif
# include helpers for tests even when XE is built-in
ifdef CONFIG_DRM_XE_KUNIT_TEST
xe-y += tests/xe_kunit_helpers.o
@@ -242,6 +246,8 @@ xe-$(CONFIG_DRM_XE_DISPLAY) += \
i915-display/intel_cdclk.o \
i915-display/intel_cmtg.o \
i915-display/intel_color.o \
i915-display/intel_colorop.o \
i915-display/intel_color_pipeline.o \
i915-display/intel_combo_phy.o \
i915-display/intel_connector.o \
i915-display/intel_crtc.o \

View File

@@ -54,13 +54,14 @@ static inline void xe_sched_tdr_queue_imm(struct xe_gpu_scheduler *sched)
static inline void xe_sched_resubmit_jobs(struct xe_gpu_scheduler *sched)
{
struct drm_sched_job *s_job;
bool restore_replay = false;
list_for_each_entry(s_job, &sched->base.pending_list, list) {
struct drm_sched_fence *s_fence = s_job->s_fence;
struct dma_fence *hw_fence = s_fence->parent;
if (to_xe_sched_job(s_job)->skip_emit ||
(hw_fence && !dma_fence_is_signaled(hw_fence)))
restore_replay |= to_xe_sched_job(s_job)->restore_replay;
if (restore_replay || (hw_fence && !dma_fence_is_signaled(hw_fence)))
sched->base.ops->run_job(s_job);
}
}

View File

@@ -711,7 +711,7 @@ static u64 pf_profile_fair_ggtt(struct xe_gt *gt, unsigned int num_vfs)
if (num_vfs > 56)
return SZ_64M - SZ_8M;
return rounddown_pow_of_two(shareable / num_vfs);
return rounddown_pow_of_two(div_u64(shareable, num_vfs));
}
/**

View File

@@ -17,6 +17,7 @@
#include "xe_gt_sriov_pf_helpers.h"
#include "xe_gt_sriov_pf_migration.h"
#include "xe_gt_sriov_printk.h"
#include "xe_guc.h"
#include "xe_guc_buf.h"
#include "xe_guc_ct.h"
#include "xe_migrate.h"
@@ -1023,6 +1024,12 @@ static void action_ring_cleanup(void *arg)
ptr_ring_cleanup(r, destroy_pf_packet);
}
static void pf_gt_migration_check_support(struct xe_gt *gt)
{
if (GUC_FIRMWARE_VER(&gt->uc.guc) < MAKE_GUC_VER(70, 54, 0))
xe_sriov_pf_migration_disable(gt_to_xe(gt), "requires GuC version >= 70.54.0");
}
/**
* xe_gt_sriov_pf_migration_init() - Initialize support for VF migration.
* @gt: the &xe_gt
@@ -1039,6 +1046,8 @@ int xe_gt_sriov_pf_migration_init(struct xe_gt *gt)
xe_gt_assert(gt, IS_SRIOV_PF(xe));
pf_gt_migration_check_support(gt);
if (!pf_migration_supported(gt))
return 0;

View File

@@ -822,7 +822,7 @@ static void submit_exec_queue(struct xe_exec_queue *q, struct xe_sched_job *job)
xe_gt_assert(guc_to_gt(guc), exec_queue_registered(q));
if (!job->skip_emit || job->last_replay) {
if (!job->restore_replay || job->last_replay) {
if (xe_exec_queue_is_parallel(q))
wq_item_append(q);
else
@@ -881,10 +881,10 @@ guc_exec_queue_run_job(struct drm_sched_job *drm_job)
if (!killed_or_banned_or_wedged && !xe_sched_job_is_error(job)) {
if (!exec_queue_registered(q))
register_exec_queue(q, GUC_CONTEXT_NORMAL);
if (!job->skip_emit)
if (!job->restore_replay)
q->ring_ops->emit_job(job);
submit_exec_queue(q, job);
job->skip_emit = false;
job->restore_replay = false;
}
/*
@@ -2112,6 +2112,18 @@ static void guc_exec_queue_revert_pending_state_change(struct xe_guc *guc,
q->guc->resume_time = 0;
}
static void lrc_parallel_clear(struct xe_lrc *lrc)
{
struct xe_device *xe = gt_to_xe(lrc->gt);
struct iosys_map map = xe_lrc_parallel_map(lrc);
int i;
for (i = 0; i < WQ_SIZE / sizeof(u32); ++i)
parallel_write(xe, map, wq[i],
FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
FIELD_PREP(WQ_LEN_MASK, 0));
}
/*
* This function is quite complex but is the only real way to ensure no state is lost
* during VF resume flows. The function scans the queue state, makes adjustments
@@ -2135,8 +2147,8 @@ static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
guc_exec_queue_revert_pending_state_change(guc, q);
if (xe_exec_queue_is_parallel(q)) {
struct xe_device *xe = guc_to_xe(guc);
struct iosys_map map = xe_lrc_parallel_map(q->lrc[0]);
/* Pairs with WRITE_ONCE in __xe_exec_queue_init */
struct xe_lrc *lrc = READ_ONCE(q->lrc[0]);
/*
* NOP existing WQ commands that may contain stale GGTT
@@ -2144,14 +2156,14 @@ static void guc_exec_queue_pause(struct xe_guc *guc, struct xe_exec_queue *q)
* seems to get confused if the WQ head/tail pointers are
* adjusted.
*/
for (i = 0; i < WQ_SIZE / sizeof(u32); ++i)
parallel_write(xe, map, wq[i],
FIELD_PREP(WQ_TYPE_MASK, WQ_TYPE_NOOP) |
FIELD_PREP(WQ_LEN_MASK, 0));
if (lrc)
lrc_parallel_clear(lrc);
}
job = xe_sched_first_pending_job(sched);
if (job) {
job->restore_replay = true;
/*
* Adjust software tail so jobs submitted overwrite previous
* position in ring buffer with new GGTT addresses.
@@ -2241,17 +2253,18 @@ static void guc_exec_queue_unpause_prepare(struct xe_guc *guc,
struct xe_exec_queue *q)
{
struct xe_gpu_scheduler *sched = &q->guc->sched;
struct drm_sched_job *s_job;
struct xe_sched_job *job = NULL;
bool restore_replay = false;
list_for_each_entry(s_job, &sched->base.pending_list, list) {
job = to_xe_sched_job(s_job);
list_for_each_entry(job, &sched->base.pending_list, drm.list) {
restore_replay |= job->restore_replay;
if (restore_replay) {
xe_gt_dbg(guc_to_gt(guc), "Replay JOB - guc_id=%d, seqno=%d",
q->guc->id, xe_sched_job_seqno(job));
xe_gt_dbg(guc_to_gt(guc), "Replay JOB - guc_id=%d, seqno=%d",
q->guc->id, xe_sched_job_seqno(job));
q->ring_ops->emit_job(job);
job->skip_emit = true;
q->ring_ops->emit_job(job);
job->restore_replay = true;
}
}
if (job)

View File

@@ -102,7 +102,6 @@ retry_userptr:
/* Lock VM and BOs dma-resv */
xe_validation_ctx_init(&ctx, &vm->xe->val, &exec, (struct xe_val_flags) {});
drm_exec_init(&exec, 0, 0);
drm_exec_until_all_locked(&exec) {
err = xe_pagefault_begin(&exec, vma, tile->mem.vram,
needs_vram == 1);

View File

@@ -1223,6 +1223,23 @@ static struct pci_driver xe_pci_driver = {
#endif
};
/**
* xe_pci_to_pf_device() - Get PF &xe_device.
* @pdev: the VF &pci_dev device
*
* Return: pointer to PF &xe_device, NULL otherwise.
*/
struct xe_device *xe_pci_to_pf_device(struct pci_dev *pdev)
{
struct drm_device *drm;
drm = pci_iov_get_pf_drvdata(pdev, &xe_pci_driver);
if (IS_ERR(drm))
return NULL;
return to_xe_device(drm);
}
int xe_register_pci_driver(void)
{
return pci_register_driver(&xe_pci_driver);

View File

@@ -6,7 +6,10 @@
#ifndef _XE_PCI_H_
#define _XE_PCI_H_
struct pci_dev;
int xe_register_pci_driver(void);
void xe_unregister_pci_driver(void);
struct xe_device *xe_pci_to_pf_device(struct pci_dev *pdev);
#endif

View File

@@ -726,6 +726,13 @@ static void xe_pm_runtime_lockdep_prime(void)
/**
* xe_pm_runtime_get - Get a runtime_pm reference and resume synchronously
* @xe: xe device instance
*
* When possible, scope-based runtime PM (through guard(xe_pm_runtime)) should
* be preferred over direct usage of this function. Manual get/put handling
* should only be used when the function contains goto-based logic which
* can break scope-based handling, or when the lifetime of the runtime PM
* reference does not match a specific scope (e.g., runtime PM obtained in one
* function and released in a different one).
*/
void xe_pm_runtime_get(struct xe_device *xe)
{
@@ -758,6 +765,13 @@ void xe_pm_runtime_put(struct xe_device *xe)
* xe_pm_runtime_get_ioctl - Get a runtime_pm reference before ioctl
* @xe: xe device instance
*
* When possible, scope-based runtime PM (through
* ACQUIRE(xe_pm_runtime_ioctl, ...)) should be preferred over direct usage of this
* function. Manual get/put handling should only be used when the function
* contains goto-based logic which can break scope-based handling, or when the
* lifetime of the runtime PM reference does not match a specific scope (e.g.,
* runtime PM obtained in one function and released in a different one).
*
* Returns: Any number greater than or equal to 0 for success, negative error
* code otherwise.
*/
@@ -827,6 +841,13 @@ static bool xe_pm_suspending_or_resuming(struct xe_device *xe)
* It will warn if not protected.
* The reference should be put back after this function regardless, since it
* will always bump the usage counter.
*
* When possible, scope-based runtime PM (through guard(xe_pm_runtime_noresume))
* should be preferred over direct usage of this function. Manual get/put handling
* should only be used when the function contains goto-based logic which can
* break scope-based handling, or when the lifetime of the runtime PM reference
* does not match a specific scope (e.g., runtime PM obtained in one function
* and released in a different one).
*/
void xe_pm_runtime_get_noresume(struct xe_device *xe)
{

View File

@@ -6,6 +6,7 @@
#ifndef _XE_PM_H_
#define _XE_PM_H_
#include <linux/cleanup.h>
#include <linux/pm_runtime.h>
#define DEFAULT_VRAM_THRESHOLD 300 /* in MB */
@@ -37,4 +38,20 @@ int xe_pm_block_on_suspend(struct xe_device *xe);
void xe_pm_might_block_on_suspend(void);
int xe_pm_module_init(void);
static inline void __xe_pm_runtime_noop(struct xe_device *xe) {}
DEFINE_GUARD(xe_pm_runtime, struct xe_device *,
xe_pm_runtime_get(_T), xe_pm_runtime_put(_T))
DEFINE_GUARD(xe_pm_runtime_noresume, struct xe_device *,
xe_pm_runtime_get_noresume(_T), xe_pm_runtime_put(_T))
DEFINE_GUARD_COND(xe_pm_runtime, _ioctl, xe_pm_runtime_get_ioctl(_T), _RET >= 0)
/*
* Used when a function needs to release runtime PM in all possible cases
* and error paths, but the wakeref was already acquired by a different
* function (i.e., get() has already happened so only a put() is needed).
*/
DEFINE_GUARD(xe_pm_runtime_release_only, struct xe_device *,
__xe_pm_runtime_noop(_T), xe_pm_runtime_put(_T));
#endif
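For readers less familiar with cleanup.h guards, here is a minimal sketch of how a caller might use the scope-based helpers defined above instead of a manual get/put pair; the caller and do_hw_query() are hypothetical and not part of the patch.
/* Hypothetical caller, for illustration only. */
#include <linux/cleanup.h>
#include "xe_pm.h"
static int example_query(struct xe_device *xe)
{
	/* Takes a wakeref here; xe_pm_runtime_put() runs automatically on
	 * every return path, so no goto-based cleanup is needed.
	 */
	guard(xe_pm_runtime)(xe);
	return do_hw_query(xe);	/* assumed helper */
}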

View File

@@ -63,8 +63,8 @@ struct xe_sched_job {
bool ring_ops_flush_tlb;
/** @ggtt: mapped in ggtt. */
bool ggtt;
/** @skip_emit: skip emitting the job */
bool skip_emit;
/** @restore_replay: job being replayed for restore */
bool restore_replay;
/** @last_replay: last job being replayed */
bool last_replay;
/** @ptrs: per instance pointers. */

View File

@@ -46,13 +46,37 @@ bool xe_sriov_pf_migration_supported(struct xe_device *xe)
{
xe_assert(xe, IS_SRIOV_PF(xe));
return xe->sriov.pf.migration.supported;
return IS_ENABLED(CONFIG_DRM_XE_DEBUG) || !xe->sriov.pf.migration.disabled;
}
static bool pf_check_migration_support(struct xe_device *xe)
/**
* xe_sriov_pf_migration_disable() - Turn off SR-IOV VF migration support on PF.
* @xe: the &xe_device instance.
* @fmt: format string for the log message, to be combined with following VAs.
*/
void xe_sriov_pf_migration_disable(struct xe_device *xe, const char *fmt, ...)
{
/* XXX: for now this is for feature enabling only */
return IS_ENABLED(CONFIG_DRM_XE_DEBUG);
struct va_format vaf;
va_list va_args;
xe_assert(xe, IS_SRIOV_PF(xe));
va_start(va_args, fmt);
vaf.fmt = fmt;
vaf.va = &va_args;
xe_sriov_notice(xe, "migration %s: %pV\n",
IS_ENABLED(CONFIG_DRM_XE_DEBUG) ?
"missing prerequisite" : "disabled",
&vaf);
va_end(va_args);
xe->sriov.pf.migration.disabled = true;
}
static void pf_migration_check_support(struct xe_device *xe)
{
if (!xe_device_has_memirq(xe))
xe_sriov_pf_migration_disable(xe, "requires memory-based IRQ support");
}
static void pf_migration_cleanup(void *arg)
@@ -77,7 +101,8 @@ int xe_sriov_pf_migration_init(struct xe_device *xe)
xe_assert(xe, IS_SRIOV_PF(xe));
xe->sriov.pf.migration.supported = pf_check_migration_support(xe);
pf_migration_check_support(xe);
if (!xe_sriov_pf_migration_supported(xe))
return 0;

View File

@@ -14,6 +14,7 @@ struct xe_sriov_packet;
int xe_sriov_pf_migration_init(struct xe_device *xe);
bool xe_sriov_pf_migration_supported(struct xe_device *xe);
void xe_sriov_pf_migration_disable(struct xe_device *xe, const char *fmt, ...);
int xe_sriov_pf_migration_restore_produce(struct xe_device *xe, unsigned int vfid,
struct xe_sriov_packet *data);
struct xe_sriov_packet *

View File

@@ -14,8 +14,8 @@
* struct xe_sriov_pf_migration - Xe device level VF migration data
*/
struct xe_sriov_pf_migration {
/** @supported: indicates whether VF migration feature is supported */
bool supported;
/** @disabled: indicates whether VF migration feature is disabled */
bool disabled;
};
/**

View File

@@ -0,0 +1,80 @@
// SPDX-License-Identifier: MIT
/*
* Copyright © 2025 Intel Corporation
*/
#include <drm/intel/xe_sriov_vfio.h>
#include <linux/cleanup.h>
#include "xe_pci.h"
#include "xe_pm.h"
#include "xe_sriov_pf_control.h"
#include "xe_sriov_pf_helpers.h"
#include "xe_sriov_pf_migration.h"
struct xe_device *xe_sriov_vfio_get_pf(struct pci_dev *pdev)
{
return xe_pci_to_pf_device(pdev);
}
EXPORT_SYMBOL_FOR_MODULES(xe_sriov_vfio_get_pf, "xe-vfio-pci");
bool xe_sriov_vfio_migration_supported(struct xe_device *xe)
{
if (!IS_SRIOV_PF(xe))
return -EPERM;
return xe_sriov_pf_migration_supported(xe);
}
EXPORT_SYMBOL_FOR_MODULES(xe_sriov_vfio_migration_supported, "xe-vfio-pci");
#define DEFINE_XE_SRIOV_VFIO_FUNCTION(_type, _func, _impl) \
_type xe_sriov_vfio_##_func(struct xe_device *xe, unsigned int vfid) \
{ \
if (!IS_SRIOV_PF(xe)) \
return -EPERM; \
if (vfid == PFID || vfid > xe_sriov_pf_num_vfs(xe)) \
return -EINVAL; \
\
guard(xe_pm_runtime_noresume)(xe); \
\
return xe_sriov_pf_##_impl(xe, vfid); \
} \
EXPORT_SYMBOL_FOR_MODULES(xe_sriov_vfio_##_func, "xe-vfio-pci")
DEFINE_XE_SRIOV_VFIO_FUNCTION(int, wait_flr_done, control_wait_flr);
DEFINE_XE_SRIOV_VFIO_FUNCTION(int, suspend_device, control_pause_vf);
DEFINE_XE_SRIOV_VFIO_FUNCTION(int, resume_device, control_resume_vf);
DEFINE_XE_SRIOV_VFIO_FUNCTION(int, stop_copy_enter, control_trigger_save_vf);
DEFINE_XE_SRIOV_VFIO_FUNCTION(int, stop_copy_exit, control_finish_save_vf);
DEFINE_XE_SRIOV_VFIO_FUNCTION(int, resume_data_enter, control_trigger_restore_vf);
DEFINE_XE_SRIOV_VFIO_FUNCTION(int, resume_data_exit, control_finish_restore_vf);
DEFINE_XE_SRIOV_VFIO_FUNCTION(int, error, control_stop_vf);
DEFINE_XE_SRIOV_VFIO_FUNCTION(ssize_t, stop_copy_size, migration_size);
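As a reading aid, this is roughly what one of the instantiations above expands to (mechanical expansion of the macro body for suspend_device / control_pause_vf):
int xe_sriov_vfio_suspend_device(struct xe_device *xe, unsigned int vfid)
{
	if (!IS_SRIOV_PF(xe))
		return -EPERM;
	if (vfid == PFID || vfid > xe_sriov_pf_num_vfs(xe))
		return -EINVAL;
	guard(xe_pm_runtime_noresume)(xe);
	return xe_sriov_pf_control_pause_vf(xe, vfid);
}
EXPORT_SYMBOL_FOR_MODULES(xe_sriov_vfio_suspend_device, "xe-vfio-pci");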
ssize_t xe_sriov_vfio_data_read(struct xe_device *xe, unsigned int vfid,
char __user *buf, size_t len)
{
if (!IS_SRIOV_PF(xe))
return -EPERM;
if (vfid == PFID || vfid > xe_sriov_pf_num_vfs(xe))
return -EINVAL;
guard(xe_pm_runtime_noresume)(xe);
return xe_sriov_pf_migration_read(xe, vfid, buf, len);
}
EXPORT_SYMBOL_FOR_MODULES(xe_sriov_vfio_data_read, "xe-vfio-pci");
ssize_t xe_sriov_vfio_data_write(struct xe_device *xe, unsigned int vfid,
const char __user *buf, size_t len)
{
if (!IS_SRIOV_PF(xe))
return -EPERM;
if (vfid == PFID || vfid > xe_sriov_pf_num_vfs(xe))
return -EINVAL;
guard(xe_pm_runtime_noresume)(xe);
return xe_sriov_pf_migration_write(xe, vfid, buf, len);
}
EXPORT_SYMBOL_FOR_MODULES(xe_sriov_vfio_data_write, "xe-vfio-pci");

View File

@@ -80,6 +80,7 @@ config INFINIBAND_VIRT_DMA
if INFINIBAND_USER_ACCESS || !INFINIBAND_USER_ACCESS
if !UML
source "drivers/infiniband/hw/bnxt_re/Kconfig"
source "drivers/infiniband/hw/bng_re/Kconfig"
source "drivers/infiniband/hw/cxgb4/Kconfig"
source "drivers/infiniband/hw/efa/Kconfig"
source "drivers/infiniband/hw/erdma/Kconfig"

View File

@@ -34,7 +34,6 @@ MODULE_AUTHOR("Sean Hefty");
MODULE_DESCRIPTION("InfiniBand CM");
MODULE_LICENSE("Dual BSD/GPL");
#define CM_DESTROY_ID_WAIT_TIMEOUT 10000 /* msecs */
#define CM_DIRECT_RETRY_CTX ((void *) 1UL)
#define CM_MRA_SETTING 24 /* 4.096us * 2^24 = ~68.7 seconds */
@@ -1057,6 +1056,7 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
{
struct cm_id_private *cm_id_priv;
enum ib_cm_state old_state;
unsigned long timeout;
struct cm_work *work;
int ret;
@@ -1167,10 +1167,9 @@ retest:
xa_erase(&cm.local_id_table, cm_local_id(cm_id->local_id));
cm_deref_id(cm_id_priv);
timeout = msecs_to_jiffies((cm_id_priv->max_cm_retries * cm_id_priv->timeout_ms * 5) / 4);
do {
ret = wait_for_completion_timeout(&cm_id_priv->comp,
msecs_to_jiffies(
CM_DESTROY_ID_WAIT_TIMEOUT));
ret = wait_for_completion_timeout(&cm_id_priv->comp, timeout);
if (!ret) /* timeout happened */
cm_destroy_id_wait_timeout(cm_id, old_state);
} while (!ret);
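As a rough worked example of the new per-connection timeout (illustrative numbers, not taken from the patch): with max_cm_retries = 15 and timeout_ms = 4000, the wait becomes msecs_to_jiffies((15 * 4000 * 5) / 4) = msecs_to_jiffies(75000), i.e. about 75 seconds per wait_for_completion_timeout() call, instead of the previous fixed 10-second CM_DESTROY_ID_WAIT_TIMEOUT.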
@@ -4518,7 +4517,7 @@ static int __init ib_cm_init(void)
get_random_bytes(&cm.random_id_operand, sizeof cm.random_id_operand);
INIT_LIST_HEAD(&cm.timewait_list);
cm.wq = alloc_workqueue("ib_cm", 0, 1);
cm.wq = alloc_workqueue("ib_cm", WQ_PERCPU, 1);
if (!cm.wq) {
ret = -ENOMEM;
goto error2;

View File

@@ -4475,6 +4475,8 @@ int rdma_connect_locked(struct rdma_cm_id *id,
container_of(id, struct rdma_id_private, id);
int ret;
lockdep_assert_held(&id_priv->handler_mutex);
if (!cma_comp_exch(id_priv, RDMA_CM_ROUTE_RESOLVED, RDMA_CM_CONNECT))
return -EINVAL;

View File

@@ -3021,7 +3021,7 @@ static int __init ib_core_init(void)
{
int ret = -ENOMEM;
ib_wq = alloc_workqueue("infiniband", 0, 0);
ib_wq = alloc_workqueue("infiniband", WQ_PERCPU, 0);
if (!ib_wq)
return -ENOMEM;
@@ -3031,7 +3031,7 @@ static int __init ib_core_init(void)
goto err;
ib_comp_wq = alloc_workqueue("ib-comp-wq",
WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_SYSFS | WQ_PERCPU, 0);
if (!ib_comp_wq)
goto err_unbound;

View File

@@ -175,7 +175,7 @@ void rdma_restrack_new(struct rdma_restrack_entry *res,
EXPORT_SYMBOL(rdma_restrack_new);
/**
* rdma_restrack_add() - add object to the reource tracking database
* rdma_restrack_add() - add object to the resource tracking database
* @res: resource entry
*/
void rdma_restrack_add(struct rdma_restrack_entry *res)
@@ -277,7 +277,7 @@ int rdma_restrack_put(struct rdma_restrack_entry *res)
EXPORT_SYMBOL(rdma_restrack_put);
/**
* rdma_restrack_del() - delete object from the reource tracking database
* rdma_restrack_del() - delete object from the resource tracking database
* @res: resource entry
*/
void rdma_restrack_del(struct rdma_restrack_entry *res)

View File

@@ -366,7 +366,7 @@ static int ucma_event_handler(struct rdma_cm_id *cm_id,
if (event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
xa_lock(&ctx_table);
if (xa_load(&ctx_table, ctx->id) == ctx)
queue_work(system_unbound_wq, &ctx->close_work);
queue_work(system_dfl_wq, &ctx->close_work);
xa_unlock(&ctx_table);
}
return 0;

View File

@@ -45,6 +45,8 @@
#include "uverbs.h"
#define RESCHED_LOOP_CNT_THRESHOLD 0x1000
static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
{
bool make_dirty = umem->writable && dirty;
@@ -55,10 +57,14 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
ib_dma_unmap_sgtable_attrs(dev, &umem->sgt_append.sgt,
DMA_BIDIRECTIONAL, 0);
for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i)
for_each_sgtable_sg(&umem->sgt_append.sgt, sg, i) {
unpin_user_page_range_dirty_lock(sg_page(sg),
DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
if (i && !(i % RESCHED_LOOP_CNT_THRESHOLD))
cond_resched();
}
sg_free_append_table(&umem->sgt_append);
}

View File

@@ -148,6 +148,7 @@ __attribute_const__ int ib_rate_to_mult(enum ib_rate rate)
case IB_RATE_400_GBPS: return 160;
case IB_RATE_600_GBPS: return 240;
case IB_RATE_800_GBPS: return 320;
case IB_RATE_1600_GBPS: return 640;
default: return -1;
}
}
@@ -178,6 +179,7 @@ __attribute_const__ enum ib_rate mult_to_ib_rate(int mult)
case 160: return IB_RATE_400_GBPS;
case 240: return IB_RATE_600_GBPS;
case 320: return IB_RATE_800_GBPS;
case 640: return IB_RATE_1600_GBPS;
default: return IB_RATE_PORT_CURRENT;
}
}
@@ -208,6 +210,7 @@ __attribute_const__ int ib_rate_to_mbps(enum ib_rate rate)
case IB_RATE_400_GBPS: return 425000;
case IB_RATE_600_GBPS: return 637500;
case IB_RATE_800_GBPS: return 850000;
case IB_RATE_1600_GBPS: return 1700000;
default: return -1;
}
}
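A quick consistency check of the new IB_RATE_1600_GBPS entries against the existing scaling (derived from the table's own values): the multiplier is expressed in 2.5 Gbps units, so 1600 / 2.5 = 640, and the Mbps column scales linearly from the 400 Gbps entry, 4 * 425000 = 1700000.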

View File

@@ -13,5 +13,6 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1/
obj-$(CONFIG_INFINIBAND_HNS_HIP08) += hns/
obj-$(CONFIG_INFINIBAND_QEDR) += qedr/
obj-$(CONFIG_INFINIBAND_BNXT_RE) += bnxt_re/
obj-$(CONFIG_INFINIBAND_BNG_RE) += bng_re/
obj-$(CONFIG_INFINIBAND_ERDMA) += erdma/
obj-$(CONFIG_INFINIBAND_IONIC) += ionic/

View File

@@ -0,0 +1,10 @@
# SPDX-License-Identifier: GPL-2.0-only
config INFINIBAND_BNG_RE
tristate "Broadcom Next generation RoCE HCA support"
depends on 64BIT
depends on INET && DCB && BNGE
help
This driver supports Broadcom Next generation
50/100/200/400/800 gigabit RoCE HCAs. The module
will be called bng_re. To compile this driver
as a module, choose M here.

View File

@@ -0,0 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
ccflags-y := -I $(srctree)/drivers/net/ethernet/broadcom/bnge -I $(srctree)/drivers/infiniband/hw/bnxt_re
obj-$(CONFIG_INFINIBAND_BNG_RE) += bng_re.o
bng_re-y := bng_dev.o bng_fw.o \
bng_res.o bng_sp.o \
bng_debugfs.o

View File

@@ -0,0 +1,39 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2025 Broadcom.
#include <linux/debugfs.h>
#include <linux/pci.h>
#include <rdma/ib_verbs.h>
#include "bng_res.h"
#include "bng_fw.h"
#include "bnge.h"
#include "bnge_auxr.h"
#include "bng_re.h"
#include "bng_debugfs.h"
static struct dentry *bng_re_debugfs_root;
void bng_re_debugfs_add_pdev(struct bng_re_dev *rdev)
{
struct pci_dev *pdev = rdev->aux_dev->pdev;
rdev->dbg_root =
debugfs_create_dir(dev_name(&pdev->dev), bng_re_debugfs_root);
}
void bng_re_debugfs_rem_pdev(struct bng_re_dev *rdev)
{
debugfs_remove_recursive(rdev->dbg_root);
rdev->dbg_root = NULL;
}
void bng_re_register_debugfs(void)
{
bng_re_debugfs_root = debugfs_create_dir("bng_re", NULL);
}
void bng_re_unregister_debugfs(void)
{
debugfs_remove(bng_re_debugfs_root);
}

View File

@@ -0,0 +1,12 @@
/* SPDX-License-Identifier: GPL-2.0 */
// Copyright (c) 2025 Broadcom.
#ifndef __BNG_RE_DEBUGFS__
#define __BNG_RE_DEBUGFS__
void bng_re_debugfs_add_pdev(struct bng_re_dev *rdev);
void bng_re_debugfs_rem_pdev(struct bng_re_dev *rdev);
void bng_re_register_debugfs(void);
void bng_re_unregister_debugfs(void);
#endif

View File

@@ -0,0 +1,534 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2025 Broadcom.
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/auxiliary_bus.h>
#include <rdma/ib_verbs.h>
#include "bng_res.h"
#include "bng_sp.h"
#include "bng_fw.h"
#include "bnge.h"
#include "bnge_auxr.h"
#include "bng_re.h"
#include "bnge_hwrm.h"
#include "bng_debugfs.h"
MODULE_AUTHOR("Siva Reddy Kallam <siva.kallam@broadcom.com>");
MODULE_DESCRIPTION(BNG_RE_DESC);
MODULE_LICENSE("Dual BSD/GPL");
static struct bng_re_dev *bng_re_dev_add(struct auxiliary_device *adev,
struct bnge_auxr_dev *aux_dev)
{
struct bng_re_dev *rdev;
/* Allocate bng_re_dev instance */
rdev = ib_alloc_device(bng_re_dev, ibdev);
if (!rdev) {
pr_err("%s: bng_re_dev allocation failure!", KBUILD_MODNAME);
return NULL;
}
/* Assign auxiliary device specific data */
rdev->netdev = aux_dev->net;
rdev->aux_dev = aux_dev;
rdev->adev = adev;
rdev->fn_id = rdev->aux_dev->pdev->devfn;
return rdev;
}
static int bng_re_register_netdev(struct bng_re_dev *rdev)
{
struct bnge_auxr_dev *aux_dev;
aux_dev = rdev->aux_dev;
return bnge_register_dev(aux_dev, rdev->adev);
}
static void bng_re_destroy_chip_ctx(struct bng_re_dev *rdev)
{
struct bng_re_chip_ctx *chip_ctx;
if (!rdev->chip_ctx)
return;
kfree(rdev->dev_attr);
rdev->dev_attr = NULL;
chip_ctx = rdev->chip_ctx;
rdev->chip_ctx = NULL;
rdev->rcfw.res = NULL;
rdev->bng_res.cctx = NULL;
rdev->bng_res.pdev = NULL;
kfree(chip_ctx);
}
static int bng_re_setup_chip_ctx(struct bng_re_dev *rdev)
{
struct bng_re_chip_ctx *chip_ctx;
struct bnge_auxr_dev *aux_dev;
int rc = -ENOMEM;
aux_dev = rdev->aux_dev;
rdev->bng_res.pdev = aux_dev->pdev;
rdev->rcfw.res = &rdev->bng_res;
chip_ctx = kzalloc(sizeof(*chip_ctx), GFP_KERNEL);
if (!chip_ctx)
return -ENOMEM;
chip_ctx->chip_num = aux_dev->chip_num;
chip_ctx->hw_stats_size = aux_dev->hw_ring_stats_size;
rdev->chip_ctx = chip_ctx;
rdev->bng_res.cctx = rdev->chip_ctx;
rdev->dev_attr = kzalloc(sizeof(*rdev->dev_attr), GFP_KERNEL);
if (!rdev->dev_attr)
goto free_chip_ctx;
rdev->bng_res.dattr = rdev->dev_attr;
return 0;
free_chip_ctx:
kfree(rdev->chip_ctx);
rdev->chip_ctx = NULL;
return rc;
}
static void bng_re_init_hwrm_hdr(struct input *hdr, u16 opcd)
{
hdr->req_type = cpu_to_le16(opcd);
hdr->cmpl_ring = cpu_to_le16(-1);
hdr->target_id = cpu_to_le16(-1);
}
static void bng_re_fill_fw_msg(struct bnge_fw_msg *fw_msg, void *msg,
int msg_len, void *resp, int resp_max_len,
int timeout)
{
fw_msg->msg = msg;
fw_msg->msg_len = msg_len;
fw_msg->resp = resp;
fw_msg->resp_max_len = resp_max_len;
fw_msg->timeout = timeout;
}
static int bng_re_net_ring_free(struct bng_re_dev *rdev,
u16 fw_ring_id, int type)
{
struct bnge_auxr_dev *aux_dev = rdev->aux_dev;
struct hwrm_ring_free_input req = {};
struct hwrm_ring_free_output resp;
struct bnge_fw_msg fw_msg = {};
int rc = -EINVAL;
if (!rdev)
return rc;
if (!aux_dev)
return rc;
bng_re_init_hwrm_hdr((void *)&req, HWRM_RING_FREE);
req.ring_type = type;
req.ring_id = cpu_to_le16(fw_ring_id);
bng_re_fill_fw_msg(&fw_msg, (void *)&req, sizeof(req), (void *)&resp,
sizeof(resp), BNGE_DFLT_HWRM_CMD_TIMEOUT);
rc = bnge_send_msg(aux_dev, &fw_msg);
if (rc)
ibdev_err(&rdev->ibdev, "Failed to free HW ring:%d :%#x",
req.ring_id, rc);
return rc;
}
static int bng_re_net_ring_alloc(struct bng_re_dev *rdev,
struct bng_re_ring_attr *ring_attr,
u16 *fw_ring_id)
{
struct bnge_auxr_dev *aux_dev = rdev->aux_dev;
struct hwrm_ring_alloc_input req = {};
struct hwrm_ring_alloc_output resp;
struct bnge_fw_msg fw_msg = {};
int rc = -EINVAL;
if (!aux_dev)
return rc;
bng_re_init_hwrm_hdr((void *)&req, HWRM_RING_ALLOC);
req.enables = 0;
req.page_tbl_addr = cpu_to_le64(ring_attr->dma_arr[0]);
if (ring_attr->pages > 1) {
/* Page size is in log2 units */
req.page_size = BNGE_PAGE_SHIFT;
req.page_tbl_depth = 1;
}
req.fbo = 0;
/* Association of ring index with doorbell index and MSIX number */
req.logical_id = cpu_to_le16(ring_attr->lrid);
req.length = cpu_to_le32(ring_attr->depth + 1);
req.ring_type = ring_attr->type;
req.int_mode = ring_attr->mode;
bng_re_fill_fw_msg(&fw_msg, (void *)&req, sizeof(req), (void *)&resp,
sizeof(resp), BNGE_DFLT_HWRM_CMD_TIMEOUT);
rc = bnge_send_msg(aux_dev, &fw_msg);
if (!rc)
*fw_ring_id = le16_to_cpu(resp.ring_id);
return rc;
}
static int bng_re_stats_ctx_free(struct bng_re_dev *rdev)
{
struct bnge_auxr_dev *aux_dev = rdev->aux_dev;
struct hwrm_stat_ctx_free_input req = {};
struct hwrm_stat_ctx_free_output resp = {};
struct bnge_fw_msg fw_msg = {};
int rc = -EINVAL;
if (!aux_dev)
return rc;
bng_re_init_hwrm_hdr((void *)&req, HWRM_STAT_CTX_FREE);
req.stat_ctx_id = cpu_to_le32(rdev->stats_ctx.fw_id);
bng_re_fill_fw_msg(&fw_msg, (void *)&req, sizeof(req), (void *)&resp,
sizeof(resp), BNGE_DFLT_HWRM_CMD_TIMEOUT);
rc = bnge_send_msg(aux_dev, &fw_msg);
if (rc)
ibdev_err(&rdev->ibdev, "Failed to free HW stats context %#x",
rc);
return rc;
}
static int bng_re_stats_ctx_alloc(struct bng_re_dev *rdev)
{
struct bnge_auxr_dev *aux_dev = rdev->aux_dev;
struct bng_re_stats *stats = &rdev->stats_ctx;
struct hwrm_stat_ctx_alloc_output resp = {};
struct hwrm_stat_ctx_alloc_input req = {};
struct bnge_fw_msg fw_msg = {};
int rc = -EINVAL;
stats->fw_id = BNGE_INVALID_STATS_CTX_ID;
if (!aux_dev)
return rc;
bng_re_init_hwrm_hdr((void *)&req, HWRM_STAT_CTX_ALLOC);
req.update_period_ms = cpu_to_le32(1000);
req.stats_dma_addr = cpu_to_le64(stats->dma_map);
req.stats_dma_length = cpu_to_le16(rdev->chip_ctx->hw_stats_size);
req.stat_ctx_flags = STAT_CTX_ALLOC_REQ_STAT_CTX_FLAGS_ROCE;
bng_re_fill_fw_msg(&fw_msg, (void *)&req, sizeof(req), (void *)&resp,
sizeof(resp), BNGE_DFLT_HWRM_CMD_TIMEOUT);
rc = bnge_send_msg(aux_dev, &fw_msg);
if (!rc)
stats->fw_id = le32_to_cpu(resp.stat_ctx_id);
return rc;
}
static void bng_re_query_hwrm_version(struct bng_re_dev *rdev)
{
struct bnge_auxr_dev *aux_dev = rdev->aux_dev;
struct hwrm_ver_get_output ver_get_resp = {};
struct hwrm_ver_get_input ver_get_req = {};
struct bng_re_chip_ctx *cctx;
struct bnge_fw_msg fw_msg = {};
int rc;
bng_re_init_hwrm_hdr((void *)&ver_get_req, HWRM_VER_GET);
ver_get_req.hwrm_intf_maj = HWRM_VERSION_MAJOR;
ver_get_req.hwrm_intf_min = HWRM_VERSION_MINOR;
ver_get_req.hwrm_intf_upd = HWRM_VERSION_UPDATE;
bng_re_fill_fw_msg(&fw_msg, (void *)&ver_get_req, sizeof(ver_get_req),
(void *)&ver_get_resp, sizeof(ver_get_resp),
BNGE_DFLT_HWRM_CMD_TIMEOUT);
rc = bnge_send_msg(aux_dev, &fw_msg);
if (rc) {
ibdev_err(&rdev->ibdev, "Failed to query HW version, rc = 0x%x",
rc);
return;
}
cctx = rdev->chip_ctx;
cctx->hwrm_intf_ver =
(u64)le16_to_cpu(ver_get_resp.hwrm_intf_major) << 48 |
(u64)le16_to_cpu(ver_get_resp.hwrm_intf_minor) << 32 |
(u64)le16_to_cpu(ver_get_resp.hwrm_intf_build) << 16 |
le16_to_cpu(ver_get_resp.hwrm_intf_patch);
cctx->hwrm_cmd_max_timeout = le16_to_cpu(ver_get_resp.max_req_timeout);
if (!cctx->hwrm_cmd_max_timeout)
cctx->hwrm_cmd_max_timeout = BNG_ROCE_FW_MAX_TIMEOUT;
}
static void bng_re_dev_uninit(struct bng_re_dev *rdev)
{
int rc;
bng_re_debugfs_rem_pdev(rdev);
if (test_and_clear_bit(BNG_RE_FLAG_RCFW_CHANNEL_EN, &rdev->flags)) {
rc = bng_re_deinit_rcfw(&rdev->rcfw);
if (rc)
ibdev_warn(&rdev->ibdev,
"Failed to deinitialize RCFW: %#x", rc);
bng_re_stats_ctx_free(rdev);
bng_re_free_stats_ctx_mem(rdev->bng_res.pdev, &rdev->stats_ctx);
bng_re_disable_rcfw_channel(&rdev->rcfw);
bng_re_net_ring_free(rdev, rdev->rcfw.creq.ring_id,
RING_ALLOC_REQ_RING_TYPE_NQ);
bng_re_free_rcfw_channel(&rdev->rcfw);
}
kfree(rdev->nqr);
rdev->nqr = NULL;
bng_re_destroy_chip_ctx(rdev);
if (test_and_clear_bit(BNG_RE_FLAG_NETDEV_REGISTERED, &rdev->flags))
bnge_unregister_dev(rdev->aux_dev);
}
static int bng_re_dev_init(struct bng_re_dev *rdev)
{
struct bng_re_ring_attr rattr = {};
struct bng_re_creq_ctx *creq;
u32 db_offt;
int vid;
u8 type;
int rc;
/* Register a new RoCE device instance with netdev */
rc = bng_re_register_netdev(rdev);
if (rc) {
ibdev_err(&rdev->ibdev,
"Failed to register with netedev: %#x\n", rc);
return -EINVAL;
}
set_bit(BNG_RE_FLAG_NETDEV_REGISTERED, &rdev->flags);
if (rdev->aux_dev->auxr_info->msix_requested < BNG_RE_MIN_MSIX) {
ibdev_err(&rdev->ibdev,
"RoCE requires minimum 2 MSI-X vectors, but only %d reserved\n",
rdev->aux_dev->auxr_info->msix_requested);
bnge_unregister_dev(rdev->aux_dev);
clear_bit(BNG_RE_FLAG_NETDEV_REGISTERED, &rdev->flags);
return -EINVAL;
}
ibdev_dbg(&rdev->ibdev, "Got %d MSI-X vectors\n",
rdev->aux_dev->auxr_info->msix_requested);
rc = bng_re_setup_chip_ctx(rdev);
if (rc) {
bnge_unregister_dev(rdev->aux_dev);
clear_bit(BNG_RE_FLAG_NETDEV_REGISTERED, &rdev->flags);
ibdev_err(&rdev->ibdev, "Failed to get chip context\n");
return -EINVAL;
}
bng_re_query_hwrm_version(rdev);
rc = bng_re_alloc_fw_channel(&rdev->bng_res, &rdev->rcfw);
if (rc) {
ibdev_err(&rdev->ibdev,
"Failed to allocate RCFW Channel: %#x\n", rc);
goto fail;
}
/* Allocate nq record memory */
rdev->nqr = kzalloc(sizeof(*rdev->nqr), GFP_KERNEL);
if (!rdev->nqr) {
bng_re_destroy_chip_ctx(rdev);
bnge_unregister_dev(rdev->aux_dev);
clear_bit(BNG_RE_FLAG_NETDEV_REGISTERED, &rdev->flags);
return -ENOMEM;
}
rdev->nqr->num_msix = rdev->aux_dev->auxr_info->msix_requested;
memcpy(rdev->nqr->msix_entries, rdev->aux_dev->msix_info,
sizeof(struct bnge_msix_info) * rdev->nqr->num_msix);
type = RING_ALLOC_REQ_RING_TYPE_NQ;
creq = &rdev->rcfw.creq;
rattr.dma_arr = creq->hwq.pbl[BNG_PBL_LVL_0].pg_map_arr;
rattr.pages = creq->hwq.pbl[creq->hwq.level].pg_count;
rattr.type = type;
rattr.mode = RING_ALLOC_REQ_INT_MODE_MSIX;
rattr.depth = BNG_FW_CREQE_MAX_CNT - 1;
rattr.lrid = rdev->nqr->msix_entries[BNG_RE_CREQ_NQ_IDX].ring_idx;
rc = bng_re_net_ring_alloc(rdev, &rattr, &creq->ring_id);
if (rc) {
ibdev_err(&rdev->ibdev, "Failed to allocate CREQ: %#x\n", rc);
goto free_rcfw;
}
db_offt = rdev->nqr->msix_entries[BNG_RE_CREQ_NQ_IDX].db_offset;
vid = rdev->nqr->msix_entries[BNG_RE_CREQ_NQ_IDX].vector;
rc = bng_re_enable_fw_channel(&rdev->rcfw,
vid, db_offt);
if (rc) {
ibdev_err(&rdev->ibdev, "Failed to enable RCFW channel: %#x\n",
rc);
goto free_ring;
}
rc = bng_re_get_dev_attr(&rdev->rcfw);
if (rc)
goto disable_rcfw;
bng_re_debugfs_add_pdev(rdev);
rc = bng_re_alloc_stats_ctx_mem(rdev->bng_res.pdev, rdev->chip_ctx,
&rdev->stats_ctx);
if (rc) {
ibdev_err(&rdev->ibdev,
"Failed to allocate stats context: %#x\n", rc);
goto disable_rcfw;
}
rc = bng_re_stats_ctx_alloc(rdev);
if (rc) {
ibdev_err(&rdev->ibdev,
"Failed to allocate QPLIB context: %#x\n", rc);
goto free_stats_ctx;
}
rc = bng_re_init_rcfw(&rdev->rcfw, &rdev->stats_ctx);
if (rc) {
ibdev_err(&rdev->ibdev,
"Failed to initialize RCFW: %#x\n", rc);
goto free_sctx;
}
set_bit(BNG_RE_FLAG_RCFW_CHANNEL_EN, &rdev->flags);
return 0;
free_sctx:
bng_re_stats_ctx_free(rdev);
free_stats_ctx:
bng_re_free_stats_ctx_mem(rdev->bng_res.pdev, &rdev->stats_ctx);
disable_rcfw:
bng_re_disable_rcfw_channel(&rdev->rcfw);
free_ring:
bng_re_net_ring_free(rdev, rdev->rcfw.creq.ring_id, type);
free_rcfw:
bng_re_free_rcfw_channel(&rdev->rcfw);
fail:
bng_re_dev_uninit(rdev);
return rc;
}
static int bng_re_add_device(struct auxiliary_device *adev)
{
struct bnge_auxr_priv *auxr_priv =
container_of(adev, struct bnge_auxr_priv, aux_dev);
struct bng_re_en_dev_info *dev_info;
struct bng_re_dev *rdev;
int rc;
dev_info = auxiliary_get_drvdata(adev);
rdev = bng_re_dev_add(adev, auxr_priv->auxr_dev);
if (!rdev) {
rc = -ENOMEM;
goto exit;
}
dev_info->rdev = rdev;
rc = bng_re_dev_init(rdev);
if (rc)
goto re_dev_dealloc;
return 0;
re_dev_dealloc:
ib_dealloc_device(&rdev->ibdev);
exit:
return rc;
}
static void bng_re_remove_device(struct bng_re_dev *rdev,
struct auxiliary_device *aux_dev)
{
bng_re_dev_uninit(rdev);
ib_dealloc_device(&rdev->ibdev);
}
static int bng_re_probe(struct auxiliary_device *adev,
const struct auxiliary_device_id *id)
{
struct bnge_auxr_priv *aux_priv =
container_of(adev, struct bnge_auxr_priv, aux_dev);
struct bng_re_en_dev_info *en_info;
int rc;
en_info = kzalloc(sizeof(*en_info), GFP_KERNEL);
if (!en_info)
return -ENOMEM;
en_info->auxr_dev = aux_priv->auxr_dev;
auxiliary_set_drvdata(adev, en_info);
rc = bng_re_add_device(adev);
if (rc)
kfree(en_info);
return rc;
}
static void bng_re_remove(struct auxiliary_device *adev)
{
struct bng_re_en_dev_info *dev_info = auxiliary_get_drvdata(adev);
struct bng_re_dev *rdev;
rdev = dev_info->rdev;
if (rdev)
bng_re_remove_device(rdev, adev);
kfree(dev_info);
}
static const struct auxiliary_device_id bng_re_id_table[] = {
{ .name = BNG_RE_ADEV_NAME ".rdma", },
{},
};
MODULE_DEVICE_TABLE(auxiliary, bng_re_id_table);
static struct auxiliary_driver bng_re_driver = {
.name = "rdma",
.probe = bng_re_probe,
.remove = bng_re_remove,
.id_table = bng_re_id_table,
};
static int __init bng_re_mod_init(void)
{
int rc;
bng_re_register_debugfs();
rc = auxiliary_driver_register(&bng_re_driver);
if (rc) {
pr_err("%s: Failed to register auxiliary driver\n",
KBUILD_MODNAME);
goto unreg_debugfs;
}
return 0;
unreg_debugfs:
bng_re_unregister_debugfs();
return rc;
}
static void __exit bng_re_mod_exit(void)
{
auxiliary_driver_unregister(&bng_re_driver);
bng_re_unregister_debugfs();
}
module_init(bng_re_mod_init);
module_exit(bng_re_mod_exit);

View File

@@ -0,0 +1,767 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2025 Broadcom.
#include <linux/pci.h>
#include "roce_hsi.h"
#include "bng_res.h"
#include "bng_fw.h"
#include "bng_sp.h"
/**
* bng_re_map_rc - map return type based on opcode
* @opcode: roce slow path opcode
*
* case #1
* Firmware-initiated error recovery is a safe state machine and the
* driver can consider all the underlying rdma resources to be free.
* In this state, it is safe to return success for opcodes related to
* destroying rdma resources (like destroy qp, destroy cq etc.).
*
* case #2
* If the driver detects a potential firmware stall, the state machine is
* not safe and the driver cannot consider all the underlying rdma
* resources to be freed.
* In this state, it is not safe to return success for opcodes related to
* destroying rdma resources (like destroy qp, destroy cq etc.).
*
* The scope of this helper function is limited to case #1.
*
* Returns:
* 0 to communicate success to the caller.
* Non-zero error code to communicate failure to the caller.
*/
static int bng_re_map_rc(u8 opcode)
{
switch (opcode) {
case CMDQ_BASE_OPCODE_DESTROY_QP:
case CMDQ_BASE_OPCODE_DESTROY_SRQ:
case CMDQ_BASE_OPCODE_DESTROY_CQ:
case CMDQ_BASE_OPCODE_DEALLOCATE_KEY:
case CMDQ_BASE_OPCODE_DEREGISTER_MR:
case CMDQ_BASE_OPCODE_DELETE_GID:
case CMDQ_BASE_OPCODE_DESTROY_QP1:
case CMDQ_BASE_OPCODE_DESTROY_AH:
case CMDQ_BASE_OPCODE_DEINITIALIZE_FW:
case CMDQ_BASE_OPCODE_MODIFY_ROCE_CC:
case CMDQ_BASE_OPCODE_SET_LINK_AGGR_MODE:
return 0;
default:
return -ETIMEDOUT;
}
}
void bng_re_free_rcfw_channel(struct bng_re_rcfw *rcfw)
{
kfree(rcfw->crsqe_tbl);
bng_re_free_hwq(rcfw->res, &rcfw->cmdq.hwq);
bng_re_free_hwq(rcfw->res, &rcfw->creq.hwq);
rcfw->pdev = NULL;
}
int bng_re_alloc_fw_channel(struct bng_re_res *res,
struct bng_re_rcfw *rcfw)
{
struct bng_re_hwq_attr hwq_attr = {};
struct bng_re_sg_info sginfo = {};
struct bng_re_cmdq_ctx *cmdq;
struct bng_re_creq_ctx *creq;
rcfw->pdev = res->pdev;
cmdq = &rcfw->cmdq;
creq = &rcfw->creq;
rcfw->res = res;
sginfo.pgsize = PAGE_SIZE;
sginfo.pgshft = PAGE_SHIFT;
hwq_attr.sginfo = &sginfo;
hwq_attr.res = rcfw->res;
hwq_attr.depth = BNG_FW_CREQE_MAX_CNT;
hwq_attr.stride = BNG_FW_CREQE_UNITS;
hwq_attr.type = BNG_HWQ_TYPE_QUEUE;
if (bng_re_alloc_init_hwq(&creq->hwq, &hwq_attr)) {
dev_err(&rcfw->pdev->dev,
"HW channel CREQ allocation failed\n");
goto fail;
}
rcfw->cmdq_depth = BNG_FW_CMDQE_MAX_CNT;
sginfo.pgsize = bng_fw_cmdqe_page_size(rcfw->cmdq_depth);
hwq_attr.depth = rcfw->cmdq_depth & 0x7FFFFFFF;
hwq_attr.stride = BNG_FW_CMDQE_UNITS;
hwq_attr.type = BNG_HWQ_TYPE_CTX;
if (bng_re_alloc_init_hwq(&cmdq->hwq, &hwq_attr)) {
dev_err(&rcfw->pdev->dev,
"HW channel CMDQ allocation failed\n");
goto fail;
}
rcfw->crsqe_tbl = kcalloc(cmdq->hwq.max_elements,
sizeof(*rcfw->crsqe_tbl), GFP_KERNEL);
if (!rcfw->crsqe_tbl)
goto fail;
spin_lock_init(&rcfw->tbl_lock);
rcfw->max_timeout = res->cctx->hwrm_cmd_max_timeout;
return 0;
fail:
bng_re_free_rcfw_channel(rcfw);
return -ENOMEM;
}
static int bng_re_process_qp_event(struct bng_re_rcfw *rcfw,
struct creq_qp_event *qp_event,
u32 *num_wait)
{
struct bng_re_hwq *hwq = &rcfw->cmdq.hwq;
struct bng_re_crsqe *crsqe;
u32 req_size;
u16 cookie;
bool is_waiter_alive;
struct pci_dev *pdev;
u32 wait_cmds = 0;
int rc = 0;
pdev = rcfw->pdev;
switch (qp_event->event) {
case CREQ_QP_EVENT_EVENT_QP_ERROR_NOTIFICATION:
dev_err(&pdev->dev, "Received QP error notification\n");
break;
default:
/*
* Command Response
* cmdq->lock needs to be acquired to synchronize
* the command send and completion reaping. This function
* is always called with creq->lock held. Using
* the nested variant of spin_lock.
*
*/
spin_lock_nested(&hwq->lock, SINGLE_DEPTH_NESTING);
cookie = le16_to_cpu(qp_event->cookie);
cookie &= BNG_FW_MAX_COOKIE_VALUE;
crsqe = &rcfw->crsqe_tbl[cookie];
if (WARN_ONCE(test_bit(FIRMWARE_STALL_DETECTED,
&rcfw->cmdq.flags),
"Unreponsive rcfw channel detected.!!")) {
dev_info(&pdev->dev,
"rcfw timedout: cookie = %#x, free_slots = %d",
cookie, crsqe->free_slots);
spin_unlock(&hwq->lock);
return rc;
}
if (crsqe->is_waiter_alive) {
if (crsqe->resp) {
memcpy(crsqe->resp, qp_event, sizeof(*qp_event));
/* Insert write memory barrier to ensure that
* response data is copied before clearing the
* flags
*/
smp_wmb();
}
}
wait_cmds++;
req_size = crsqe->req_size;
is_waiter_alive = crsqe->is_waiter_alive;
crsqe->req_size = 0;
if (!is_waiter_alive)
crsqe->resp = NULL;
crsqe->is_in_used = false;
hwq->cons += req_size;
spin_unlock(&hwq->lock);
}
*num_wait += wait_cmds;
return rc;
}
/* function events */
static int bng_re_process_func_event(struct bng_re_rcfw *rcfw,
struct creq_func_event *func_event)
{
switch (func_event->event) {
case CREQ_FUNC_EVENT_EVENT_TX_WQE_ERROR:
case CREQ_FUNC_EVENT_EVENT_TX_DATA_ERROR:
case CREQ_FUNC_EVENT_EVENT_RX_WQE_ERROR:
case CREQ_FUNC_EVENT_EVENT_RX_DATA_ERROR:
case CREQ_FUNC_EVENT_EVENT_CQ_ERROR:
case CREQ_FUNC_EVENT_EVENT_TQM_ERROR:
case CREQ_FUNC_EVENT_EVENT_CFCQ_ERROR:
case CREQ_FUNC_EVENT_EVENT_CFCS_ERROR:
case CREQ_FUNC_EVENT_EVENT_CFCC_ERROR:
case CREQ_FUNC_EVENT_EVENT_CFCM_ERROR:
case CREQ_FUNC_EVENT_EVENT_TIM_ERROR:
case CREQ_FUNC_EVENT_EVENT_VF_COMM_REQUEST:
case CREQ_FUNC_EVENT_EVENT_RESOURCE_EXHAUSTED:
break;
default:
return -EINVAL;
}
return 0;
}
/* CREQ Completion handlers */
static void bng_re_service_creq(struct tasklet_struct *t)
{
struct bng_re_rcfw *rcfw = from_tasklet(rcfw, t, creq.creq_tasklet);
struct bng_re_creq_ctx *creq = &rcfw->creq;
u32 type, budget = BNG_FW_CREQ_ENTRY_POLL_BUDGET;
struct bng_re_hwq *hwq = &creq->hwq;
struct creq_base *creqe;
u32 num_wakeup = 0;
u32 hw_polled = 0;
/* Service the CREQ until budget is over */
spin_lock_bh(&hwq->lock);
while (budget > 0) {
creqe = bng_re_get_qe(hwq, hwq->cons, NULL);
if (!BNG_FW_CREQ_CMP_VALID(creqe, creq->creq_db.dbinfo.flags))
break;
/* The valid test of the entry must be done first before
* reading any further.
*/
dma_rmb();
type = creqe->type & CREQ_BASE_TYPE_MASK;
switch (type) {
case CREQ_BASE_TYPE_QP_EVENT:
bng_re_process_qp_event
(rcfw, (struct creq_qp_event *)creqe,
&num_wakeup);
creq->stats.creq_qp_event_processed++;
break;
case CREQ_BASE_TYPE_FUNC_EVENT:
if (!bng_re_process_func_event
(rcfw, (struct creq_func_event *)creqe))
creq->stats.creq_func_event_processed++;
else
dev_warn(&rcfw->pdev->dev,
"aeqe:%#x Not handled\n", type);
break;
default:
if (type != ASYNC_EVENT_CMPL_TYPE_HWRM_ASYNC_EVENT)
dev_warn(&rcfw->pdev->dev,
"creqe with event 0x%x not handled\n",
type);
break;
}
budget--;
hw_polled++;
bng_re_hwq_incr_cons(hwq->max_elements, &hwq->cons,
1, &creq->creq_db.dbinfo.flags);
}
if (hw_polled)
bng_re_ring_nq_db(&creq->creq_db.dbinfo,
rcfw->res->cctx, true);
spin_unlock_bh(&hwq->lock);
if (num_wakeup)
wake_up_nr(&rcfw->cmdq.waitq, num_wakeup);
}
static int __send_message_basic_sanity(struct bng_re_rcfw *rcfw,
struct bng_re_cmdqmsg *msg,
u8 opcode)
{
struct bng_re_cmdq_ctx *cmdq;
cmdq = &rcfw->cmdq;
if (test_bit(FIRMWARE_STALL_DETECTED, &cmdq->flags))
return -ETIMEDOUT;
if (test_bit(FIRMWARE_INITIALIZED_FLAG, &cmdq->flags) &&
opcode == CMDQ_BASE_OPCODE_INITIALIZE_FW) {
dev_err(&rcfw->pdev->dev, "RCFW already initialized!");
return -EINVAL;
}
if (!test_bit(FIRMWARE_INITIALIZED_FLAG, &cmdq->flags) &&
(opcode != CMDQ_BASE_OPCODE_QUERY_FUNC &&
opcode != CMDQ_BASE_OPCODE_INITIALIZE_FW &&
opcode != CMDQ_BASE_OPCODE_QUERY_VERSION)) {
dev_err(&rcfw->pdev->dev,
"RCFW not initialized, reject opcode 0x%x",
opcode);
return -EOPNOTSUPP;
}
return 0;
}
static int __send_message(struct bng_re_rcfw *rcfw,
struct bng_re_cmdqmsg *msg, u8 opcode)
{
u32 bsize, free_slots, required_slots;
struct bng_re_cmdq_ctx *cmdq;
struct bng_re_crsqe *crsqe;
struct bng_fw_cmdqe *cmdqe;
struct bng_re_hwq *hwq;
u32 sw_prod, cmdq_prod;
struct pci_dev *pdev;
u16 cookie;
u8 *preq;
cmdq = &rcfw->cmdq;
hwq = &cmdq->hwq;
pdev = rcfw->pdev;
/* Cmdq are in 16-byte units, each request can consume 1 or more
* cmdqe
*/
spin_lock_bh(&hwq->lock);
required_slots = bng_re_get_cmd_slots(msg->req);
free_slots = HWQ_FREE_SLOTS(hwq);
cookie = cmdq->seq_num & BNG_FW_MAX_COOKIE_VALUE;
crsqe = &rcfw->crsqe_tbl[cookie];
if (required_slots >= free_slots) {
dev_info_ratelimited(&pdev->dev,
"CMDQ is full req/free %d/%d!",
required_slots, free_slots);
spin_unlock_bh(&hwq->lock);
return -EAGAIN;
}
__set_cmdq_base_cookie(msg->req, msg->req_sz, cpu_to_le16(cookie));
bsize = bng_re_set_cmd_slots(msg->req);
crsqe->free_slots = free_slots;
crsqe->resp = (struct creq_qp_event *)msg->resp;
crsqe->is_waiter_alive = true;
crsqe->is_in_used = true;
crsqe->opcode = opcode;
crsqe->req_size = __get_cmdq_base_cmd_size(msg->req, msg->req_sz);
if (__get_cmdq_base_resp_size(msg->req, msg->req_sz) && msg->sb) {
struct bng_re_rcfw_sbuf *sbuf = msg->sb;
__set_cmdq_base_resp_addr(msg->req, msg->req_sz,
cpu_to_le64(sbuf->dma_addr));
__set_cmdq_base_resp_size(msg->req, msg->req_sz,
ALIGN(sbuf->size,
BNG_FW_CMDQE_UNITS) /
BNG_FW_CMDQE_UNITS);
}
preq = (u8 *)msg->req;
do {
/* Locate the next cmdq slot */
sw_prod = HWQ_CMP(hwq->prod, hwq);
cmdqe = bng_re_get_qe(hwq, sw_prod, NULL);
/* Copy a segment of the req cmd to the cmdq */
memset(cmdqe, 0, sizeof(*cmdqe));
memcpy(cmdqe, preq, min_t(u32, bsize, sizeof(*cmdqe)));
preq += min_t(u32, bsize, sizeof(*cmdqe));
bsize -= min_t(u32, bsize, sizeof(*cmdqe));
hwq->prod++;
} while (bsize > 0);
cmdq->seq_num++;
cmdq_prod = hwq->prod & 0xFFFF;
if (test_bit(FIRMWARE_FIRST_FLAG, &cmdq->flags)) {
/* The very first doorbell write
* is required to set this flag
* which prompts the FW to reset
* its internal pointers
*/
cmdq_prod |= BIT(FIRMWARE_FIRST_FLAG);
clear_bit(FIRMWARE_FIRST_FLAG, &cmdq->flags);
}
/* ring CMDQ DB */
wmb();
writel(cmdq_prod, cmdq->cmdq_mbox.prod);
writel(BNG_FW_CMDQ_TRIG_VAL, cmdq->cmdq_mbox.db);
spin_unlock_bh(&hwq->lock);
/* Return the CREQ response pointer */
return 0;
}
/**
* __wait_for_resp - Don't hold the cpu context and wait for response
* @rcfw: rcfw channel instance of rdev
* @cookie: cookie to track the command
*
* Wait for command completion in sleepable context.
*
* Returns:
* 0 if command is completed by firmware.
* Non-zero error code for the rest of the cases.
*/
static int __wait_for_resp(struct bng_re_rcfw *rcfw, u16 cookie)
{
struct bng_re_cmdq_ctx *cmdq;
struct bng_re_crsqe *crsqe;
cmdq = &rcfw->cmdq;
crsqe = &rcfw->crsqe_tbl[cookie];
do {
wait_event_timeout(cmdq->waitq,
!crsqe->is_in_used,
secs_to_jiffies(rcfw->max_timeout));
if (!crsqe->is_in_used)
return 0;
bng_re_service_creq(&rcfw->creq.creq_tasklet);
if (!crsqe->is_in_used)
return 0;
} while (true);
};
/**
* bng_re_rcfw_send_message - interface to send
* and complete rcfw command.
* @rcfw: rcfw channel instance of rdev
* @msg: message to send
*
* This function does not account for shadow queue depth. It will send
* the command unconditionally as long as the send queue is not full.
*
* Returns:
* 0 if command completed by firmware.
* Non-zero if the command is not completed by firmware.
*/
int bng_re_rcfw_send_message(struct bng_re_rcfw *rcfw,
struct bng_re_cmdqmsg *msg)
{
struct creq_qp_event *evnt = (struct creq_qp_event *)msg->resp;
struct bng_re_crsqe *crsqe;
u16 cookie;
int rc;
u8 opcode;
opcode = __get_cmdq_base_opcode(msg->req, msg->req_sz);
rc = __send_message_basic_sanity(rcfw, msg, opcode);
if (rc)
return rc == -ENXIO ? bng_re_map_rc(opcode) : rc;
rc = __send_message(rcfw, msg, opcode);
if (rc)
return rc;
cookie = le16_to_cpu(__get_cmdq_base_cookie(msg->req, msg->req_sz))
& BNG_FW_MAX_COOKIE_VALUE;
rc = __wait_for_resp(rcfw, cookie);
if (rc) {
spin_lock_bh(&rcfw->cmdq.hwq.lock);
crsqe = &rcfw->crsqe_tbl[cookie];
crsqe->is_waiter_alive = false;
if (rc == -ENODEV)
set_bit(FIRMWARE_STALL_DETECTED, &rcfw->cmdq.flags);
spin_unlock_bh(&rcfw->cmdq.hwq.lock);
return -ETIMEDOUT;
}
if (evnt->status) {
/* failed with status */
dev_err(&rcfw->pdev->dev, "cmdq[%#x]=%#x status %#x\n",
cookie, opcode, evnt->status);
rc = -EIO;
}
return rc;
}
static int bng_re_map_cmdq_mbox(struct bng_re_rcfw *rcfw)
{
struct bng_re_cmdq_mbox *mbox;
resource_size_t bar_reg;
struct pci_dev *pdev;
pdev = rcfw->pdev;
mbox = &rcfw->cmdq.cmdq_mbox;
mbox->reg.bar_id = BNG_FW_COMM_PCI_BAR_REGION;
mbox->reg.len = BNG_FW_COMM_SIZE;
mbox->reg.bar_base = pci_resource_start(pdev, mbox->reg.bar_id);
if (!mbox->reg.bar_base) {
dev_err(&pdev->dev,
"CMDQ BAR region %d resc start is 0!\n",
mbox->reg.bar_id);
return -ENOMEM;
}
bar_reg = mbox->reg.bar_base + BNG_FW_COMM_BASE_OFFSET;
mbox->reg.len = BNG_FW_COMM_SIZE;
mbox->reg.bar_reg = ioremap(bar_reg, mbox->reg.len);
if (!mbox->reg.bar_reg) {
dev_err(&pdev->dev,
"CMDQ BAR region %d mapping failed\n",
mbox->reg.bar_id);
return -ENOMEM;
}
mbox->prod = (void __iomem *)(mbox->reg.bar_reg +
BNG_FW_PF_VF_COMM_PROD_OFFSET);
mbox->db = (void __iomem *)(mbox->reg.bar_reg + BNG_FW_COMM_TRIG_OFFSET);
return 0;
}
static irqreturn_t bng_re_creq_irq(int irq, void *dev_instance)
{
struct bng_re_rcfw *rcfw = dev_instance;
struct bng_re_creq_ctx *creq;
struct bng_re_hwq *hwq;
u32 sw_cons;
creq = &rcfw->creq;
hwq = &creq->hwq;
/* Prefetch the CREQ element */
sw_cons = HWQ_CMP(hwq->cons, hwq);
bng_re_get_qe(hwq, sw_cons, NULL);
tasklet_schedule(&creq->creq_tasklet);
return IRQ_HANDLED;
}
int bng_re_rcfw_start_irq(struct bng_re_rcfw *rcfw, int msix_vector,
bool need_init)
{
struct bng_re_creq_ctx *creq;
struct bng_re_res *res;
int rc;
creq = &rcfw->creq;
res = rcfw->res;
if (creq->irq_handler_avail)
return -EFAULT;
creq->msix_vec = msix_vector;
if (need_init)
tasklet_setup(&creq->creq_tasklet, bng_re_service_creq);
else
tasklet_enable(&creq->creq_tasklet);
creq->irq_name = kasprintf(GFP_KERNEL, "bng_re-creq@pci:%s",
pci_name(res->pdev));
if (!creq->irq_name)
return -ENOMEM;
rc = request_irq(creq->msix_vec, bng_re_creq_irq, 0,
creq->irq_name, rcfw);
if (rc) {
kfree(creq->irq_name);
creq->irq_name = NULL;
tasklet_disable(&creq->creq_tasklet);
return rc;
}
creq->irq_handler_avail = true;
bng_re_ring_nq_db(&creq->creq_db.dbinfo, res->cctx, true);
atomic_inc(&rcfw->rcfw_intr_enabled);
return 0;
}
static int bng_re_map_creq_db(struct bng_re_rcfw *rcfw, u32 reg_offt)
{
struct bng_re_creq_db *creq_db;
resource_size_t bar_reg;
struct pci_dev *pdev;
pdev = rcfw->pdev;
creq_db = &rcfw->creq.creq_db;
creq_db->dbinfo.flags = 0;
creq_db->reg.bar_id = BNG_FW_COMM_CONS_PCI_BAR_REGION;
creq_db->reg.bar_base = pci_resource_start(pdev, creq_db->reg.bar_id);
if (!creq_db->reg.bar_id)
dev_err(&pdev->dev,
"CREQ BAR region %d resc start is 0!",
creq_db->reg.bar_id);
bar_reg = creq_db->reg.bar_base + reg_offt;
creq_db->reg.len = BNG_FW_CREQ_DB_LEN;
creq_db->reg.bar_reg = ioremap(bar_reg, creq_db->reg.len);
if (!creq_db->reg.bar_reg) {
dev_err(&pdev->dev,
"CREQ BAR region %d mapping failed",
creq_db->reg.bar_id);
return -ENOMEM;
}
creq_db->dbinfo.db = creq_db->reg.bar_reg;
creq_db->dbinfo.hwq = &rcfw->creq.hwq;
creq_db->dbinfo.xid = rcfw->creq.ring_id;
return 0;
}
void bng_re_rcfw_stop_irq(struct bng_re_rcfw *rcfw, bool kill)
{
struct bng_re_creq_ctx *creq;
creq = &rcfw->creq;
if (!creq->irq_handler_avail)
return;
creq->irq_handler_avail = false;
/* Mask h/w interrupts */
bng_re_ring_nq_db(&creq->creq_db.dbinfo, rcfw->res->cctx, false);
/* Sync with last running IRQ-handler */
synchronize_irq(creq->msix_vec);
free_irq(creq->msix_vec, rcfw);
kfree(creq->irq_name);
creq->irq_name = NULL;
atomic_set(&rcfw->rcfw_intr_enabled, 0);
if (kill)
tasklet_kill(&creq->creq_tasklet);
tasklet_disable(&creq->creq_tasklet);
}
void bng_re_disable_rcfw_channel(struct bng_re_rcfw *rcfw)
{
struct bng_re_creq_ctx *creq;
struct bng_re_cmdq_ctx *cmdq;
creq = &rcfw->creq;
cmdq = &rcfw->cmdq;
/* Make sure the HW channel is stopped! */
bng_re_rcfw_stop_irq(rcfw, true);
iounmap(cmdq->cmdq_mbox.reg.bar_reg);
iounmap(creq->creq_db.reg.bar_reg);
cmdq->cmdq_mbox.reg.bar_reg = NULL;
creq->creq_db.reg.bar_reg = NULL;
creq->msix_vec = 0;
}
static void bng_re_start_rcfw(struct bng_re_rcfw *rcfw)
{
struct bng_re_cmdq_ctx *cmdq;
struct bng_re_creq_ctx *creq;
struct bng_re_cmdq_mbox *mbox;
struct cmdq_init init = {0};
cmdq = &rcfw->cmdq;
creq = &rcfw->creq;
mbox = &cmdq->cmdq_mbox;
init.cmdq_pbl = cpu_to_le64(cmdq->hwq.pbl[BNG_PBL_LVL_0].pg_map_arr[0]);
init.cmdq_size_cmdq_lvl =
cpu_to_le16(((rcfw->cmdq_depth <<
CMDQ_INIT_CMDQ_SIZE_SFT) &
CMDQ_INIT_CMDQ_SIZE_MASK) |
((cmdq->hwq.level <<
CMDQ_INIT_CMDQ_LVL_SFT) &
CMDQ_INIT_CMDQ_LVL_MASK));
init.creq_ring_id = cpu_to_le16(creq->ring_id);
/* Write to the mailbox register */
__iowrite32_copy(mbox->reg.bar_reg, &init, sizeof(init) / 4);
}
int bng_re_enable_fw_channel(struct bng_re_rcfw *rcfw,
int msix_vector,
int cp_bar_reg_off)
{
struct bng_re_cmdq_ctx *cmdq;
int rc;
cmdq = &rcfw->cmdq;
/* Assign defaults */
cmdq->seq_num = 0;
set_bit(FIRMWARE_FIRST_FLAG, &cmdq->flags);
init_waitqueue_head(&cmdq->waitq);
rc = bng_re_map_cmdq_mbox(rcfw);
if (rc)
return rc;
rc = bng_re_map_creq_db(rcfw, cp_bar_reg_off);
if (rc)
return rc;
rc = bng_re_rcfw_start_irq(rcfw, msix_vector, true);
if (rc) {
dev_err(&rcfw->pdev->dev,
"Failed to request IRQ for CREQ rc = 0x%x\n", rc);
bng_re_disable_rcfw_channel(rcfw);
return rc;
}
bng_re_start_rcfw(rcfw);
return 0;
}
int bng_re_deinit_rcfw(struct bng_re_rcfw *rcfw)
{
struct creq_deinitialize_fw_resp resp = {};
struct cmdq_deinitialize_fw req = {};
struct bng_re_cmdqmsg msg = {};
int rc;
bng_re_rcfw_cmd_prep((struct cmdq_base *)&req,
CMDQ_BASE_OPCODE_DEINITIALIZE_FW,
sizeof(req));
bng_re_fill_cmdqmsg(&msg, &req, &resp, NULL,
sizeof(req), sizeof(resp), 0);
rc = bng_re_rcfw_send_message(rcfw, &msg);
if (rc)
return rc;
clear_bit(FIRMWARE_INITIALIZED_FLAG, &rcfw->cmdq.flags);
return 0;
}
static inline bool _is_hw_retx_supported(u16 dev_cap_flags)
{
return dev_cap_flags &
(CREQ_QUERY_FUNC_RESP_SB_HW_REQUESTER_RETX_ENABLED |
CREQ_QUERY_FUNC_RESP_SB_HW_RESPONDER_RETX_ENABLED);
}
#define BNG_RE_HW_RETX(a) _is_hw_retx_supported((a))
static inline bool _is_optimize_modify_qp_supported(u16 dev_cap_ext_flags2)
{
return dev_cap_ext_flags2 &
CREQ_QUERY_FUNC_RESP_SB_OPTIMIZE_MODIFY_QP_SUPPORTED;
}
int bng_re_init_rcfw(struct bng_re_rcfw *rcfw,
struct bng_re_stats *stats_ctx)
{
struct creq_initialize_fw_resp resp = {};
struct cmdq_initialize_fw req = {};
struct bng_re_cmdqmsg msg = {};
int rc;
u16 flags = 0;
bng_re_rcfw_cmd_prep((struct cmdq_base *)&req,
CMDQ_BASE_OPCODE_INITIALIZE_FW,
sizeof(req));
/* Supply (log-base-2-of-host-page-size - base-page-shift)
* to bono to adjust the doorbell page sizes.
*/
req.log2_dbr_pg_size = cpu_to_le16(PAGE_SHIFT -
BNG_FW_DBR_BASE_PAGE_SHIFT);
if (BNG_RE_HW_RETX(rcfw->res->dattr->dev_cap_flags))
flags |= CMDQ_INITIALIZE_FW_FLAGS_HW_REQUESTER_RETX_SUPPORTED;
if (_is_optimize_modify_qp_supported(rcfw->res->dattr->dev_cap_flags2))
flags |= CMDQ_INITIALIZE_FW_FLAGS_OPTIMIZE_MODIFY_QP_SUPPORTED;
req.flags |= cpu_to_le16(flags);
req.stat_ctx_id = cpu_to_le32(stats_ctx->fw_id);
bng_re_fill_cmdqmsg(&msg, &req, &resp, NULL, sizeof(req), sizeof(resp), 0);
rc = bng_re_rcfw_send_message(rcfw, &msg);
if (rc)
return rc;
set_bit(FIRMWARE_INITIALIZED_FLAG, &rcfw->cmdq.flags);
return 0;
}

View File

@@ -0,0 +1,211 @@
/* SPDX-License-Identifier: GPL-2.0 */
// Copyright (c) 2025 Broadcom.
#ifndef __BNG_FW_H__
#define __BNG_FW_H__
#include "bng_tlv.h"
/* FW DB related */
#define BNG_FW_CMDQ_TRIG_VAL 1
#define BNG_FW_COMM_PCI_BAR_REGION 0
#define BNG_FW_COMM_CONS_PCI_BAR_REGION 2
#define BNG_FW_DBR_BASE_PAGE_SHIFT 12
#define BNG_FW_COMM_SIZE 0x104
#define BNG_FW_COMM_BASE_OFFSET 0x600
#define BNG_FW_COMM_TRIG_OFFSET 0x100
#define BNG_FW_PF_VF_COMM_PROD_OFFSET 0xc
#define BNG_FW_CREQ_DB_LEN 8
/* CREQ */
#define BNG_FW_CREQE_MAX_CNT (64 * 1024)
#define BNG_FW_CREQE_UNITS 16
#define BNG_FW_CREQ_ENTRY_POLL_BUDGET 0x100
#define BNG_FW_CREQ_CMP_VALID(hdr, pass) \
(!!((hdr)->v & CREQ_BASE_V) == \
!((pass) & BNG_RE_FLAG_EPOCH_CONS_MASK))
#define BNG_FW_CREQ_ENTRY_POLL_BUDGET 0x100
/* CMDQ */
struct bng_fw_cmdqe {
u8 data[16];
};
#define BNG_FW_CMDQE_MAX_CNT 8192
#define BNG_FW_CMDQE_UNITS sizeof(struct bng_fw_cmdqe)
#define BNG_FW_CMDQE_BYTES(depth) ((depth) * BNG_FW_CMDQE_UNITS)
#define BNG_FW_MAX_COOKIE_VALUE (BNG_FW_CMDQE_MAX_CNT - 1)
#define BNG_FW_CMD_IS_BLOCKING 0x8000
/* Crsq buf is 1024-Byte */
struct bng_re_crsbe {
u8 data[1024];
};
static inline u32 bng_fw_cmdqe_npages(u32 depth)
{
u32 npages;
npages = BNG_FW_CMDQE_BYTES(depth) / PAGE_SIZE;
if (BNG_FW_CMDQE_BYTES(depth) % PAGE_SIZE)
npages++;
return npages;
}
static inline u32 bng_fw_cmdqe_page_size(u32 depth)
{
return (bng_fw_cmdqe_npages(depth) * PAGE_SIZE);
}
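A quick sanity check of these helpers with the defaults above (assuming 4 KiB pages): with depth = BNG_FW_CMDQE_MAX_CNT = 8192 and 16-byte entries, BNG_FW_CMDQE_BYTES(8192) = 131072, so bng_fw_cmdqe_npages() returns 32 and bng_fw_cmdqe_page_size() returns 131072, the value that bng_re_alloc_fw_channel() feeds into sginfo.pgsize for the CMDQ.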
struct bng_re_cmdq_mbox {
struct bng_re_reg_desc reg;
void __iomem *prod;
void __iomem *db;
};
/* HWQ */
struct bng_re_cmdq_ctx {
struct bng_re_hwq hwq;
struct bng_re_cmdq_mbox cmdq_mbox;
unsigned long flags;
#define FIRMWARE_INITIALIZED_FLAG (0)
#define FIRMWARE_STALL_DETECTED (3)
#define FIRMWARE_FIRST_FLAG (31)
wait_queue_head_t waitq;
u32 seq_num;
};
struct bng_re_creq_db {
struct bng_re_reg_desc reg;
struct bng_re_db_info dbinfo;
};
struct bng_re_creq_stat {
u64 creq_qp_event_processed;
u64 creq_func_event_processed;
};
struct bng_re_creq_ctx {
struct bng_re_hwq hwq;
struct bng_re_creq_db creq_db;
struct bng_re_creq_stat stats;
struct tasklet_struct creq_tasklet;
u16 ring_id;
int msix_vec;
bool irq_handler_avail;
char *irq_name;
};
struct bng_re_crsqe {
struct creq_qp_event *resp;
u32 req_size;
/* Free slots at the time of submission */
u32 free_slots;
u8 opcode;
bool is_waiter_alive;
bool is_in_used;
};
struct bng_re_rcfw_sbuf {
void *sb;
dma_addr_t dma_addr;
u32 size;
};
/* RoCE FW Communication Channels */
struct bng_re_rcfw {
struct pci_dev *pdev;
struct bng_re_res *res;
struct bng_re_cmdq_ctx cmdq;
struct bng_re_creq_ctx creq;
struct bng_re_crsqe *crsqe_tbl;
/* To synchronize the qp-handle hash table */
spinlock_t tbl_lock;
u32 cmdq_depth;
/* cached from chip cctx for quick reference in slow path */
u16 max_timeout;
atomic_t rcfw_intr_enabled;
};
struct bng_re_cmdqmsg {
struct cmdq_base *req;
struct creq_base *resp;
void *sb;
u32 req_sz;
u32 res_sz;
u8 block;
};
static inline void bng_re_rcfw_cmd_prep(struct cmdq_base *req,
u8 opcode, u8 cmd_size)
{
req->opcode = opcode;
req->cmd_size = cmd_size;
}
static inline void bng_re_fill_cmdqmsg(struct bng_re_cmdqmsg *msg,
void *req, void *resp, void *sb,
u32 req_sz, u32 res_sz, u8 block)
{
msg->req = req;
msg->resp = resp;
msg->sb = sb;
msg->req_sz = req_sz;
msg->res_sz = res_sz;
msg->block = block;
}
/* Get the number of command units required for the req. This returns
 * the correct value only if it is called before the request is
 * converted with bng_re_set_cmd_slots().
 */
static inline u32 bng_re_get_cmd_slots(struct cmdq_base *req)
{
u32 cmd_units = 0;
if (HAS_TLV_HEADER(req)) {
struct roce_tlv *tlv_req = (struct roce_tlv *)req;
cmd_units = tlv_req->total_size;
} else {
cmd_units = (req->cmd_size + BNG_FW_CMDQE_UNITS - 1) /
BNG_FW_CMDQE_UNITS;
}
return cmd_units;
}
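/*
 * Convert req->cmd_size from bytes to 16-byte command units in place and
 * return the original length in bytes. TLV-encapsulated requests already
 * carry the unit count in their header and are left untouched.
 */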
static inline u32 bng_re_set_cmd_slots(struct cmdq_base *req)
{
u32 cmd_byte = 0;
if (HAS_TLV_HEADER(req)) {
struct roce_tlv *tlv_req = (struct roce_tlv *)req;
cmd_byte = tlv_req->total_size * BNG_FW_CMDQE_UNITS;
} else {
cmd_byte = req->cmd_size;
req->cmd_size = (req->cmd_size + BNG_FW_CMDQE_UNITS - 1) /
BNG_FW_CMDQE_UNITS;
}
return cmd_byte;
}
void bng_re_free_rcfw_channel(struct bng_re_rcfw *rcfw);
int bng_re_alloc_fw_channel(struct bng_re_res *res,
struct bng_re_rcfw *rcfw);
int bng_re_enable_fw_channel(struct bng_re_rcfw *rcfw,
int msix_vector,
int cp_bar_reg_off);
void bng_re_disable_rcfw_channel(struct bng_re_rcfw *rcfw);
int bng_re_rcfw_start_irq(struct bng_re_rcfw *rcfw, int msix_vector,
bool need_init);
void bng_re_rcfw_stop_irq(struct bng_re_rcfw *rcfw, bool kill);
int bng_re_rcfw_send_message(struct bng_re_rcfw *rcfw,
struct bng_re_cmdqmsg *msg);
int bng_re_init_rcfw(struct bng_re_rcfw *rcfw,
struct bng_re_stats *stats_ctx);
int bng_re_deinit_rcfw(struct bng_re_rcfw *rcfw);
#endif


@@ -0,0 +1,85 @@
/* SPDX-License-Identifier: GPL-2.0 */
// Copyright (c) 2025 Broadcom.
#ifndef __BNG_RE_H__
#define __BNG_RE_H__
#include "bng_res.h"
#define BNG_RE_ADEV_NAME "bng_en"
#define BNG_RE_DESC "Broadcom 800G RoCE Driver"
#define rdev_to_dev(rdev) ((rdev) ? (&(rdev)->ibdev.dev) : NULL)
#define BNG_RE_MIN_MSIX 2
#define BNG_RE_MAX_MSIX BNGE_MAX_ROCE_MSIX
#define BNG_RE_CREQ_NQ_IDX 0
#define BNGE_INVALID_STATS_CTX_ID -1
/* NQ specific structures */
struct bng_re_nq_db {
struct bng_re_reg_desc reg;
struct bng_re_db_info dbinfo;
};
struct bng_re_nq {
struct pci_dev *pdev;
struct bng_re_res *res;
char *name;
struct bng_re_hwq hwq;
struct bng_re_nq_db nq_db;
u16 ring_id;
int msix_vec;
cpumask_t mask;
struct tasklet_struct nq_tasklet;
bool requested;
int budget;
u32 load;
struct workqueue_struct *cqn_wq;
};
struct bng_re_nq_record {
struct bnge_msix_info msix_entries[BNG_RE_MAX_MSIX];
struct bng_re_nq nq[BNG_RE_MAX_MSIX];
int num_msix;
/* serialize NQ access */
struct mutex load_lock;
};
struct bng_re_en_dev_info {
struct bng_re_dev *rdev;
struct bnge_auxr_dev *auxr_dev;
};
struct bng_re_ring_attr {
dma_addr_t *dma_arr;
int pages;
int type;
u32 depth;
u32 lrid; /* Logical ring id */
u8 mode;
};
struct bng_re_dev {
struct ib_device ibdev;
unsigned long flags;
#define BNG_RE_FLAG_NETDEV_REGISTERED 0
#define BNG_RE_FLAG_RCFW_CHANNEL_EN 1
struct net_device *netdev;
struct auxiliary_device *adev;
struct bnge_auxr_dev *aux_dev;
struct bng_re_chip_ctx *chip_ctx;
int fn_id;
struct bng_re_res bng_res;
struct bng_re_rcfw rcfw;
struct bng_re_nq_record *nqr;
/* Device Resources */
struct bng_re_dev_attr *dev_attr;
struct dentry *dbg_root;
struct bng_re_stats stats_ctx;
};
#endif


@@ -0,0 +1,279 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2025 Broadcom.
#include <linux/pci.h>
#include <linux/vmalloc.h>
#include <rdma/ib_umem.h>
#include <linux/bnxt/hsi.h>
#include "bng_res.h"
#include "roce_hsi.h"
/* Stats */
void bng_re_free_stats_ctx_mem(struct pci_dev *pdev,
struct bng_re_stats *stats)
{
if (stats->dma) {
dma_free_coherent(&pdev->dev, stats->size,
stats->dma, stats->dma_map);
}
memset(stats, 0, sizeof(*stats));
stats->fw_id = -1;
}
int bng_re_alloc_stats_ctx_mem(struct pci_dev *pdev,
struct bng_re_chip_ctx *cctx,
struct bng_re_stats *stats)
{
memset(stats, 0, sizeof(*stats));
stats->fw_id = -1;
stats->size = cctx->hw_stats_size;
stats->dma = dma_alloc_coherent(&pdev->dev, stats->size,
&stats->dma_map, GFP_KERNEL);
if (!stats->dma)
return -ENOMEM;
return 0;
}
static void bng_free_pbl(struct bng_re_res *res, struct bng_re_pbl *pbl)
{
struct pci_dev *pdev = res->pdev;
int i;
for (i = 0; i < pbl->pg_count; i++) {
if (pbl->pg_arr[i])
dma_free_coherent(&pdev->dev, pbl->pg_size,
(void *)((unsigned long)
pbl->pg_arr[i] &
PAGE_MASK),
pbl->pg_map_arr[i]);
else
dev_warn(&pdev->dev,
"PBL free pg_arr[%d] empty?!\n", i);
pbl->pg_arr[i] = NULL;
}
vfree(pbl->pg_arr);
pbl->pg_arr = NULL;
vfree(pbl->pg_map_arr);
pbl->pg_map_arr = NULL;
pbl->pg_count = 0;
pbl->pg_size = 0;
}
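/*
 * Allocate the page pointer arrays and the DMA-coherent pages backing one
 * PBL level; on any failure everything allocated so far is released.
 */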
static int bng_alloc_pbl(struct bng_re_res *res,
struct bng_re_pbl *pbl,
struct bng_re_sg_info *sginfo)
{
struct pci_dev *pdev = res->pdev;
u32 pages;
int i;
if (sginfo->nopte)
return 0;
pages = sginfo->npages;
/* page ptr arrays */
pbl->pg_arr = vmalloc_array(pages, sizeof(void *));
if (!pbl->pg_arr)
return -ENOMEM;
pbl->pg_map_arr = vmalloc_array(pages, sizeof(dma_addr_t));
if (!pbl->pg_map_arr) {
vfree(pbl->pg_arr);
pbl->pg_arr = NULL;
return -ENOMEM;
}
pbl->pg_count = 0;
pbl->pg_size = sginfo->pgsize;
for (i = 0; i < pages; i++) {
pbl->pg_arr[i] = dma_alloc_coherent(&pdev->dev,
pbl->pg_size,
&pbl->pg_map_arr[i],
GFP_KERNEL);
if (!pbl->pg_arr[i])
goto fail;
pbl->pg_count++;
}
return 0;
fail:
bng_free_pbl(res, pbl);
return -ENOMEM;
}
void bng_re_free_hwq(struct bng_re_res *res,
struct bng_re_hwq *hwq)
{
int i;
if (!hwq->max_elements)
return;
if (hwq->level >= BNG_PBL_LVL_MAX)
return;
for (i = 0; i < hwq->level + 1; i++)
bng_free_pbl(res, &hwq->pbl[i]);
hwq->level = BNG_PBL_LVL_MAX;
hwq->max_elements = 0;
hwq->element_size = 0;
hwq->prod = 0;
hwq->cons = 0;
}
/* All HWQs are power of 2 in size */
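/*
 * Build the page indirection for the queue:
 *  - level 0: the queue fits in a single page and is mapped directly;
 *  - level 1: PBL page(s) of PTEs point at the queue pages;
 *  - level 2: a PDE page points at PBL pages, whose entries (PTEs) point
 *    at the queue pages.
 * For BNG_HWQ_TYPE_QUEUE the last and next-to-last PTEs are additionally
 * tagged with PTU_PTE_LAST/PTU_PTE_NEXT_TO_LAST.
 */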
int bng_re_alloc_init_hwq(struct bng_re_hwq *hwq,
struct bng_re_hwq_attr *hwq_attr)
{
u32 npages, pg_size;
struct bng_re_sg_info sginfo = {};
u32 depth, stride, npbl, npde;
dma_addr_t *src_phys_ptr, **dst_virt_ptr;
struct bng_re_res *res;
struct pci_dev *pdev;
int i, rc, lvl;
res = hwq_attr->res;
pdev = res->pdev;
pg_size = hwq_attr->sginfo->pgsize;
hwq->level = BNG_PBL_LVL_MAX;
depth = roundup_pow_of_two(hwq_attr->depth);
stride = roundup_pow_of_two(hwq_attr->stride);
npages = (depth * stride) / pg_size;
if ((depth * stride) % pg_size)
npages++;
if (!npages)
return -EINVAL;
hwq_attr->sginfo->npages = npages;
if (npages == MAX_PBL_LVL_0_PGS && !hwq_attr->sginfo->nopte) {
/* This request is Level 0, map PTE */
rc = bng_alloc_pbl(res, &hwq->pbl[BNG_PBL_LVL_0], hwq_attr->sginfo);
if (rc)
goto fail;
hwq->level = BNG_PBL_LVL_0;
goto done;
}
if (npages >= MAX_PBL_LVL_0_PGS) {
if (npages > MAX_PBL_LVL_1_PGS) {
u32 flag = PTU_PTE_VALID;
/* 2 levels of indirection */
npbl = npages >> MAX_PBL_LVL_1_PGS_SHIFT;
if (npages % BIT(MAX_PBL_LVL_1_PGS_SHIFT))
npbl++;
npde = npbl >> MAX_PDL_LVL_SHIFT;
if (npbl % BIT(MAX_PDL_LVL_SHIFT))
npde++;
/* Alloc PDE pages */
sginfo.pgsize = npde * pg_size;
sginfo.npages = 1;
rc = bng_alloc_pbl(res, &hwq->pbl[BNG_PBL_LVL_0], &sginfo);
if (rc)
goto fail;
/* Alloc PBL pages */
sginfo.npages = npbl;
sginfo.pgsize = PAGE_SIZE;
rc = bng_alloc_pbl(res, &hwq->pbl[BNG_PBL_LVL_1], &sginfo);
if (rc)
goto fail;
/* Fill PDL with PBL page pointers */
dst_virt_ptr =
(dma_addr_t **)hwq->pbl[BNG_PBL_LVL_0].pg_arr;
src_phys_ptr = hwq->pbl[BNG_PBL_LVL_1].pg_map_arr;
for (i = 0; i < hwq->pbl[BNG_PBL_LVL_1].pg_count; i++)
dst_virt_ptr[0][i] = src_phys_ptr[i] | flag;
/* Alloc or init PTEs */
rc = bng_alloc_pbl(res, &hwq->pbl[BNG_PBL_LVL_2],
hwq_attr->sginfo);
if (rc)
goto fail;
hwq->level = BNG_PBL_LVL_2;
if (hwq_attr->sginfo->nopte)
goto done;
/* Fill PBLs with PTE pointers */
dst_virt_ptr =
(dma_addr_t **)hwq->pbl[BNG_PBL_LVL_1].pg_arr;
src_phys_ptr = hwq->pbl[BNG_PBL_LVL_2].pg_map_arr;
for (i = 0; i < hwq->pbl[BNG_PBL_LVL_2].pg_count; i++) {
dst_virt_ptr[PTR_PG(i)][PTR_IDX(i)] =
src_phys_ptr[i] | PTU_PTE_VALID;
}
if (hwq_attr->type == BNG_HWQ_TYPE_QUEUE) {
/* Mark the last and next-to-last pages of the queue */
i = hwq->pbl[BNG_PBL_LVL_2].pg_count;
dst_virt_ptr[PTR_PG(i - 1)][PTR_IDX(i - 1)] |=
PTU_PTE_LAST;
if (i > 1)
dst_virt_ptr[PTR_PG(i - 2)]
[PTR_IDX(i - 2)] |=
PTU_PTE_NEXT_TO_LAST;
}
} else { /* npages <= 512: npbl = 1, npde = 0 */
u32 flag = PTU_PTE_VALID;
/* 1 level of indirection */
npbl = npages >> MAX_PBL_LVL_1_PGS_SHIFT;
if (npages % BIT(MAX_PBL_LVL_1_PGS_SHIFT))
npbl++;
sginfo.npages = npbl;
sginfo.pgsize = PAGE_SIZE;
/* Alloc PBL page */
rc = bng_alloc_pbl(res, &hwq->pbl[BNG_PBL_LVL_0], &sginfo);
if (rc)
goto fail;
/* Alloc or init PTEs */
rc = bng_alloc_pbl(res, &hwq->pbl[BNG_PBL_LVL_1],
hwq_attr->sginfo);
if (rc)
goto fail;
hwq->level = BNG_PBL_LVL_1;
if (hwq_attr->sginfo->nopte)
goto done;
/* Fill PBL with PTE pointers */
dst_virt_ptr =
(dma_addr_t **)hwq->pbl[BNG_PBL_LVL_0].pg_arr;
src_phys_ptr = hwq->pbl[BNG_PBL_LVL_1].pg_map_arr;
for (i = 0; i < hwq->pbl[BNG_PBL_LVL_1].pg_count; i++)
dst_virt_ptr[PTR_PG(i)][PTR_IDX(i)] =
src_phys_ptr[i] | flag;
if (hwq_attr->type == BNG_HWQ_TYPE_QUEUE) {
/* Mark the last and next-to-last pages of the queue */
i = hwq->pbl[BNG_PBL_LVL_1].pg_count;
dst_virt_ptr[PTR_PG(i - 1)][PTR_IDX(i - 1)] |=
PTU_PTE_LAST;
if (i > 1)
dst_virt_ptr[PTR_PG(i - 2)]
[PTR_IDX(i - 2)] |=
PTU_PTE_NEXT_TO_LAST;
}
}
}
done:
hwq->prod = 0;
hwq->cons = 0;
hwq->pdev = pdev;
hwq->depth = hwq_attr->depth;
hwq->max_elements = hwq->depth;
hwq->element_size = stride;
hwq->qe_ppg = pg_size / stride;
/* For direct access to the elements */
lvl = hwq->level;
if (hwq_attr->sginfo->nopte && hwq->level)
lvl = hwq->level - 1;
hwq->pbl_ptr = hwq->pbl[lvl].pg_arr;
hwq->pbl_dma_ptr = hwq->pbl[lvl].pg_map_arr;
spin_lock_init(&hwq->lock);
return 0;
fail:
bng_re_free_hwq(res, hwq);
return -ENOMEM;
}


@@ -0,0 +1,215 @@
/* SPDX-License-Identifier: GPL-2.0 */
// Copyright (c) 2025 Broadcom.
#ifndef __BNG_RES_H__
#define __BNG_RES_H__
#include "roce_hsi.h"
#define BNG_ROCE_FW_MAX_TIMEOUT 60
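/*
 * PTR_PG()/PTR_IDX() split a flat element index into a (page, offset) pair
 * within a PBL page array; HWQ_CMP() masks a producer/consumer index into
 * the power-of-two ring and HWQ_FREE_SLOTS() gives the remaining room.
 */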
#define PTR_CNT_PER_PG (PAGE_SIZE / sizeof(void *))
#define PTR_MAX_IDX_PER_PG (PTR_CNT_PER_PG - 1)
#define PTR_PG(x) (((x) & ~PTR_MAX_IDX_PER_PG) / PTR_CNT_PER_PG)
#define PTR_IDX(x) ((x) & PTR_MAX_IDX_PER_PG)
#define HWQ_CMP(idx, hwq) ((idx) & ((hwq)->max_elements - 1))
#define HWQ_FREE_SLOTS(hwq) (hwq->max_elements - \
((HWQ_CMP(hwq->prod, hwq)\
- HWQ_CMP(hwq->cons, hwq))\
& (hwq->max_elements - 1)))
#define MAX_PBL_LVL_0_PGS 1
#define MAX_PBL_LVL_1_PGS 512
#define MAX_PBL_LVL_1_PGS_SHIFT 9
#define MAX_PBL_LVL_1_PGS_FOR_LVL_2 256
#define MAX_PBL_LVL_2_PGS (256 * 512)
#define MAX_PDL_LVL_SHIFT 9
#define BNG_RE_DBR_VALID (0x1UL << 26)
#define BNG_RE_DBR_EPOCH_SHIFT 24
#define BNG_RE_DBR_TOGGLE_SHIFT 25
#define BNG_MAX_TQM_ALLOC_REQ 48
struct bng_re_reg_desc {
u8 bar_id;
resource_size_t bar_base;
unsigned long offset;
void __iomem *bar_reg;
size_t len;
};
struct bng_re_db_info {
void __iomem *db;
void __iomem *priv_db;
struct bng_re_hwq *hwq;
u32 xid;
u32 max_slot;
u32 flags;
u8 toggle;
};
enum bng_re_db_info_flags_mask {
BNG_RE_FLAG_EPOCH_CONS_SHIFT = 0x0UL,
BNG_RE_FLAG_EPOCH_PROD_SHIFT = 0x1UL,
BNG_RE_FLAG_EPOCH_CONS_MASK = 0x1UL,
BNG_RE_FLAG_EPOCH_PROD_MASK = 0x2UL,
};
enum bng_re_db_epoch_flag_shift {
BNG_RE_DB_EPOCH_CONS_SHIFT = BNG_RE_DBR_EPOCH_SHIFT,
BNG_RE_DB_EPOCH_PROD_SHIFT = (BNG_RE_DBR_EPOCH_SHIFT - 1),
};
struct bng_re_chip_ctx {
u16 chip_num;
u16 hw_stats_size;
u64 hwrm_intf_ver;
u16 hwrm_cmd_max_timeout;
};
struct bng_re_pbl {
u32 pg_count;
u32 pg_size;
void **pg_arr;
dma_addr_t *pg_map_arr;
};
enum bng_re_pbl_lvl {
BNG_PBL_LVL_0,
BNG_PBL_LVL_1,
BNG_PBL_LVL_2,
BNG_PBL_LVL_MAX
};
enum bng_re_hwq_type {
BNG_HWQ_TYPE_CTX,
BNG_HWQ_TYPE_QUEUE
};
struct bng_re_sg_info {
u32 npages;
u32 pgshft;
u32 pgsize;
bool nopte;
};
struct bng_re_hwq_attr {
struct bng_re_res *res;
struct bng_re_sg_info *sginfo;
enum bng_re_hwq_type type;
u32 depth;
u32 stride;
u32 aux_stride;
u32 aux_depth;
};
struct bng_re_hwq {
struct pci_dev *pdev;
/* lock to protect hwq */
spinlock_t lock;
struct bng_re_pbl pbl[BNG_PBL_LVL_MAX + 1];
/* Valid values: 0, 1, 2 */
enum bng_re_pbl_lvl level;
/* PBL entries */
void **pbl_ptr;
/* PBL dma_addr */
dma_addr_t *pbl_dma_ptr;
u32 max_elements;
u32 depth;
u16 element_size;
u32 prod;
u32 cons;
/* queue entries per page */
u16 qe_ppg;
};
struct bng_re_stats {
dma_addr_t dma_map;
void *dma;
u32 size;
u32 fw_id;
};
struct bng_re_res {
struct pci_dev *pdev;
struct bng_re_chip_ctx *cctx;
struct bng_re_dev_attr *dattr;
};
static inline void *bng_re_get_qe(struct bng_re_hwq *hwq,
u32 indx, u64 *pg)
{
u32 pg_num, pg_idx;
pg_num = (indx / hwq->qe_ppg);
pg_idx = (indx % hwq->qe_ppg);
if (pg)
*pg = (u64)&hwq->pbl_ptr[pg_num];
return (void *)(hwq->pbl_ptr[pg_num] + hwq->element_size * pg_idx);
}
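/*
 * Doorbell layout: the upper 32 bits carry the xid, path, type and valid
 * bit; the lower 32 bits carry the ring index, with the toggle bit at
 * BNG_RE_DBR_TOGGLE_SHIFT (the epoch bit is folded into the index by the
 * caller).
 */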
#define BNG_RE_INIT_DBHDR(xid, type, indx, toggle) \
(((u64)(((xid) & DBC_DBC_XID_MASK) | DBC_DBC_PATH_ROCE | \
(type) | BNG_RE_DBR_VALID) << 32) | (indx) | \
(((u32)(toggle)) << (BNG_RE_DBR_TOGGLE_SHIFT)))
static inline void bng_re_ring_db(struct bng_re_db_info *info,
u32 type)
{
u64 key = 0;
u32 indx;
u8 toggle = 0;
if (type == DBC_DBC_TYPE_CQ_ARMALL ||
type == DBC_DBC_TYPE_CQ_ARMSE)
toggle = info->toggle;
indx = (info->hwq->cons & DBC_DBC_INDEX_MASK) |
((info->flags & BNG_RE_FLAG_EPOCH_CONS_MASK) <<
BNG_RE_DB_EPOCH_CONS_SHIFT);
key = BNG_RE_INIT_DBHDR(info->xid, type, indx, toggle);
writeq(key, info->db);
}
static inline void bng_re_ring_nq_db(struct bng_re_db_info *info,
struct bng_re_chip_ctx *cctx,
bool arm)
{
u32 type;
type = arm ? DBC_DBC_TYPE_NQ_ARM : DBC_DBC_TYPE_NQ;
bng_re_ring_db(info, type);
}
static inline void bng_re_hwq_incr_cons(u32 max_elements, u32 *cons, u32 cnt,
u32 *dbinfo_flags)
{
/* move cons and update toggle/epoch if wrap around */
*cons += cnt;
if (*cons >= max_elements) {
*cons %= max_elements;
*dbinfo_flags ^= 1UL << BNG_RE_FLAG_EPOCH_CONS_SHIFT;
}
}
static inline bool _is_max_srq_ext_supported(u16 dev_cap_ext_flags_2)
{
return !!(dev_cap_ext_flags_2 & CREQ_QUERY_FUNC_RESP_SB_MAX_SRQ_EXTENDED);
}
void bng_re_free_hwq(struct bng_re_res *res,
struct bng_re_hwq *hwq);
int bng_re_alloc_init_hwq(struct bng_re_hwq *hwq,
struct bng_re_hwq_attr *hwq_attr);
void bng_re_free_stats_ctx_mem(struct pci_dev *pdev,
struct bng_re_stats *stats);
int bng_re_alloc_stats_ctx_mem(struct pci_dev *pdev,
struct bng_re_chip_ctx *cctx,
struct bng_re_stats *stats);
#endif


@@ -0,0 +1,131 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2025 Broadcom.
#include <linux/interrupt.h>
#include <linux/pci.h>
#include "bng_res.h"
#include "bng_fw.h"
#include "bng_sp.h"
#include "bng_tlv.h"
static bool bng_re_is_atomic_cap(struct bng_re_rcfw *rcfw)
{
u16 pcie_ctl2 = 0;
pcie_capability_read_word(rcfw->pdev, PCI_EXP_DEVCTL2, &pcie_ctl2);
return (pcie_ctl2 & PCI_EXP_DEVCTL2_ATOMIC_REQ);
}
static void bng_re_query_version(struct bng_re_rcfw *rcfw,
char *fw_ver)
{
struct creq_query_version_resp resp = {};
struct bng_re_cmdqmsg msg = {};
struct cmdq_query_version req = {};
int rc;
bng_re_rcfw_cmd_prep((struct cmdq_base *)&req,
CMDQ_BASE_OPCODE_QUERY_VERSION,
sizeof(req));
bng_re_fill_cmdqmsg(&msg, &req, &resp, NULL, sizeof(req), sizeof(resp), 0);
rc = bng_re_rcfw_send_message(rcfw, &msg);
if (rc)
return;
fw_ver[0] = resp.fw_maj;
fw_ver[1] = resp.fw_minor;
fw_ver[2] = resp.fw_bld;
fw_ver[3] = resp.fw_rsvd;
}
int bng_re_get_dev_attr(struct bng_re_rcfw *rcfw)
{
struct bng_re_dev_attr *attr = rcfw->res->dattr;
struct creq_query_func_resp resp = {};
struct bng_re_cmdqmsg msg = {};
struct creq_query_func_resp_sb *sb;
struct bng_re_rcfw_sbuf sbuf;
struct cmdq_query_func req = {};
u8 *tqm_alloc;
int i, rc;
u32 temp;
bng_re_rcfw_cmd_prep((struct cmdq_base *)&req,
CMDQ_BASE_OPCODE_QUERY_FUNC,
sizeof(req));
sbuf.size = ALIGN(sizeof(*sb), BNG_FW_CMDQE_UNITS);
sbuf.sb = dma_alloc_coherent(&rcfw->pdev->dev, sbuf.size,
&sbuf.dma_addr, GFP_KERNEL);
if (!sbuf.sb)
return -ENOMEM;
sb = sbuf.sb;
req.resp_size = sbuf.size / BNG_FW_CMDQE_UNITS;
bng_re_fill_cmdqmsg(&msg, &req, &resp, &sbuf, sizeof(req),
sizeof(resp), 0);
rc = bng_re_rcfw_send_message(rcfw, &msg);
if (rc)
goto bail;
/* Extract the context from the side buffer */
attr->max_qp = le32_to_cpu(sb->max_qp);
/* The max_qp value reported by FW doesn't include QP1 */
attr->max_qp += 1;
attr->max_qp_rd_atom =
sb->max_qp_rd_atom > BNG_RE_MAX_OUT_RD_ATOM ?
BNG_RE_MAX_OUT_RD_ATOM : sb->max_qp_rd_atom;
attr->max_qp_init_rd_atom =
sb->max_qp_init_rd_atom > BNG_RE_MAX_OUT_RD_ATOM ?
BNG_RE_MAX_OUT_RD_ATOM : sb->max_qp_init_rd_atom;
attr->max_qp_wqes = le16_to_cpu(sb->max_qp_wr) - 1;
/* Adjust max_qp_wqes for variable-size WQEs */
attr->max_qp_wqes = min_t(u32, attr->max_qp_wqes, BNG_VAR_MAX_WQE - 1);
attr->max_qp_sges = min_t(u32, sb->max_sge_var_wqe, BNG_VAR_MAX_SGE);
attr->max_cq = le32_to_cpu(sb->max_cq);
attr->max_cq_wqes = le32_to_cpu(sb->max_cqe);
attr->max_cq_sges = attr->max_qp_sges;
attr->max_mr = le32_to_cpu(sb->max_mr);
attr->max_mw = le32_to_cpu(sb->max_mw);
attr->max_mr_size = le64_to_cpu(sb->max_mr_size);
attr->max_pd = 64 * 1024;
attr->max_raw_ethy_qp = le32_to_cpu(sb->max_raw_eth_qp);
attr->max_ah = le32_to_cpu(sb->max_ah);
attr->max_srq = le16_to_cpu(sb->max_srq);
attr->max_srq_wqes = le32_to_cpu(sb->max_srq_wr) - 1;
attr->max_srq_sges = sb->max_srq_sge;
attr->max_pkey = 1;
attr->max_inline_data = le32_to_cpu(sb->max_inline_data);
/*
 * Read the max GID count supported by HW.
 * Each GID entry in the HW table consumes two entries in the kernel
 * GID table, so the value reported to the stack can be up to twice
 * what the HW reports, capped at 256 GIDs.
 */
attr->max_sgid = le32_to_cpu(sb->max_gid);
attr->max_sgid = min_t(u32, BNG_RE_NUM_GIDS_SUPPORTED, 2 * attr->max_sgid);
attr->dev_cap_flags = le16_to_cpu(sb->dev_cap_flags);
attr->dev_cap_flags2 = le16_to_cpu(sb->dev_cap_ext_flags_2);
if (_is_max_srq_ext_supported(attr->dev_cap_flags2))
attr->max_srq += le16_to_cpu(sb->max_srq_ext);
bng_re_query_version(rcfw, attr->fw_ver);
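/* tqm_alloc_reqs are packed four per 32-bit word; unpack them byte-wise. */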
for (i = 0; i < BNG_MAX_TQM_ALLOC_REQ / 4; i++) {
temp = le32_to_cpu(sb->tqm_alloc_reqs[i]);
tqm_alloc = (u8 *)&temp;
attr->tqm_alloc_reqs[i * 4] = *tqm_alloc;
attr->tqm_alloc_reqs[i * 4 + 1] = *(++tqm_alloc);
attr->tqm_alloc_reqs[i * 4 + 2] = *(++tqm_alloc);
attr->tqm_alloc_reqs[i * 4 + 3] = *(++tqm_alloc);
}
attr->max_dpi = le32_to_cpu(sb->max_dpi);
attr->is_atomic = bng_re_is_atomic_cap(rcfw);
bail:
dma_free_coherent(&rcfw->pdev->dev, sbuf.size,
sbuf.sb, sbuf.dma_addr);
return rc;
}


@@ -0,0 +1,47 @@
/* SPDX-License-Identifier: GPL-2.0 */
// Copyright (c) 2025 Broadcom.
#ifndef __BNG_SP_H__
#define __BNG_SP_H__
#include "bng_fw.h"
#define BNG_VAR_MAX_WQE 4352
#define BNG_VAR_MAX_SGE 13
struct bng_re_dev_attr {
#define FW_VER_ARR_LEN 4
u8 fw_ver[FW_VER_ARR_LEN];
#define BNG_RE_NUM_GIDS_SUPPORTED 256
u16 max_sgid;
u16 max_mrw;
u32 max_qp;
#define BNG_RE_MAX_OUT_RD_ATOM 126
u32 max_qp_rd_atom;
u32 max_qp_init_rd_atom;
u32 max_qp_wqes;
u32 max_qp_sges;
u32 max_cq;
u32 max_cq_wqes;
u32 max_cq_sges;
u32 max_mr;
u64 max_mr_size;
u32 max_pd;
u32 max_mw;
u32 max_raw_ethy_qp;
u32 max_ah;
u32 max_srq;
u32 max_srq_wqes;
u32 max_srq_sges;
u32 max_pkey;
u32 max_inline_data;
u32 l2_db_size;
u8 tqm_alloc_reqs[BNG_MAX_TQM_ALLOC_REQ];
bool is_atomic;
u16 dev_cap_flags;
u16 dev_cap_flags2;
u32 max_dpi;
};
int bng_re_get_dev_attr(struct bng_re_rcfw *rcfw);
#endif


@@ -0,0 +1,128 @@
/* SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause */
#ifndef __BNG_TLV_H__
#define __BNG_TLV_H__
#include "roce_hsi.h"
struct roce_tlv {
struct tlv tlv;
u8 total_size; // in units of 16 byte chunks
u8 unused[7]; // for 16 byte alignment
};
/*
* TLV size in units of 16 byte chunks
*/
#define TLV_SIZE ((sizeof(struct roce_tlv) + 15) / 16)
/*
* TLV length in bytes
*/
#define TLV_BYTES (TLV_SIZE * 16)
#define HAS_TLV_HEADER(msg) (le16_to_cpu(((struct tlv *)(msg))->cmd_discr) == CMD_DISCR_TLV_ENCAP)
#define GET_TLV_DATA(tlv) ((void *)&((uint8_t *)(tlv))[TLV_BYTES])
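/*
 * The accessors below handle both plain and TLV-encapsulated commands: when
 * a TLV header is present, the real cmdq_base sits TLV_BYTES into the buffer.
 */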
static inline u8 __get_cmdq_base_opcode(struct cmdq_base *req, u32 size)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
return ((struct cmdq_base *)GET_TLV_DATA(req))->opcode;
else
return req->opcode;
}
static inline void __set_cmdq_base_opcode(struct cmdq_base *req,
u32 size, u8 val)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
((struct cmdq_base *)GET_TLV_DATA(req))->opcode = val;
else
req->opcode = val;
}
static inline __le16 __get_cmdq_base_cookie(struct cmdq_base *req, u32 size)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
return ((struct cmdq_base *)GET_TLV_DATA(req))->cookie;
else
return req->cookie;
}
static inline void __set_cmdq_base_cookie(struct cmdq_base *req,
u32 size, __le16 val)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
((struct cmdq_base *)GET_TLV_DATA(req))->cookie = val;
else
req->cookie = val;
}
static inline __le64 __get_cmdq_base_resp_addr(struct cmdq_base *req, u32 size)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
return ((struct cmdq_base *)GET_TLV_DATA(req))->resp_addr;
else
return req->resp_addr;
}
static inline void __set_cmdq_base_resp_addr(struct cmdq_base *req,
u32 size, __le64 val)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
((struct cmdq_base *)GET_TLV_DATA(req))->resp_addr = val;
else
req->resp_addr = val;
}
static inline u8 __get_cmdq_base_resp_size(struct cmdq_base *req, u32 size)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
return ((struct cmdq_base *)GET_TLV_DATA(req))->resp_size;
else
return req->resp_size;
}
static inline void __set_cmdq_base_resp_size(struct cmdq_base *req,
u32 size, u8 val)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
((struct cmdq_base *)GET_TLV_DATA(req))->resp_size = val;
else
req->resp_size = val;
}
static inline u8 __get_cmdq_base_cmd_size(struct cmdq_base *req, u32 size)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
return ((struct roce_tlv *)(req))->total_size;
else
return req->cmd_size;
}
static inline void __set_cmdq_base_cmd_size(struct cmdq_base *req,
u32 size, u8 val)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
((struct cmdq_base *)GET_TLV_DATA(req))->cmd_size = val;
else
req->cmd_size = val;
}
static inline __le16 __get_cmdq_base_flags(struct cmdq_base *req, u32 size)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
return ((struct cmdq_base *)GET_TLV_DATA(req))->flags;
else
return req->flags;
}
static inline void __set_cmdq_base_flags(struct cmdq_base *req,
u32 size, __le16 val)
{
if (HAS_TLV_HEADER(req) && size > TLV_BYTES)
((struct cmdq_base *)GET_TLV_DATA(req))->flags = val;
else
req->flags = val;
}
#endif /* __BNG_TLV_H__ */


@@ -224,6 +224,8 @@ struct bnxt_re_dev {
struct workqueue_struct *dcb_wq;
struct dentry *cc_config;
struct bnxt_re_dbg_cc_config_params *cc_config_params;
struct dentry *cq_coal_cfg;
struct bnxt_re_dbg_cq_coal_params *cq_coal_cfg_params;
#define BNXT_VPD_FLD_LEN 32
char board_partno[BNXT_VPD_FLD_LEN];
/* RoCE mirror */


@@ -23,6 +23,14 @@
static struct dentry *bnxt_re_debugfs_root;
static const char * const bnxt_re_cq_coal_str[] = {
"buf_maxtime",
"normal_maxbuf",
"during_maxbuf",
"en_ring_idle_mode",
"enable",
};
static const char * const bnxt_re_cc_gen0_name[] = {
"enable_cc",
"run_avg_weight_g",
@@ -349,6 +357,123 @@ static void bnxt_re_debugfs_add_info(struct bnxt_re_dev *rdev)
debugfs_create_file("info", 0400, rdev->dbg_root, rdev, &info_fops);
}
static ssize_t cq_coal_cfg_write(struct file *file,
const char __user *buf,
size_t count, loff_t *pos)
{
struct seq_file *s = file->private_data;
struct bnxt_re_cq_coal_param *param = s->private;
struct bnxt_re_dev *rdev = param->rdev;
int offset = param->offset;
char lbuf[16] = { };
u32 val;
if (count > sizeof(lbuf))
return -EINVAL;
if (copy_from_user(lbuf, buf, count))
return -EFAULT;
lbuf[sizeof(lbuf) - 1] = '\0';
if (kstrtou32(lbuf, 0, &val))
return -EINVAL;
switch (offset) {
case BNXT_RE_COAL_CQ_BUF_MAXTIME:
if (val < 1 || val > BNXT_QPLIB_CQ_COAL_MAX_BUF_MAXTIME)
return -EINVAL;
rdev->cq_coalescing.buf_maxtime = val;
break;
case BNXT_RE_COAL_CQ_NORMAL_MAXBUF:
if (val < 1 || val > BNXT_QPLIB_CQ_COAL_MAX_NORMAL_MAXBUF)
return -EINVAL;
rdev->cq_coalescing.normal_maxbuf = val;
break;
case BNXT_RE_COAL_CQ_DURING_MAXBUF:
if (val < 1 || val > BNXT_QPLIB_CQ_COAL_MAX_DURING_MAXBUF)
return -EINVAL;
rdev->cq_coalescing.during_maxbuf = val;
break;
case BNXT_RE_COAL_CQ_EN_RING_IDLE_MODE:
if (val > BNXT_QPLIB_CQ_COAL_MAX_EN_RING_IDLE_MODE)
return -EINVAL;
rdev->cq_coalescing.en_ring_idle_mode = val;
break;
case BNXT_RE_COAL_CQ_ENABLE:
if (val > 1)
return -EINVAL;
rdev->cq_coalescing.enable = val;
break;
default:
return -EINVAL;
}
return count;
}
static int cq_coal_cfg_show(struct seq_file *s, void *unused)
{
struct bnxt_re_cq_coal_param *param = s->private;
struct bnxt_re_dev *rdev = param->rdev;
int offset = param->offset;
u32 val = 0;
switch (offset) {
case BNXT_RE_COAL_CQ_BUF_MAXTIME:
val = rdev->cq_coalescing.buf_maxtime;
break;
case BNXT_RE_COAL_CQ_NORMAL_MAXBUF:
val = rdev->cq_coalescing.normal_maxbuf;
break;
case BNXT_RE_COAL_CQ_DURING_MAXBUF:
val = rdev->cq_coalescing.during_maxbuf;
break;
case BNXT_RE_COAL_CQ_EN_RING_IDLE_MODE:
val = rdev->cq_coalescing.en_ring_idle_mode;
break;
case BNXT_RE_COAL_CQ_ENABLE:
val = rdev->cq_coalescing.enable;
break;
default:
return -EINVAL;
}
seq_printf(s, "%u\n", val);
return 0;
}
DEFINE_SHOW_STORE_ATTRIBUTE(cq_coal_cfg);
static void bnxt_re_cleanup_cq_coal_debugfs(struct bnxt_re_dev *rdev)
{
debugfs_remove_recursive(rdev->cq_coal_cfg);
kfree(rdev->cq_coal_cfg_params);
}
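/*
 * Create one debugfs file per coalescing parameter under cq_coal_cfg/ when
 * the device advertises CQ coalescing support.
 */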
static void bnxt_re_init_cq_coal_debugfs(struct bnxt_re_dev *rdev)
{
struct bnxt_re_dbg_cq_coal_params *dbg_cq_coal_params;
int i;
if (!_is_cq_coalescing_supported(rdev->dev_attr->dev_cap_flags2))
return;
dbg_cq_coal_params = kzalloc(sizeof(*dbg_cq_coal_params), GFP_KERNEL);
if (!dbg_cq_coal_params)
return;
rdev->cq_coal_cfg = debugfs_create_dir("cq_coal_cfg", rdev->dbg_root);
rdev->cq_coal_cfg_params = dbg_cq_coal_params;
for (i = 0; i < BNXT_RE_COAL_CQ_MAX; i++) {
dbg_cq_coal_params->params[i].offset = i;
dbg_cq_coal_params->params[i].rdev = rdev;
debugfs_create_file(bnxt_re_cq_coal_str[i],
0600, rdev->cq_coal_cfg,
&dbg_cq_coal_params->params[i],
&cq_coal_cfg_fops);
}
}
void bnxt_re_debugfs_add_pdev(struct bnxt_re_dev *rdev)
{
struct pci_dev *pdev = rdev->en_dev->pdev;
@@ -374,10 +499,13 @@ void bnxt_re_debugfs_add_pdev(struct bnxt_re_dev *rdev)
rdev->cc_config, tmp_params,
&bnxt_re_cc_config_ops);
}
bnxt_re_init_cq_coal_debugfs(rdev);
}
void bnxt_re_debugfs_rem_pdev(struct bnxt_re_dev *rdev)
{
bnxt_re_cleanup_cq_coal_debugfs(rdev);
debugfs_remove_recursive(rdev->qp_debugfs);
debugfs_remove_recursive(rdev->cc_config);
kfree(rdev->cc_config_params);


@@ -33,4 +33,23 @@ struct bnxt_re_cc_param {
struct bnxt_re_dbg_cc_config_params {
struct bnxt_re_cc_param gen0_parms[BNXT_RE_CC_PARAM_GEN0];
};
struct bnxt_re_cq_coal_param {
struct bnxt_re_dev *rdev;
u32 offset;
};
enum bnxt_re_cq_coal_types {
BNXT_RE_COAL_CQ_BUF_MAXTIME,
BNXT_RE_COAL_CQ_NORMAL_MAXBUF,
BNXT_RE_COAL_CQ_DURING_MAXBUF,
BNXT_RE_COAL_CQ_EN_RING_IDLE_MODE,
BNXT_RE_COAL_CQ_ENABLE,
BNXT_RE_COAL_CQ_MAX
};
struct bnxt_re_dbg_cq_coal_params {
struct bnxt_re_cq_coal_param params[BNXT_RE_COAL_CQ_MAX];
};
#endif


@@ -601,7 +601,8 @@ static int bnxt_re_create_fence_mr(struct bnxt_re_pd *pd)
mr->qplib_mr.va = (u64)(unsigned long)fence->va;
mr->qplib_mr.total_size = BNXT_RE_FENCE_BYTES;
rc = bnxt_qplib_reg_mr(&rdev->qplib_res, &mr->qplib_mr, NULL,
BNXT_RE_FENCE_PBL_SIZE, PAGE_SIZE);
BNXT_RE_FENCE_PBL_SIZE, PAGE_SIZE,
_is_alloc_mr_unified(rdev->dev_attr->dev_cap_flags));
if (rc) {
ibdev_err(&rdev->ibdev, "Failed to register fence-MR\n");
goto fail;
@@ -4027,7 +4028,7 @@ struct ib_mr *bnxt_re_get_dma_mr(struct ib_pd *ib_pd, int mr_access_flags)
mr->qplib_mr.hwq.level = PBL_LVL_MAX;
mr->qplib_mr.total_size = -1; /* Infinite length */
rc = bnxt_qplib_reg_mr(&rdev->qplib_res, &mr->qplib_mr, NULL, 0,
PAGE_SIZE);
PAGE_SIZE, false);
if (rc)
goto fail_mr;
@@ -4257,7 +4258,8 @@ static struct ib_mr *__bnxt_re_user_reg_mr(struct ib_pd *ib_pd, u64 length, u64
umem_pgs = ib_umem_num_dma_blocks(umem, page_size);
rc = bnxt_qplib_reg_mr(&rdev->qplib_res, &mr->qplib_mr, umem,
umem_pgs, page_size);
umem_pgs, page_size,
_is_alloc_mr_unified(rdev->dev_attr->dev_cap_flags));
if (rc) {
ibdev_err(&rdev->ibdev, "Failed to register user MR - rc = %d\n", rc);
rc = -EIO;


@@ -1453,6 +1453,7 @@ static struct bnxt_re_dev *bnxt_re_dev_add(struct auxiliary_device *adev,
atomic_set(&rdev->stats.res.pd_count, 0);
rdev->cosq[0] = 0xFFFF;
rdev->cosq[1] = 0xFFFF;
rdev->cq_coalescing.enable = 1;
rdev->cq_coalescing.buf_maxtime = BNXT_QPLIB_CQ_COAL_DEF_BUF_MAXTIME;
if (bnxt_re_chip_gen_p7(en_dev->chip_num)) {
rdev->cq_coalescing.normal_maxbuf = BNXT_QPLIB_CQ_COAL_DEF_NORMAL_MAXBUF_P7;


@@ -2226,7 +2226,8 @@ int bnxt_qplib_create_cq(struct bnxt_qplib_res *res, struct bnxt_qplib_cq *cq)
req.cq_handle = cpu_to_le64(cq->cq_handle);
req.cq_size = cpu_to_le32(cq->max_wqe);
if (_is_cq_coalescing_supported(res->dattr->dev_cap_flags2)) {
if (_is_cq_coalescing_supported(res->dattr->dev_cap_flags2) &&
cq->coalescing->enable) {
req.flags |= cpu_to_le16(CMDQ_CREATE_CQ_FLAGS_COALESCING_VALID);
coalescing |= ((cq->coalescing->buf_maxtime <<
CMDQ_CREATE_CQ_BUF_MAXTIME_SFT) &

Some files were not shown because too many files have changed in this diff Show More