Commit f2060bdc21 ("nfs/localio: add refcounting for each iocb IO
associated with NFS pgio header") inadvertantly reintroduced the same
potential for __put_cred() triggering BUG_ON(cred == current->cred) that
commit 992203a1fb ("nfs/localio: restore creds before releasing pageio
data") fixed.
Fix this by saving and restoring the cred around each {read,write}_iter
call within the respective for loop of nfs_local_call_{read,write} using
scoped_with_creds().
NOTE: this fix started by first reverting the following commits:
94afb627df ("nfs: use credential guards in nfs_local_call_read()")
bff3c841f7 ("nfs: use credential guards in nfs_local_call_write()")
1d18101a64 ("Merge tag 'kernel-6.19-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs")
followed by narrowly fixing the cred lifetime issue by using
scoped_with_creds(). In doing so, this commit's changes appear more
extensive than they really are (as evidenced by comparing to v6.18's
fs/nfs/localio.c).
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/linux-next/20251205111942.4150b06f@canb.auug.org.au/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull more SoC driver updates from Arnd Bergmann:
"These updates came a little late, or were based on a later 6.18-rc tag
than the others:
- A new driver for cache management on cxl devices with memory shared
in a coherent cluster. This is part of the drivers/cache/ tree, but
unlike the other drivers that back the dma-mapping interfaces, this
one is needed only during CPU hotplug.
- A shared branch for reset controllers using swnode infrastructure
- Added support for new SoC variants in the Amlogic soc_device
identification
- Minor updates in Freescale, Microchip, Samsung, and Apple SoC
drivers"
* tag 'soc-drivers-6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (24 commits)
soc: samsung: exynos-pmu: fix device leak on regmap lookup
soc: samsung: exynos-pmu: Fix structure initialization
soc: fsl: qbman: use kmalloc_array() instead of kmalloc()
soc: fsl: qbman: add WQ_PERCPU to alloc_workqueue users
MAINTAINERS: Update email address for Christophe Leroy
MAINTAINERS: refer to intended file in STANDALONE CACHE CONTROLLER DRIVERS
cache: Support cache maintenance for HiSilicon SoC Hydra Home Agent
cache: Make top level Kconfig menu a boolean dependent on RISCV
MAINTAINERS: Add Jonathan Cameron to drivers/cache and add lib/cache_maint.c + header
arm64: Select GENERIC_CPU_CACHE_MAINTENANCE
lib: Support ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
soc: amlogic: meson-gx-socinfo: add new SoCs id
dt-bindings: arm: amlogic: meson-gx-ao-secure: support more SoCs
memregion: Support fine grained invalidate by cpu_cache_invalidate_memregion()
memregion: Drop unused IORES_DESC_* parameter from cpu_cache_invalidate_memregion()
dt-bindings: cache: sifive,ccache0: add a pic64gx compatible
MAINTAINERS: rename Microchip RISC-V entry
MAINTAINERS: add new soc drivers to Microchip RISC-V entry
soc: microchip: add mfd drivers for two syscon regions on PolarFire SoC
dt-bindings: soc: microchip: document the simple-mfd syscon on PolarFire SoC
...
Pull SoC driver updates from Arnd Bergmann:
"This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for power
management, as a preparation for future Tegra specific changes
- Reset controller updates with added drivers for LAN969x, eic770 and
RZ/G3S SoCs
- Protection of system controller registers on Renesas and Google
SoCs, to prevent trivially triggering a system crash from e.g.
debugfs access
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas"
* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
memory: tegra186-emc: Fix missing put_bpmp
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
soc: rockchip: grf: Add select correct PWM implementation on RK3368
soc/tegra: pmc: Add USB wake events for Tegra234
amba: tegra-ahb: Fix device leak on SMMU enable
...
Pull new SoC families update from Arnd Bergmann:
"These three new families of SoC are split out into a separate branch
because they touch multiple parts of the source tree and are better
left separate for the initial merge.
- Black Sesame Technologies C1200 is an automotive SoC using
Cortex-A78 CPU cores
- Anlogic dr1v90 (not to be confused with Amlogic) is an FPGA
platform using a single nuclei ux900 RISC-V core
- Tenstorrent Blackhole is a Neural Processing Unit using custom
"Tensix" cores for computation offload managed by Linux running on
SiFive X280 RISC-V cores.
Support for all three is rather rudimentary at the moment and will get
improved as device drivers are merged through other tree"
* tag 'soc-newsoc-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (24 commits)
MAINTAINERS: add Black Sesame Technologies (BST) ARM SoC support
arm64: defconfig: enable BST platform support
arm64: dts: bst: add support for Black Sesame Technologies C1200 CDCU1.0 board
arm64: Kconfig: add ARCH_BST for Black Sesame Technologies SoCs
dt-bindings: arm: add Black Sesame Technologies (bst) SoC
dt-bindings: vendor-prefixes: Add Black Sesame Technologies Co., Ltd.
MAINTAINERS: Setup support for Anlogic tree
riscv: defconfig: Enable Anlogic SoC
riscv: dts: anlogic: Add Milianke MLKPAI FS01 board
riscv: dts: Add initial Anlogic DR1V90 SoC device tree
riscv: Add Anlogic SoC famly Kconfig support
dt-bindings: serial: snps-dw-apb-uart: Add Anlogic DR1V90 uart
dt-bindings: timer: Add Anlogic DR1V90 ACLINT MTIMER
dt-bindings: riscv: Add Anlogic DR1V90
dt-bindings: riscv: Add Nuclei UX900 compatibles
dt-bindings: vendor-prefixes: Add Anlogic, Milianke and Nuclei
riscv: defconfig: Enable Tenstorrent SoCs
riscv: Kconfig.socs: Add ARCH_TENSTORRENT for Tenstorrent SoCs
riscv: dts: Add Tenstorrent Blackhole SoC PCIe cards
dt-bindings: interrupt-controller: Add Tenstorrent Blackhole compatible
...
Pull SoC devicetree updates from Arnd Bergmann:
"Three new SoCs got added in existing arm64 chip families:
- Renesas R-Car X5H (R8A78000) is a new generation of automotive
SoCs, based on 16 Cortex-A720 (Armv9.2) cores, which makes the the
currently highest-perforance embedded SoC.
- TI AM62L is a new variant of the AM62 family of industrial SoCs,
this one comes without a GPU.
- Qualcomm MSM8937 (Snapdragon 430) is an older mobile phone chip
based on Cortex-A53, and closely related to MSM8917 (Snapdragn
425), which we already support.
In addition, there are a good number of newly supported machines
across SoC families:
- Two Aspeed AST2600 (Cortex-A7) based BMC setups for large servers
- Mobile Phones and tables based on Mediatek MT6582, Nvidia Tegra124,
Qualcomm MSM8937 and Qualcomm MSM8939,
- Two Laptops based on Qualcomm SoCs: one using the older sdm850, the
other using x1p42100.
- One Router based on Rockchips RK3568
- 24 variants of the Enclustra Mercury system-on-module, all based on
32-bit Intel/Altera SocFPGA chips, plus two boards using 64-bit
SocFPGA Agilex chips..
- 30 industrial/embedded boards and single-board computers, using
various chips from NXP, Rockchips, Mediatek, TI, Amlogic, Qualcomm,
Spacemit, and Starfive.
In total there are 783 commits here, the majority of these improving
hardware support and cleaning up devicetree files across the tree,
with the majority of the changes going into the Qualcomm, NXP, Renesas
and Rockchips platforms"
* tag 'soc-dt-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (782 commits)
arm64: dts: mediatek: mt8195: Fix address range for JPEG decoder core 1
ARM: dts: samsung: exynos4412-midas: turn off SDIO WLAN chip during system suspend
ARM: dts: samsung: exynos4210-trats: turn off SDIO WLAN chip during system suspend
ARM: dts: samsung: exynos4210-i9100: turn off SDIO WLAN chip during system suspend
ARM: dts: samsung: universal_c210: turn off SDIO WLAN chip during system suspend
arm64: dts: amlogic: meson-g12b: Fix L2 cache reference for S922X CPUs
arm64: dts: Add gpio_intc node for Amlogic S7D SoCs
arm64: dts: Add gpio_intc node for Amlogic S7 SoCs
arm64: dts: Add gpio_intc node for Amlogic S6 SoCs
arm64: dts: amlogic: s7d: add ao secure node
arm64: dts: amlogic: s7: add ao secure node
arm64: dts: amlogic: s6: add ao secure node
arm64: dts: amlogic: Fix the register name of the 'DBI' region
dts: arm64: amlogic: add a5 pinctrl node
arm64: dts: amlogic: s7d: add power domain controller node
arm64: dts: amlogic: s7: add power domain controller node
arm64: dts: amlogic: s6: add power domain controller node
dts: arm64: amlogic: Add ISP related nodes for C3
arm64: dts: meson: add initial device-tree for Tanix TX9 Pro
dt-bindings: arm: amlogic: add support for Tanix TX9 Pro
...
Pull SoC ARM code updates from Arnd Bergmann:
"These are very minimal changes for 32-bit Arm platform code, enabling
SMP bringup for one more SoC variant (mt6582) among spelling changes
and a build warning fix"
* tag 'soc-arm-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
ARM: omap1: avoid symbol clashes in fiq handler
ARM: gemini: fix typos in comments
ARM: versatile: Fix typo in versatile.c
ARM: OMAP2+: Fix falg->flag typo in omap_smc2()
ARM: mediatek: add MT6582 smp bring up code
ARM: mediatek: add board_dt_compat entry for the MT6582 SoC
Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of exits
(and worse performance) on z10 and earlier processor; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper of a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM corrupt the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertantly trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easer to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
Pull UML updates from Johannes Berg:
"Apart from the usual small churn, we have
- initial SMP support (only kernel)
- major vDSO cleanups (and fixes for 32-bit)"
* tag 'uml-for-linux-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (33 commits)
um: Disable KASAN_INLINE when STATIC_LINK is selected
um: Don't rename vmap to kernel_vmap
um: drivers: virtio: use string choices helper
um: Always set up AT_HWCAP and AT_PLATFORM
x86/um: Remove FIXADDR_USER_START and FIXADDR_USE_END
um: Remove __access_ok_vsyscall()
um: Remove redundant range check from __access_ok_vsyscall()
um: Remove fixaddr_user_init()
x86/um: Drop gate area handling
x86/um: Do not inherit vDSO from host
um: Split out default elf_aux_hwcap
x86/um: Move ELF_PLATFORM fallback to x86-specific code
um: Split out default elf_aux_platform
um: Avoid circular dependency on asm-offsets in pgtable.h
um: Enable SMP support on x86
asm-generic: percpu: Add assembly guard
um: vdso: Remove getcpu support on x86
um: Add initial SMP support
um: Define timers on a per-CPU basis
um: Determine sleep based on need_resched()
...
Pull RISC-V updates from Paul Walmsley:
- Enable parallel hotplug for RISC-V
- Optimize vector regset allocation for ptrace()
- Add a kernel selftest for the vector ptrace interface
- Enable the userspace RAID6 test to build and run using RISC-V vectors
- Add initial support for the Zalasr RISC-V ratified ISA extension
- For the Zicbop RISC-V ratified ISA extension to userspace, expose
hardware and kernel support to userspace and add a kselftest for
Zicbop
- Convert open-coded instances of 'asm goto's that are controlled by
runtime ALTERNATIVEs to use riscv_has_extension_{un,}likely(),
following arm64's alternative_has_cap_{un,}likely()
- Remove an unnecessary mask in the GFP flags used in some calls to
pagetable_alloc()
* tag 'riscv-for-linus-6.19-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
selftests/riscv: Add Zicbop prefetch test
riscv: hwprobe: Expose Zicbop extension and its block size
riscv: Introduce Zalasr instructions
riscv: hwprobe: Export Zalasr extension
dt-bindings: riscv: Add Zalasr ISA extension description
riscv: Add ISA extension parsing for Zalasr
selftests: riscv: Add test for the Vector ptrace interface
riscv: ptrace: Optimize the allocation of vector regset
raid6: test: Add support for RISC-V
raid6: riscv: Allow code to be compiled in userspace
raid6: riscv: Prevent compiler from breaking inline vector assembly code
riscv: cmpxchg: Use riscv_has_extension_likely
riscv: bitops: Use riscv_has_extension_likely
riscv: hweight: Use riscv_has_extension_likely
riscv: checksum: Use riscv_has_extension_likely
riscv: pgtable: Use riscv_has_extension_unlikely
riscv: Remove __GFP_HIGHMEM masking
RISC-V: Enable HOTPLUG_PARALLEL for secondary CPUs
Pull powerpc updates from Michael Ellerman:
- Restore clearing of MSR[RI] at interrupt/syscall exit on 32-bit
- Fix unpaired stwcx on interrupt exit on 32-bit
- Fix race condition leading to double list-add in
mac_hid_toggle_emumouse()
- Fix mprotect on book3s 32-bit
- Fix SLB multihit issue during SLB preload with 64-bit hash MMU
- Add support for crashkernel CMA reservation
- Add die_id and die_cpumask for Power10 & later to expose chip
hemispheres
- A series of minor fixes and improvements to the hash SLB code
Thanks to Antonio Alvarez Feijoo, Ben Collins, Bhaskar Chowdhury,
Christophe Leroy, Daniel Thompson, Dave Vasilevsky, Donet Tom,
J. Neuschäfer, Kunwu Chan, Long Li, Naresh Kamboju, Nathan Chancellor,
Ritesh Harjani (IBM), Shirisha G, Shrikanth Hegde, Sourabh Jain, Srikar
Dronamraju, Stephen Rothwell, Thomas Zimmermann, Venkat Rao Bagalkote,
and Vishal Chourasia.
* tag 'powerpc-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (32 commits)
macintosh/via-pmu-backlight: Include <linux/fb.h> and <linux/of.h>
powerpc/powermac: backlight: Include <linux/of.h>
powerpc/64s/slb: Add no_slb_preload early cmdline param
powerpc/64s/slb: Make preload_add return type as void
powerpc/ptdump: Dump PXX level info for kernel_page_tables
powerpc/64s/pgtable: Enable directMap counters in meminfo for Hash
powerpc/64s/hash: Update directMap page counters for Hash
powerpc/64s/hash: Hash hpt_order should be only available with Hash MMU
powerpc/64s/hash: Improve hash mmu printk messages
powerpc/64s/hash: Fix phys_addr_t printf format in htab_initialize()
powerpc/64s/ptdump: Fix kernel_hash_pagetable dump for ISA v3.00 HPTE format
powerpc/64s/hash: Restrict stress_hpt_struct memblock region to within RMA limit
powerpc/64s/slb: Fix SLB multihit issue during SLB preload
powerpc, mm: Fix mprotect on book3s 32-bit
powerpc/smp: Expose die_id and die_cpumask
powerpc/83xx: Add a null pointer check to mcu_gpiochip_add
arch:powerpc:tools This file was missing shebang line, so added it
kexec: Include kernel-end even without crashkernel
powerpc: p2020: Rename wdt@ nodes to watchdog@
powerpc: 86xx: Rename wdt@ nodes to watchdog@
...
Pull vfs fixes from Christian Brauner:
- Fix a type conversion bug in the ipc subsystem
- Fix per-dentry timeout warning in autofs
- Drop the fd conversion from sockets
- Move assert from iput_not_last() to iput()
- Fix reversed check in filesystems_freeze_callback()
- Use proper uapi types for new struct delegation definitions
* tag 'vfs-6.19-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
vfs: use UAPI types for new struct delegation definition
mqueue: correct the type of ro to int
Revert "net/socket: convert sock_map_fd() to FD_ADD()"
autofs: fix per-dentry timeout warning
fs: assert on I_FREEING not being set in iput() and iput_not_last()
fs: PM: Fix reverse check in filesystems_freeze_callback()
Pull exfat updates from Namjae Jeon:
- Fix a remount failure caused by differing process masks by inheriting
the original mount options during the remount process
- Fix a potential divide-by-zero error and system crash in
exfat_allocate_bitmap that occurred when the readahead count was zero
- Add validation for directory cluster bitmap bits to prevent directory
and root cluster from being incorrectly zeroed out on corrupted
images
- Clear the post-EOF page cache when extending a file to prevent stale
mmap data from becoming visible, addressing an generic/363 failure
- Fix a reference count leak in exfat_find by properly releasing the
dentry set in specific error paths
* tag 'exfat-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: fix remount failure in different process environments
exfat: fix divide-by-zero in exfat_allocate_bitmap
exfat: validate the cluster bitmap bits of directory
exfat: zero out post-EOF page cache on file extension
exfat: fix refcount leak in exfat_find
Pull fuse updates from Miklos Szeredi:
- Add mechanism for cleaning out unused, stale dentries; controlled via
a module option (Luis Henriques)
- Fix various bugs
- Cleanups
* tag 'fuse-update-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: Uninitialized variable in fuse_epoch_work()
fuse: fix io-uring list corruption for terminated non-committed requests
fuse: signal that a fuse inode should exhibit local fs behaviors
fuse: Always flush the page cache before FOPEN_DIRECT_IO write
fuse: Invalidate the page cache after FOPEN_DIRECT_IO write
fuse: rename 'namelen' to 'namesize'
fuse: use strscpy instead of strcpy
fuse: refactor fuse_conn_put() to remove negative logic.
fuse: new work queue to invalidate dentries from old epochs
fuse: new work queue to periodically invalidate expired dentries
dcache: export shrink_dentry_list() and add new helper d_dispose_if_unused()
fuse: add WARN_ON and comment for RCU revalidate
fuse: Fix whitespace for fuse_uring_args_to_ring() comment
fuse: missing copy_finish in fuse-over-io-uring argument copies
fuse: fix readahead reclaim deadlock
Pull persistent dentry infrastructure and conversion from Al Viro:
"Some filesystems use a kinda-sorta controlled dentry refcount leak to
pin dentries of created objects in dcache (and undo it when removing
those). A reference is grabbed and not released, but it's not actually
_stored_ anywhere.
That works, but it's hard to follow and verify; among other things, we
have no way to tell _which_ of the increments is intended to be an
unpaired one. Worse, on removal we need to decide whether the
reference had already been dropped, which can be non-trivial if that
removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag
(DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
claims responsibility for +1 in refcount.
The end result this series is aiming for:
- get these unbalanced dget() and dput() replaced with new primitives
that would, in addition to adjusting refcount, set and clear
persistency flag.
- instead of having kill_litter_super() mess with removing the
remaining "leaked" references (e.g. for all tmpfs files that hadn't
been removed prior to umount), have the regular
shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
dropping the corresponding reference if it had been set. After that
kill_litter_super() becomes an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many
places in too many filesystems. It has to be split into a series.
This work has really started early in 2024; quite a few preliminary
pieces have already gone into mainline. This chunk is finally getting
to the meat of that stuff - infrastructure and most of the conversions
to it.
Some pieces are still sitting in the local branches, but the bulk of
that stuff is here"
* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
d_make_discardable(): warn if given a non-persistent dentry
kill securityfs_recursive_remove()
convert securityfs
get rid of kill_litter_super()
convert rust_binderfs
convert nfsctl
convert rpc_pipefs
convert hypfs
hypfs: swich hypfs_create_u64() to returning int
hypfs: switch hypfs_create_str() to returning int
hypfs: don't pin dentries twice
convert gadgetfs
gadgetfs: switch to simple_remove_by_name()
convert functionfs
functionfs: switch to simple_remove_by_name()
functionfs: fix the open/removal races
functionfs: need to cancel ->reset_work in ->kill_sb()
functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
functionfs: don't abuse ffs_data_closed() on fs shutdown
convert selinuxfs
...
Pull MM updates from Andrew Morton:
"__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
Rework the vmalloc() code to support non-blocking allocations
(GFP_ATOIC, GFP_NOWAIT)
"ksm: fix exec/fork inheritance" (xu xin)
Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
inherited across fork/exec
"mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
Some light maintenance work on the zswap code
"mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
Enhance the /sys/kernel/debug/page_owner debug feature by adding
unique identifiers to differentiate the various stack traces so
that userspace monitoring tools can better match stack traces over
time
"mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
Minor alterations to the page allocator's per-cpu-pages feature
"Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
Address a scalability issue in userfaultfd's UFFDIO_MOVE operation
"kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)
"drivers/base/node: fold node register and unregister functions" (Donet Tom)
Clean up the NUMA node handling code a little
"mm: some optimizations for prot numa" (Kefeng Wang)
Cleanups and small optimizations to the NUMA allocation hinting
code
"mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
Address long lock hold times at boot on large machines. These were
causing (harmless) softlockup warnings
"optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
Remove some now-unnecessary work from page reclaim
"mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
Enhance the DAMOS auto-tuning feature
"mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
configuration
"expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
Enhance the new(ish) file_operations.mmap_prepare() method and port
additional callsites from the old ->mmap() over to ->mmap_prepare()
"Fix stale IOTLB entries for kernel address space" (Lu Baolu)
Fix a bug (and possible security issue on non-x86) in the IOMMU
code. In some situations the IOMMU could be left hanging onto a
stale kernel pagetable entry
"mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
Clean up and optimize the folio splitting code
"mm, swap: misc cleanup and bugfix" (Kairui Song)
Some cleanups and a minor fix in the swap discard code
"mm/damon: misc documentation fixups" (SeongJae Park)
"mm/damon: support pin-point targets removal" (SeongJae Park)
Permit userspace to remove a specific monitoring target in the
middle of the current targets list
"mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
A couple of cleanups related to mm header file inclusion
"mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
improve the selection of swap devices for NUMA machines
"mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
Change the memory block labels from macros to enums so they will
appear in kernel debug info
"ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
Address an inefficiency when KSM unmerges an address range
"mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
Fix leaks and unhandled malloc() failures in DAMON userspace unit
tests
"some cleanups for pageout()" (Baolin Wang)
Clean up a couple of minor things in the page scanner's
writeback-for-eviction code
"mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
Move hugetlb's sysfs/sysctl handling code into a new file
"introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
Make the VMA guard regions available in /proc/pid/smaps and
improves the mergeability of guarded VMAs
"mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
Reduce mmap lock contention for callers performing VMA guard region
operations
"vma_start_write_killable" (Matthew Wilcox)
Start work on permitting applications to be killed when they are
waiting on a read_lock on the VMA lock
"mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
Add additional userspace testing of DAMON's "commit" feature
"mm/damon: misc cleanups" (SeongJae Park)
"make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
VMA is merged with another
"mm: support device-private THP" (Balbir Singh)
Introduce support for Transparent Huge Page (THP) migration in zone
device-private memory
"Optimize folio split in memory failure" (Zi Yan)
"mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
Some more cleanups in the folio splitting code
"mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
Clean up our handling of pagetable leaf entries by introducing the
concept of 'software leaf entries', of type softleaf_t
"reparent the THP split queue" (Muchun Song)
Reparent the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcg's linger for too long, consuming memory
resources
"unify PMD scan results and remove redundant cleanup" (Wei Yang)
A little cleanup in the hugepage collapse code
"zram: introduce writeback bio batching" (Sergey Senozhatsky)
Improve zram writeback efficiency by introducing batched bio
writeback support
"memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
Clean up our handling of the interrupt safety of some memcg stats
"make vmalloc gfp flags usage more apparent" (Vishal Moola)
Clean up vmalloc's handling of incoming GFP flags
"mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
Teach soft dirty and userfaultfd write protect tracking to use
RISC-V's Svrsw60t59b extension
"mm: swap: small fixes and comment cleanups" (Youngjun Park)
Fix a small bug and clean up some of the swap code
"initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
Start work on converting the vma struct's flags to a bitmap, so we
stop running out of them, especially on 32-bit
"mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
Address a possible bug in the swap discard code and clean things
up a little
[ This merge also reverts commit ebb9aeb980 ("vfio/nvgrace-gpu:
register device memory for poison handling") because it looks
broken to me, I've asked for clarification - Linus ]
* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm: fix vma_start_write_killable() signal handling
mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
mm/swapfile: fix list iteration when next node is removed during discard
fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
mm/kfence: add reboot notifier to disable KFENCE on shutdown
memcg: remove inc/dec_lruvec_kmem_state helpers
selftests/mm/uffd: initialize char variable to Null
mm: fix DEBUG_RODATA_TEST indentation in Kconfig
mm: introduce VMA flags bitmap type
tools/testing/vma: eliminate dependency on vma->__vm_flags
mm: simplify and rename mm flags function for clarity
mm: declare VMA flags by bit
zram: fix a spelling mistake
mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
pagemap: update BUDDY flag documentation
mm: swap: remove scan_swap_map_slots() references from comments
mm: swap: change swap_alloc_slow() to void
mm, swap: remove redundant comment for read_swap_cache_async
mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
...
Pull sysctl updates from Joel Granados:
- Move jiffies converters out of kernel/sysctl.c
Move the jiffies converters into kernel/time/jiffies.c and replace
the pipe-max-size proc_handler converter with a macro based version.
This is all part of the effort to relocate non-sysctl logic out of
kernel/sysctl.c into more relevant subsystems. No functional changes.
- Generalize proc handler converter creation
Remove duplicated sysctl converter logic by consolidating it in
macros. These are used inside sysctl core as well as in pipe.c and
jiffies.c. Converter kernel and user space pointer args are now
automatically const qualified for the convenience of the caller. No
functional changes.
- Miscellaneous
Fix kernel-doc format warnings, remove unnecessary __user
qualifiers, and move the nmi_watchdog sysctl into .rodata.
- Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. It went into linux-next after rc2, giving it a good 4/5 weeks
of testing.
* tag 'sysctl-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: (21 commits)
sysctl: Wrap do_proc_douintvec with the public function proc_douintvec_conv
sysctl: Create pipe-max-size converter using sysctl UINT macros
sysctl: Move proc_doulongvec_ms_jiffies_minmax to kernel/time/jiffies.c
sysctl: Move jiffies converters to kernel/time/jiffies.c
sysctl: Move UINT converter macros to sysctl header
sysctl: Move INT converter macros to sysctl header
sysctl: Allow custom converters from outside sysctl
sysctl: remove __user qualifier from stack_erasing_sysctl buffer argument
sysctl: Create macro for user-to-kernel uint converter
sysctl: Add optional range checking to SYSCTL_UINT_CONV_CUSTOM
sysctl: Create unsigned int converter using new macro
sysctl: Add optional range checking to SYSCTL_INT_CONV_CUSTOM
sysctl: Create integer converters with one macro
sysctl: Create converter functions with two new macros
sysctl: Discriminate between kernel and user converter params
sysctl: Indicate the direction of operation with macro names
sysctl: Remove superfluous __do_proc_* indirection
sysctl: Remove superfluous tbl_data param from "dovec" functions
sysctl: Replace void pointer with const pointer to ctl_table
sysctl: fix kernel-doc format warning
...
Samsung DTS ARM changes for v6.19
Fix WiFi on Exynos4210 and Exynos4412 boards with Broadcom chip after
system suspend and resume, by using cap-power-off-card to power off the
WiFi during suspend.
* tag 'samsung-dt-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
ARM: dts: samsung: exynos4412-midas: turn off SDIO WLAN chip during system suspend
ARM: dts: samsung: exynos4210-trats: turn off SDIO WLAN chip during system suspend
ARM: dts: samsung: exynos4210-i9100: turn off SDIO WLAN chip during system suspend
ARM: dts: samsung: universal_c210: turn off SDIO WLAN chip during system suspend
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Samsung SoC drivers for v6.19, part two
Two fixes for Exynos PMU (Power Management Unit) driver:
1. Silence lockdep warning being actually a false positive, but quite
disturbing during testing. Issue was introduced in v6.18.
2. Drop device refcount when requesting device regmap with
exynos_get_pmu_regmap_by_phandle(). Issue was introduced much
earlier (around v6.9), with code being rewritten in between.
* tag 'samsung-drivers-6.19-2-late' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
soc: samsung: exynos-pmu: fix device leak on regmap lookup
soc: samsung: exynos-pmu: Fix structure initialization
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The ams-delta-fiq-handler.S file has a number of symbols with fairly
generic names, including one named 'exit' that causes a compiler warning
in some configuration options:
vmlinux.o: error: exit() function name creates ambiguity with -ffunction-sections
Change all these symbols to use a .L prefix to make them local to
the fiq handler.
Reviewed-by: Janusz Krzysztofik <jmkrzyszt@gmail.com>
Link: https://lore.kernel.org/r/20251204095355.1032786-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The check that determines if the message that warns about the per-dentry
timeout being greater than the super block timeout is not correct.
The initial value for this field is -1 and the type of the field is
unsigned long.
I could change the type to long but the message is in the wrong place
too, it should come after the timeout setting. So leave everything else
as it is and move the message and check the timeout is actually set
as an additional condition on issuing the message. Also fix the timeout
comparison.
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://patch.msgid.link/20251111060439.19593-2-raven@themaw.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
The freeze_all_ptr check in filesystems_freeze_callback() introduced by
commit a3f8f86627 ("power: always freeze efivarfs") is reverse which
quite confusingly causes all file systems to be frozen when
filesystem_freeze_enabled is false.
On my systems it causes the WARN_ON_ONCE() in __set_task_frozen() to
trigger, most likely due to an attempt to freeze a file system that is
not ready for that.
Add a logical negation to the check in question to reverse it as
appropriate.
Fixes: a3f8f86627 ("power: always freeze efivarfs")
Cc: 6.18+ <stable@vger.kernel.org> # 6.18+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://patch.msgid.link/12788397.O9o76ZdvQC@rafael.j.wysocki
Signed-off-by: Christian Brauner <brauner@kernel.org>
The kernel test robot reported that the exFAT remount operation
failed. The reason for the failure was that the process's umask
is different between mount and remount, causing fs_fmask and
fs_dmask are changed.
Potentially, both gid and uid may also be changed. Therefore, when
initializing fs_context for remount, inherit these mount options
from the options used during mount.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202511251637.81670f5c-lkp@intel.com
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
The variable max_ra_count can be 0 in exfat_allocate_bitmap(),
which causes a divide-by-zero error in the subsequent modulo operation
(i % max_ra_count), leading to a system crash.
When max_ra_count is 0, it means that readahead is not used. This patch
load the bitmap without readahead.
Fixes: 9fd688678d ("exfat: optimize allocation bitmap loading time")
Reported-by: Jiaming Zhang <r772577952@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Syzbot created this issue by testing an image that did not have the root
cluster bitmap bit marked. After accessing a file through the root
directory via exfat_lookup, when creating a file again with mkdir,
the root cluster bit can be allocated for direcotry, which can cause
the root cluster to be zeroed out and the same entry can be allocated
in the same cluster. This patch improved this issue by adding
exfat_test_bitmap to validate the cluster bits of the root directory
and directory. And the first cluster bit of the root directory should
never be unset except when storage is corrupted. This bit is set to
allow operations after mount.
Reported-by: syzbot+5216036fc59c43d1ee02@syzkaller.appspotmail.com
Tested-by: syzbot+5216036fc59c43d1ee02@syzkaller.appspotmail.com
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
xfstests generic/363 was failing due to unzeroed post-EOF page
cache that allowed mmap writes beyond EOF to become visible
after file extension.
For example, in following xfs_io sequence, 0x22 should not be
written to the file but would become visible after the extension:
xfs_io -f -t -c "pwrite -S 0x11 0 8" \
-c "mmap 0 4096" \
-c "mwrite -S 0x22 32 32" \
-c "munmap" \
-c "pwrite -S 0x33 512 32" \
$testfile
This violates the expected behavior where writes beyond EOF via
mmap should not persist after the file is extended. Instead, the
extended region should contain zeros.
Fix this by using truncate_pagecache() to truncate the page cache
after the current EOF when extending the file.
Signed-off-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Fix refcount leaks in `exfat_find` related to `exfat_get_dentry_set`.
Function `exfat_get_dentry_set` would increase the reference counter of
`es->bh` on success. Therefore, `exfat_put_dentry_set` must be called
after `exfat_get_dentry_set` to ensure refcount consistency. This patch
relocate two checks to avoid possible leaks.
Fixes: 82ebecdc74 ("exfat: fix improper check of dentry.stream.valid_size")
Fixes: 13940cef95 ("exfat: add a check for invalid data size")
Signed-off-by: Shuhao Fu <sfual@cse.ust.hk>
Reviewed-by: Yuezhang Mo <Yuezhang.Mo@sony.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
KVM/arm64 updates for 6.19
- Support for userspace handling of synchronous external aborts (SEAs),
allowing the VMM to potentially handle the abort in a non-fatal
manner.
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers in
hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ.
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU.
- Allow page table destruction to reschedule, fixing long need_resched
latencies observed when destroying a large VM.
- Minor fixes to KVM and selftests
KVM/riscv changes for 6.19
- SBI MPXY support for KVM guest
- New KVM_EXIT_FAIL_ENTRY_NO_VSFILE for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support enabling dirty log gradually in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
LoongArch KVM changes for v6.19
1. Get VM PMU capability from HW GCFG register.
2. Add AVEC basic support.
3. Use 64-bit register definition for EIOINTC.
4. Add KVM timer test cases for tools/selftests.
* kvm-arm64/nv-xnx-haf: (22 commits)
: Support for FEAT_XNX and FEAT_HAF in nested
:
: Add support for a couple of MMU-related features that weren't
: implemented by KVM's software page table walk:
:
: - FEAT_XNX: Allows the hypervisor to describe execute permissions
: separately for EL0 and EL1
:
: - FEAT_HAF: Hardware update of the Access Flag, which in the context of
: nested means software walkers must also set the Access Flag.
:
: The series also adds some basic support for testing KVM's emulation of
: the AT instruction, including the implementation detail that AT sets the
: Access Flag in KVM.
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: arm64: nv: Forward FEAT_XNX permissions to the shadow stage-2
...
Signed-off-by: Oliver Upton <oupton@kernel.org>
* kvm-arm64/vgic-lr-overflow: (50 commits)
: Support for VGIC LR overflows, courtesy of Marc Zyngier
:
: Address deficiencies in KVM's GIC emulation when a vCPU has more active
: IRQs than can be represented in the VGIC list registers. Sort the AP
: list to prioritize inactive and pending IRQs, potentially spilling
: active IRQs outside of the LRs.
:
: Handle deactivation of IRQs outside of the LRs for both EOImode=0/1,
: which involves special consideration for SPIs being deactivated from a
: different vCPU than the one that acked it.
KVM: arm64: Convert ICH_HCR_EL2_TDIR cap to EARLY_LOCAL_CPU_FEATURE
KVM: arm64: selftests: vgic_irq: Add timer deactivation test
KVM: arm64: selftests: vgic_irq: Add Group-0 enable test
KVM: arm64: selftests: vgic_irq: Add asymmetric SPI deaectivation test
KVM: arm64: selftests: vgic_irq: Perform EOImode==1 deactivation in ack order
KVM: arm64: selftests: vgic_irq: Remove LR-bound limitation
KVM: arm64: selftests: vgic_irq: Exclude timer-controlled interrupts
KVM: arm64: selftests: vgic_irq: Change configuration before enabling interrupt
KVM: arm64: selftests: vgic_irq: Fix GUEST_ASSERT_IAR_EMPTY() helper
KVM: arm64: selftests: gic_v3: Disable Group-0 interrupts by default
KVM: arm64: selftests: gic_v3: Add irq group setting helper
KVM: arm64: GICv2: Always trap GICV_DIR register
KVM: arm64: GICv2: Handle deactivation via GICV_DIR traps
KVM: arm64: GICv2: Handle LR overflow when EOImode==0
KVM: arm64: GICv3: Force exit to sync ICH_HCR_EL2.En
KVM: arm64: GICv3: nv: Plug L1 LR sync into deactivation primitive
KVM: arm64: GICv3: nv: Resync LRs/VMCR/HCR early for better MI emulation
KVM: arm64: GICv3: Avoid broadcast kick on CPUs lacking TDIR
KVM: arm64: GICv3: Handle in-LR deactivation when possible
KVM: arm64: GICv3: Add SPI tracking to handle asymmetric deactivation
...
Signed-off-by: Oliver Upton <oupton@kernel.org>
* kvm-arm64/sea-user:
: Userspace handling of SEAs, courtesy of Jiaqi Yan
:
: Add support for processing external aborts in userspace in situations
: where the host has failed to do so, allowing the VMM to potentially
: reinject an external abort into the VM.
Documentation: kvm: new UAPI for handling SEA
KVM: selftests: Test for KVM_EXIT_ARM_SEA
KVM: arm64: VM exit to userspace to handle SEA
Signed-off-by: Oliver Upton <oupton@kernel.org>
* kvm-arm64/misc:
: Miscellaneous fixes/cleanups for KVM/arm64
:
: - Fix for need_resched warnings on non-preemptible kernels when
: tearing down a VM's stage-2
:
: - Improvements to KVM struct allocation, getting rid of pointless
: __GFP_HIGHMEM and switching to kvzalloc()
:
: - SYNC ITS configuration before injecting LPIs in vgic_lpi_stress
: selftest
KVM: arm64: Reschedule as needed when destroying the stage-2 page-tables
KVM: arm64: Split kvm_pgtable_stage2_destroy()
KVM: arm64: Only drop references on empty tables in stage2_free_walker
KVM: selftests: SYNC after guest ITS setup in vgic_lpi_stress
KVM: selftests: Assert GICR_TYPER.Processor_Number matches selftest CPU number
KVM: arm64: Use kvzalloc() for kvm struct allocation
KVM: arm64: Drop useless __GFP_HIGHMEM from kvm struct allocation
Signed-off-by: Oliver Upton <oupton@kernel.org>
A guest can write 1 to TCR_ELx.HA, making the KVM software walker update
the access flag in a table descriptor even if FEAT_HAFDBS is not present.
Avoid this by making wi->ha depend on FEAT_HAFDBS being enabled in the VM,
similar to how the software walker treats FEAT_HPDS.
This is not needed for VTCR_EL2.HA, since a guest will always write to
the in-memory copy of the register, where the HA bit is masked (set to
0) by KVM if the VM doesn't have FEAT_HAFDBS.
Fixes: c59ca4b5b0c3 ("KVM: arm64: Implement HW access flag management in stage-1 SW PTW")
Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
Link: https://msgid.link/20251128100946.74210-5-alexandru.elisei@arm.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
Clang warns (or errors with CONFIG_WERROR=y / W=e):
arch/arm64/kvm/hyp/pgtable.c:757:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
757 | }
| ^
With older versions of clang (15 and older) and GCC (at least the minimum
supported, 8.1), this is an unconditional hard error:
arch/arm64/kvm/hyp/pgtable.c: In function 'kvm_pgtable_stage2_pte_prot':
arch/arm64/kvm/hyp/pgtable.c:756:2: error: label at end of compound statement
default:
^~~~~~~
arch/arm64/kvm/hyp/pgtable.c:756:10: error: label at end of compound statement: expected statement
default:
^
;
Add a break statement to this default case to clear up the error/warning.
Fixes: 2608563b46 ("KVM: arm64: Add support for FEAT_XNX stage-2 permissions")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251125-arm64-kvm-hyp-pgtable-fix-c23-ext-warn-v1-1-98b506ddefbf@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Atomically update the Access flag at stage-1 when the guest has
configured the MMU to do so. Make the implementation choice (and liberal
interpretation of speculation) that any access type updates the Access
flag, including AT and CMO instructions.
Restart the entire walk by returning to the exception-generating
instruction in the case of a failed Access flag update.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251124190158.177318-13-oupton@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
KVM's software PTW will soon support 'hardware' updates to the access
flag. Similar to fault handling, races to update the descriptor will be
handled by restarting the instruction. Prepare for this by propagating
errors up to the AT emulation, only retiring the instruction if the walk
succeeds.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251124190158.177318-12-oupton@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Implementing FEAT_HAFDBS in KVM's software PTWs requires the ability to
CAS a descriptor to update the in-memory value. Add an accessor to do
exactly that, coping with the fact that guest descriptors are in user
memory (duh).
While FEAT_LSE required on any system that implements NV, KVM now uses
the stage-1 PTW for non-nested use cases meaning an LL/SC implementation
is necessary as well.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251124190158.177318-11-oupton@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Implementing FEAT_HAFDBS means adding another descriptor accessor that
needs to deal with the guest-configured endianness. Prepare by moving
the endianness handling into the read accessor and out of the main body
of the S1/S2 PTWs.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251124190158.177318-9-oupton@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Although KVM doesn't make direct use of the feature, guest hypervisors
can use FEAT_XNX which influences the permissions of the shadow stage-2.
Update ptdump to separately print the privileged and unprivileged
execute permissions.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251124190158.177318-5-oupton@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Patch series "mm/swapfile: fix and cleanup swap list iterations", v2.
This series fixes a potential list iteration issue in swap_sync_discard()
when devices are removed, and includes a cleanup for
__folio_throttle_swaprate().
This patch (of 2):
When the next node is removed from the plist (e.g. by swapoff),
plist_del() makes the node point to itself, causing the iteration to loop
on the same entry indefinitely.
Add a plist_node_empty() check to detect this case and restart iteration,
allowing swap_sync_discard() to continue processing remaining swap devices
that still have pending discard entries.
Additionally, switch from swap_avail_lock/swap_avail_head to
swap_lock/swap_active_head so that iteration is only affected by swapoff
operations rather than frequent availability changes, reducing exceptional
condition checks and lock contention.
Link: https://lkml.kernel.org/r/20251127100303.783198-1-youngjun.park@lge.com
Link: https://lkml.kernel.org/r/20251127100303.783198-2-youngjun.park@lge.com
Fixes: 686ea517f471 ("mm, swap: do not perform synchronous discard during allocation")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Suggested-by: Kairui Song <kasong@tencent.com>
Acked-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "initial work on making VMA flags a bitmap", v3.
We are in the rather silly situation that we are running out of VMA flags
as they are currently limited to a system word in size.
This leads to absurd situations where we limit features to 64-bit
architectures only because we simply do not have the ability to add a flag
for 32-bit ones.
This is very constraining and leads to hacks or, in the worst case, simply
an inability to implement features we want for entirely arbitrary reasons.
This also of course gives us something of a Y2K type situation in mm where
we might eventually exhaust all of the VMA flags even on 64-bit systems.
This series lays the groundwork for getting away from this limitation by
establishing VMA flags as a bitmap whose size we can increase in future
beyond 64 bits if required.
This is necessarily a highly iterative process given the extensive use of
VMA flags throughout the kernel, so we start by performing basic steps.
Firstly, we declare VMA flags by bit number rather than by value,
retaining the VM_xxx fields but in terms of these newly introduced
VMA_xxx_BIT fields.
While we are here, we use sparse annotations to ensure that, when dealing
with VMA bit number parameters, we cannot be passed values which are not
declared as such - providing some useful type safety.
We then introduce an opaque VMA flag type, much like the opaque mm_struct
flag type introduced in commit bb6525f2f8 ("mm: add bitmap mm->flags
field"), which we establish in union with vma->vm_flags (but still set at
system word size meaning there is no functional or data type size change).
We update the vm_flags_xxx() helpers to use this new bitmap, introducing
sensible helpers to do so.
This series lays the foundation for further work to expand the use of
bitmap VMA flags and eventually eliminate these arbitrary restrictions.
This patch (of 4):
In order to lay the groundwork for VMA flags being a bitmap rather than a
system word in size, we need to be able to consistently refer to VMA flags
by bit number rather than value.
Take this opportunity to do so in an enum which we which is additionally
useful for tooling to extract metadata from.
This additionally makes it very clear which bits are being used for what
at a glance.
We use the VMA_ prefix for the bit values as it is logical to do so since
these reference VMAs. We consistently suffix with _BIT to make it clear
what the values refer to.
We declare bit values even when the flags that use them would not be
enabled by config options as this is simply clearer and clearly defines
what bit numbers are used for what, at no additional cost.
We declare a sparse-bitwise type vma_flag_t which ensures that users can't
pass around invalid VMA flags by accident and prepares for future work
towards VMA flags being a bitmap where we want to ensure bit values are
type safe.
To make life easier, we declare some macro helpers - DECLARE_VMA_BIT()
allows us to avoid duplication in the enum bit number declarations (and
maintaining the sparse __bitwise attribute), and INIT_VM_FLAG() is used to
assist with declaration of flags.
Unfortunately we can't declare both in the enum, as we run into issue with
logic in the kernel requiring that flags are preprocessor definitions, and
additionally we cannot have a macro which declares another macro so we
must define each flag macro directly.
Additionally, update the VMA userland testing vma_internal.h header to
include these changes.
We also have to fix the parameters to the vma_flag_*_atomic() functions
since VMA_MAYBE_GUARD_BIT is now of type vma_flag_t and sparse will
complain otherwise.
We have to update some rather silly if-deffery found in mm/task_mmu.c
which would otherwise break.
Finally, we update the rust binding helper as now it cannot auto-detect
the flags at all.
Link: https://lkml.kernel.org/r/cover.1764064556.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/3a35e5a0bcfa00e84af24cbafc0653e74deda64a.1764064556.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Alice Ryhl <aliceryhl@google.com> [rust]
Cc: Alex Gaynor <alex.gaynor@gmail.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Gregory Price <gourry@gourry.net>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
calculate_totalreserve_pages() currently finds the maximum
lowmem_reserve[j] for a zone by scanning the full forward range [j =
zone_idx .. MAX_NR_ZONES). However, for a given zone i, the
lowmem_reserve[j] array (for j > i) is naturally expected to form a
monotonically non-decreasing sequence in j, not as an implementation
detail, but as a consequence that naturally arises from the semantics of
lowmem_reserve[].
For zone "i", lowmem_reserve[j] expresses how many pages in zone i must
effectively be kept in reserve when deciding whether an allocation class
that may allocate from zones up to j is allowed to fall back into i. It
protects less flexible allocation classes (which cannot use higher zones)
from being starved by more flexible ones.
Viewed from this semantics, it is natural to expect a partial ordering in
j: as j increases, the allocation class gains access to a strictly larger
set of fallback zones. Therefore lowmem_reserve[j] is expected to be
monotonically non-decreasing in j: more flexible allocation classes must
not be allowed to deplete low zones more aggressively than less flexible
ones.
In other words, if lowmem_reserve[j] were ever observed to *decrease* as j
grows, that would be unexpected from the reserve semantics' point of view
and would likely indicate a semantic change or a misconfiguration.
The current implementation in setup_per_zone_lowmem_reserve() reflects
this policy by accumulating managed pages from higher zones and applying
the configured ratio, which results in a non-decreasing sequence. This
patch makes calculate_totalreserve_pages() rely on that monotonicity
explicitly and finds the maximum reserve value by scanning backward and
stopping at the first non-zero entry. This avoids unnecessary iteration
and reflects the conceptual model more directly. No functional behavior
changes.
To maintain this assumption explicitly, a comment is added next to
setup_per_zone_lowmem_reserve() documenting the monotonicity expectation
and noting that calculate_totalreserve_pages() relies on it.
Link: https://lkml.kernel.org/r/tencent_EB0FED91B01B1F8B6DAEE96719C5F5797F07@qq.com
Signed-off-by: fujunjie <fujunjie1@qq.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We have a colocation cluster used for deploying both offline and online
services simultaneously. In this environment, we encountered a
scenario where direct memory reclamation was triggered due to kswapd
not running.
1. When applications start up, rapidly consume memory, or experience
network traffic bursts, the kernel reaches steal_suitable_fallback(),
which sets watermark_boost and subsequently wakes kswapd.
2. In the core logic of kswapd thread (balance_pgdat()), when reclaim is
triggered by watermark_boost, the maximum priority is 10. Higher
priority values mean less aggressive LRU scanning, which can result in
no pages being reclaimed during a single scan cycle:
if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
raise_priority = false;
3. Additionally, many of our pods are configured with memory.low, which
prevents memory reclamation in certain cgroups, further increasing the
chance of failing to reclaim memory.
4. This eventually causes pgdat->kswapd_failures to continuously
accumulate, exceeding MAX_RECLAIM_RETRIES, and consequently kswapd
stops working. At this point, the system's available memory is still
significantly above the high watermark -- it's inappropriate for kswapd
to stop under these conditions.
The final observable issue is that a brief period of rapid memory
allocation causes kswapd to stop running, ultimately triggering direct
reclaim and making the applications unresponsive.
This problem leading to direct memory reclamation has been a
long-standing issue in our production environment. We initially held
the simple assumption that it was caused by applications allocating
memory too rapidly for kswapd to keep up with reclamation. However,
after we began monitoring kswapd's runtime behavior, we discovered a
different pattern:
kswapd initially exhibits very aggressive activity even when there is
still considerable free memory, but it subsequently stops running
entirely, even as memory levels approach the low watermark.
In summary, both boosted watermarks and memory.low increase the
probability of kswapd operation failures.
This patch specifically addresses the scenario involving boosted
watermarks by not incrementing kswapd_failures when reclamation fails.
A more general solution, potentially addressing memory.low or other
cases, requires further discussion.
Link: https://lkml.kernel.org/r/53de0b3ee0b822418e909db29bfa6513faff9d36@linux.dev
Link: https://lkml.kernel.org/r/20251024022711.382238-1-jiayuan.chen@linux.dev
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch to using the generic infrastructure to check for and handle pending
work before transitioning into guest mode.
xfer_to_guest_mode_handle_work() does a few more things than the current
code does when deciding whether or not to exit the __vcpu_run() loop. The
exittime tests from kvm-unit-tests, in my tests, were within a few percent
compared to before this series, which is within noise tolerance.
Co-developed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
[frankja@linux.ibm.com: Removed semicolon]
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
With time counter test, it is to verify that time count starts from 0
and always grows up then.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
This test case setup one-shot timer and execute idle instruction
immediately to indicate giving up CPU, hypervisor will emulate SW
hrtimer and wakeup vCPU when SW hrtimer is fired.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Add timer test case based on common arch_timer code, timer interrupt
with one-shot and period mode is tested.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
RISC-V config for v6.19
Spacemit:
The Spacemit k1 wants the freescale qspi driver enabled as a module
as they appear to be rather similar IPs.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'riscv-config-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
riscv: defconfig: enable SPI_FSL_QUADSPI as a module
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
standalone cache drivers for v6.19
ccache:
Add a compatible for the pic64gx SoC. No driver change needed, as it
falls back to the PolarFire SoC.
hisi hha/generic cpu cache maintenance:
Add support for a non-architectural mechanism for invalidating memory
regions, needed for some cxl implementations on arm64 (and probably
elsewhere in the future). The HiSilicon Hydra Home Agent is the first
driver to provide this support.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'cache-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
MAINTAINERS: refer to intended file in STANDALONE CACHE CONTROLLER DRIVERS
cache: Support cache maintenance for HiSilicon SoC Hydra Home Agent
cache: Make top level Kconfig menu a boolean dependent on RISCV
MAINTAINERS: Add Jonathan Cameron to drivers/cache and add lib/cache_maint.c + header
arm64: Select GENERIC_CPU_CACHE_MAINTENANCE
lib: Support ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION
memregion: Support fine grained invalidate by cpu_cache_invalidate_memregion()
memregion: Drop unused IORES_DESC_* parameter from cpu_cache_invalidate_memregion()
dt-bindings: cache: sifive,ccache0: add a pic64gx compatible
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
RISC-V soc-drivers for v6.19
Microchip:
Add bindings and mfd drivers for two syscon regions on PolarFire SoC,
needed as part of a rework of the devicetree to permit supporting,
among other things, pinctrl sanely and avoiding the "new" pic64gx SoC
ever using the original incorrect clock nodes. Fiddle with the Microchip
RISC-V MAINTAINERS entry to add these drivers and avoid branding it FPGA
only.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'soc-drivers-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
MAINTAINERS: rename Microchip RISC-V entry
MAINTAINERS: add new soc drivers to Microchip RISC-V entry
soc: microchip: add mfd drivers for two syscon regions on PolarFire SoC
dt-bindings: soc: microchip: document the simple-mfd syscon on PolarFire SoC
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Apple SoC driver updates for 6.18
Two small fixes:
- mailbox: Stop leaking a reference to the mbox platform device during
lookup
- sart: drop device reference after lookup since it's no longer used
afterwards
Signed-off-by: Sven Peter <sven@kernel.org>
* tag 'apple-soc-drivers-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/sven/linux:
soc: apple: sart: drop device reference after lookup
soc: apple: mailbox: fix device leak on lookup
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
FSL SOC Changes for 6.19
- A couple misc changes to fsl/qbman
- Update email address for Christophe Leroy in MAINTAINERS
* tag 'soc_fsl-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chleroy/linux:
soc: fsl: qbman: use kmalloc_array() instead of kmalloc()
soc: fsl: qbman: add WQ_PERCPU to alloc_workqueue users
MAINTAINERS: Update email address for Christophe Leroy
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reset/GPIO/swnode changes for v6.19
* Extend software node implementation, allowing its properties to reference
existing firmware nodes.
* Update the GPIO property interface to use reworked swnode macros.
* Rework reset-gpio code to use GPIO lookup via swnode.
* Fix spi-cs42l43 driver to work with swnode changes.
* tag 'reset-gpio-for-v6.19' of https://git.pengutronix.de/git/pza/linux:
reset: gpio: use software nodes to setup the GPIO lookup
reset: gpio: convert the driver to using the auxiliary bus
reset: make the provider of reset-gpios the parent of the reset device
reset: order includes alphabetically in reset/core.c
gpio: swnode: allow referencing GPIO chips by firmware nodes
spi: cs42l43: Use actual ACPI firmware node for chip selects
software node: allow referencing firmware nodes
software node: increase the reference of the swnode by its fwnode
software node: read the reference args via the fwnode API
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
RISC-V Devicetrees for v6.19
MAINTAINERS:
There's some re-jigging of things to reduce duplication, by moving me
into the StarFive entry and my tree into the Microchip one. The
other platforms that I look after (SiFive and Canaan) are marked as Odd
Fixes to better represent their status. Nothing functionally changes.
Microchip:
Add adc and mmc nodes for the Beagle-V Fire.
SiFive:
Add pwm fans to the unmatched board.
StarFive:
Add the Orange PI RV board, another VisionFive 2 derived SBC. This
required moving a mmc related nodes out of the common file, into
<board>.dts. Yet more things moved out of the common file when the
VisionFive 2 Lite boards were added, which use the JH7110S SoC instead of
the JH7110. The difference here between SoCs is just temperature and
frequency ranges, but the boards differ enough that the pool of common
nodes decreases a little further. There's an eMMC and an SD variant
here, that are different SKUs, bringing the total new StarFive boards to
three.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'riscv-dt-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
riscv: dts: starfive: add Orange Pi RV
dt-bindings: riscv: starfive: add xunlong,orangepi-rv
riscv: dts: starfive: Add VisionFive 2 Lite eMMC board device tree
riscv: dts: starfive: Add VisionFive 2 Lite board device tree
riscv: dts: starfive: Add common board dtsi for VisionFive 2 Lite variants
riscv: dts: starfive: jh7110-common: Move out some nodes to the board dts
dt-bindings: riscv: Add StarFive JH7110S SoC and VisionFive 2 Lite board
MAINTAINERS: degrade RISC-V MISC SOC SUPPORT to Odd Fixes
MAINTAINERS: add tree to RISC-V Microchip entry
MAINTAINERS: remove patchwork from RISC-V MISC SOC SUPPORT
MAINTAINERS: add Conor to StarFive entry
riscv: dts: sifive: unmatched: Add PWM controlled fans
riscv: dts: microchip: enable qspi adc/mmc-spi-slot on BeagleV Fire
dts: starfive: jh7110-common: split out mmc0 reset pins from common into boards
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The base address of JPEG decoder core 1 should start at 0x10000, and
have a size of 0x10000, i.e. it is right after core 0.
Instead the core has the same base address as core 0, and with a crazy
large size. This looks like a mixup of address and size cells when the
ranges were converted.
This causes the kernel to fail to register the second core due to sysfs
name conflicts:
sysfs: cannot create duplicate filename '/devices/platform/soc/soc:jpeg-decoder@1a040000/1a040000.jpgdec'
Fix up the address range.
Fixes: a9eac43d03 ("arm64: dts: mediatek: mt8195: Fix ranges for jpeg enc/decoder nodes")
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Acked-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Link: https://lore.kernel.org/r/20251127100044.612825-1-wenst@chromium.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Commit 8c3170628a ("wifi: brcmfmac: keep power during suspend if board
requires it") changed default behavior of the BRCMFMAC driver, which now
keeps SDIO card powered during system suspend to enable optional support
for WOWL. This feature is not supported by the legacy Exynos4 based
boards and leads to WLAN disfunction after system suspend/resume cycle.
Fix this by annotating SDIO host used by WLAN chip with
'cap-power-off-card' property, which should have been there from the
beginning.
Fixes: f77cbb9a3e ("ARM: dts: exynos: Add bcm4334 device node to Trats2")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20251126102618.3103517-5-m.szyprowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Commit 8c3170628a ("wifi: brcmfmac: keep power during suspend if board
requires it") changed default behavior of the BRCMFMAC driver, which now
keeps SDIO card powered during system suspend to enable optional support
for WOWL. This feature is not supported by the legacy Exynos4 based
boards and leads to WLAN disfunction after system suspend/resume cycle.
Fix this by annotating SDIO host used by WLAN chip with
'cap-power-off-card' property, which should have been there from the
beginning.
Fixes: a19f6efc01 ("ARM: dts: exynos: Enable WLAN support for the Trats board")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20251126102618.3103517-4-m.szyprowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Commit 8c3170628a ("wifi: brcmfmac: keep power during suspend if board
requires it") changed default behavior of the BRCMFMAC driver, which now
keeps SDIO card powered during system suspend to enable optional support
for WOWL. This feature is not supported by the legacy Exynos4 based
boards and leads to WLAN disfunction after system suspend/resume cycle.
Fix this by annotating SDIO host used by WLAN chip with
'cap-power-off-card' property, which should have been there from the
beginning.
Fixes: 8620cc2f99 ("ARM: dts: exynos: Add devicetree file for the Galaxy S2")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20251126102618.3103517-3-m.szyprowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Commit 8c3170628a ("wifi: brcmfmac: keep power during suspend if board
requires it") changed default behavior of the BRCMFMAC driver, which now
keeps SDIO card powered during system suspend to enable optional support
for WOWL. This feature is not supported by the legacy Exynos4 based
boards and leads to WLAN disfunction after system suspend/resume cycle.
Fix this by annotating SDIO host used by WLAN chip with
'cap-power-off-card' property, which should have been there from the
beginning.
Fixes: f1b0ffaa68 ("ARM: dts: exynos: Enable WLAN support for the UniversalC210 board")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20251126102618.3103517-2-m.szyprowski@samsung.com
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Commit 78b72897a5 ("soc: samsung: exynos-pmu: Enable CPU Idle for
gs101") added system wide suspend/resume callbacks to Exynos PMU driver,
but some items used by these callbacks are initialized only on
GS101-compatible boards. Move that initialization to exynos_pmu_probe()
to avoid potential lockdep warnings like below observed during system
suspend/resume cycle:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 0 UID: 0 PID: 2134 Comm: rtcwake Not tainted 6.18.0-rc7-next-20251126-00039-g1d656a1af243 #11794 PREEMPT
Hardware name: Samsung Exynos (Flattened Device Tree)
Call trace:
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x68/0x88
dump_stack_lvl from register_lock_class+0x970/0x988
register_lock_class from __lock_acquire+0xc8/0x29ec
__lock_acquire from lock_acquire+0x134/0x39c
lock_acquire from _raw_spin_lock+0x38/0x48
_raw_spin_lock from exynos_cpupm_suspend_noirq+0x18/0x34
exynos_cpupm_suspend_noirq from dpm_run_callback+0x98/0x2b8
dpm_run_callback from device_suspend_noirq+0x8c/0x310
Fixes: 78b72897a5 ("soc: samsung: exynos-pmu: Enable CPU Idle for gs101")
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://patch.msgid.link/20251126110038.3326768-1-m.szyprowski@samsung.com
[krzk: include calltrace into commit msg]
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Make do_proc_douintvec static and export proc_douintvec_conv wrapper
function for external use. This is to keep with the design in sysctl.c.
Update fs/pipe.c to use the new public API.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Create a converter for the pipe-max-size proc_handler using the
SYSCTL_UINT_CONV_CUSTOM. Move SYSCTL_CONV_IDENTITY macro to the sysctl
header to make it available for pipe size validation. Keep returning
-EINVAL when (val == 0) by using a range checking converter and setting
the minimal valid value (extern1) to SYSCTL_ONE. Keep round_pipe_size by
passing it as the operation for SYSCTL_USER_TO_KERN_INT_CONV.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move proc_doulongvec_ms_jiffies_minmax to kernel/time/jiffies.c. Create
a non static wrapper function proc_doulongvec_minmax_conv that
forwards the custom convmul and convdiv argument values to the internal
do_proc_doulongvec_minmax. Remove unused linux/times.h include from
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move integer jiffies converters (proc_dointvec{_,_ms_,_userhz_}jiffies
and proc_dointvec_ms_jiffies_minmax) to kernel/time/jiffies.c. Error
stubs for when CONFIG_PRCO_SYSCTL is not defined are not reproduced
because all the jiffies converters go through proc_dointvec_conv which
is already stubbed. This is part of the greater effort to move sysctl
logic out of kernel/sysctl.c thereby reducing merge conflicts in
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move SYSCTL_USER_TO_KERN_UINT_CONV and SYSCTL_UINT_CONV_CUSTOM macros to
include/linux/sysctl.h. No need to embed sysctl_kern_to_user_uint_conv
in a macro as it will not need a custom kernel pointer operation. This
is a preparation commit to enable jiffies converter creation outside
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move direction macros (SYSCTL_{USER_TO_KERN,KERN_TO_USER}) and the
integer converter macros (SYSCTL_{USER_TO_KERN,KERN_TO_USER}_INT_CONV,
SYSCTL_INT_CONV_CUSTOM) into include/linux/sysctl.h. This is a
preparation commit to enable jiffies converter creation outside
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
The new non-static proc_dointvec_conv forwards a custom converter
function to do_proc_dointvec from outside the sysctl scope. Rename the
do_proc_dointvec call points so any future changes to proc_dointvec_conv
are propagated in sysctl.c This is a preparation commit that allows the
integer jiffie converter functions to move out of kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Replace sysctl_user_to_kern_uint_conv function with
SYSCTL_USER_TO_KERN_UINT_CONV macro that accepts u_ptr_op parameter for
value transformation. Replacing sysctl_kern_to_user_uint_conv is not
needed as it will only be used from within sysctl.c. This is a
preparation commit for creating a custom converter in fs/pipe.c. No
Functional changes are intended.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Add k_ptr_range_check parameter to SYSCTL_UINT_CONV_CUSTOM macro to
enable range validation using table->extra1/extra2. Replace
do_proc_douintvec_minmax_conv with do_proc_uint_conv_minmax generated
by the updated macro.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Pass sysctl_{user_to_kern,kern_to_user}_uint_conv (unsigned integer
uni-directional converters) to the new SYSCTL_UINT_CONV_CUSTOM macro
to create do_proc_douintvec_conv's replacement (do_proc_uint_conv).
This is a preparation commit to use the unsigned integer converter from
outside sysctl. No functional change is intended.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Extend the SYSCTL_INT_CONV_CUSTOM macro with a k_ptr_range_check
parameter to conditionally generate range validation code. When enabled,
validation is done against table->extra1 (min) and table->extra2 (max)
bounds before assignment. Add base minmax and ms_jiffies_minmax
converter instances that utilize the range checking functionality.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
New SYSCTL_INT_CONV_CUSTOM macro creates "bi-directional" converters
from a user-to-kernel and a kernel-to-user functions. Replace integer
versions of do_proc_*_conv functions with the ones from the new macro.
Rename "_dointvec_" to just "_int_" as these converters are not applied
to vectors and the "do" is already in the name.
Move the USER_HZ validation directly into proc_dointvec_userhz_jiffies()
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Eight converter functions are created using two new macros
(SYSCTL_USER_TO_KERN_INT_CONV & SYSCTL_KERN_TO_USER_INT_CONV); they are
called from four pre-existing converter functions: do_proc_dointvec_conv
and do_proc_dointvec{,_userhz,_ms}_jiffies_conv. The function names
generated by the macros are differentiated by a string suffix passed as
the first macro argument.
The SYSCTL_USER_TO_KERN_INT_CONV macro first executes the u_ptr_op
operation, then checks for overflow, assigns sign (-, +) and finally
writes to the kernel var with WRITE_ONCE; it always returns an -EINVAL
when an overflow is detected. The SYSCTL_KERN_TO_USER_INT_CONV uses
READ_ONCE, casts to unsigned long, then executes the k_ptr_op before
assigning the value to the user space buffer.
The overflow check is always done against MAX_INT after applying
{k,u}_ptr_op. This approach avoids rounding or precision errors that
might occur when using the inverse operations.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Rename converter parameter to indicate data flow direction: "lvalp" to
"u_ptr" indicating a user space parsed value pointer. "valp" to "k_ptr"
indicating a kernel storage value pointer. This facilitates the
identification of discrepancies between direction (copy to kernel or
copy to user space) and the modified variable. This is a preparation
commit for when the converter functions are exposed to the rest of the
kernel.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Replace the "write" integer parameter with SYSCTL_USER_TO_KERN() and
SYSCTL_KERN_TO_USER() that clearly indicate data flow direction in
sysctl operations.
"write" originates in proc_sysctl.c (proc_sys_{read,write}) and can take
one of two values: "0" or "1" when called from proc_sys_read and
proc_sys_write respectively. When write has a value of zero, data is
"written" to a user space buffer from a kernel variable (usually
ctl_table->data). Whereas when write has a value greater than zero, data
is "written" to an internal kernel variable from a user space buffer.
Remove this ambiguity by introducing macros that clearly indicate the
direction of the "write".
The write mode names in sysctl_writes_mode are left unchanged as these
directly relate to the sysctl_write_strict file in /proc/sys where the
word "write" unambiguously refers to writing to a file.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Remove "__" from __do_proc_do{intvec,uintvec,ulongvec_minmax} internal
functions and delete their corresponding do_proc_do* wrappers. These
indirections are unnecessary as they do not add extra logic nor do they
indicate a layer separation.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Remove superfluous tbl_data param from do_proc_douintvec{,_r,_w}
and __do_proc_do{intvec,uintvec,ulongvec_minmax}. There is no need to
pass it as it is always contained within the ctl_table struct.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
* Replace void* data in the converter functions with a const struct
ctl_table* table as it was only getting forwarding values from
ctl_table->extra{1,2}.
* Remove the void* data in the do_proc_* functions as they already had a
pointer to the ctl_table.
* Remove min/max structures do_proc_do{uint,int}vec_minmax_conv_param;
the min/max values get passed directly in ctl_table.
* Keep min/max initialization in extra{1,2} in proc_dou8vec_minmax.
* The do_proc_douintvec was adjusted outside sysctl.c as it is exported
to fs/pipe.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Move enabling and disabling of interrupts around the SIE instruction to
entry code. Enabling interrupts only after the __TI_sie flag has been set
guarantees that the SIE instruction is not executed if an interrupt happens
between enabling interrupts and the execution of the SIE instruction.
Interrupt handlers and machine check handler forward the PSW to the
sie_exit label in such cases.
This is a prerequisite for VIRT_XFER_TO_GUEST_WORK to prevent that guest
context is entered when e.g. a scheduler IPI, indicating that a reschedule
is required, happens right before the SIE instruction, which could lead to
long delays.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Tested-by: Andrew Donnellan <ajd@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Add a signal_exits counter for s390, as exists on arm64, loongarch, mips,
powerpc, riscv and x86.
This is used by kvm_handle_signal_exit(), which we will use when we
later enable CONFIG_VIRT_XFER_TO_GUEST_WORK.
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Add some basic function interfaces such as CSR register access, local
irq enable or disable APIs.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
When system returns from exception with ertn instruction, PC comes from
LOONGARCH_CSR_ERA, and CSR.CRMD comes LOONGARCH_CSR_PRMD.
Here save CSR register CSR.ERA and CSR.PRMD into stack, and then restore
them from stack. So it can be modified by exception handlers in future.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
With in-kernel emulated eiointc driver, hardware register can be
accessed by different size, there is reg_u8/reg_u16/reg_u32/reg_u64
union type with EIOINTC register.
Here use 64-bit type with register definition and remove union type
since most registers are accessed with 64-bit method. And this makes
EIOINTC emulated driver simpler.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Check whether the host CPU supported AVEC, and save/restore CSR_MSGIS0-
CSR_MSGIS3 when necessary.
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Now VM PMU capability comes from host PMU capability directly, instead
bit 23 of HW GCFG CSR register also show PMU capability for VM. It will
be better if it comes from HW GCFG CSR register rather than just host
PMU capability, especially when LVZ feature is emulated in TCG mode, in
which case without PMU capability.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
The fuse_ilookup() function only sets *fm on the success path so this
"if (fm) {" NULL check doesn't work. The "fm" pointer is either
uninitialized or valid. Check the "inode" pointer instead.
Also, while it's not necessary, it is cleaner to move the iput(inode)
under the NULL check as well.
Fixes: 64becd224f ("fuse: new work queue to invalidate dentries from old epochs")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
When a request is terminated before it has been committed, the request
is not removed from the queue's list. This leaves a dangling list entry
that leads to list corruption and use-after-free issues.
Remove the request from the queue's list for terminated non-committed
requests.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Fixes: c090c8abae ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistency cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This continues the effort to refactor workqueue APIs, which began with
the introduction of new workqueues and a new alloc_workqueue flag in:
commit 128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
commit 930c2ea566 ("workqueue: Add new WQ_PERCPU flag")
This change adds a new WQ_PERCPU flag to explicitly request
alloc_workqueue() to be per-cpu when WQ_UNBOUND has not been specified.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20251107152950.293899-1-marco.crivellari@suse.com
Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
My address at csgroup.eu is redirected to the new one at
cs-soprasteria.com which is a Professionnal Microsoft account without
SMTP gateway. We still have the SMTP gateway for csgroup.eu but it is
not maintained anymore and might stop working at anytime. In addition
the DKIM signature is not performed allthough the domain has DMARC
set up.
Switch to kernel.org email address and add entries in mailmap.
Link: https://lore.kernel.org/r/d9b6758297d7dcddf79feb4459ceaedd7d6f1f2e.1764155757.git.chleroy@kernel.org
Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
KVM SVM changes for 6.19:
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
interrupt if the guest patched the underlying code after the VM-Exit, e.g.
when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
KVM VMX changes for 6.19:
- Use the root role from kvm_mmu_page to construct EPTPs instead of the
current vCPU state, partly as worthwhile cleanup, but mostly to pave the
way for tracking per-root TLB flushes so that KVM can elide EPT flushes on
pCPU migration if KVM has flushed the root at least once.
- Add a few missing nested consistency checks.
- Rip out support for doing "early" consistency checks via hardware as the
functionality hasn't been used in years and is no longer useful in general,
and replace it with an off-by-default module param to detected missed
consistency checks (i.e. WARN if hardware finds a check that KVM does not).
- Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
on VM-Enter.
- Misc cleanups.
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertantly trigger lock contention in the TDX-Module,
which KVM was either working around in weird, ugly ways, or was simply
oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever
selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if
creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
KVM x86 MMU changes for 6.19:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
caching is disabled, as there can't be any relevant SPTEs to zap.
- Relocate a misplace export.
The original addition of cache information for the Amlogic S922X SoC
used the wrong next-level cache node for CPU cores 100 and 101,
incorrectly referencing `l2_cache_l`. These cores actually belong to
the big cluster and should reference `l2_cache_b`. Update the device
tree accordingly.
Fixes: e7f85e6c15 ("arm64: dts: amlogic: Add cache information to the Amlogic S922X SoC")
Signed-off-by: Guillaume La Roque <glaroque@baylibre.com>
Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://patch.msgid.link/20251123-fixkhadas-v1-1-045348f0a4c2@baylibre.com
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
KVM selftests changes for 6.19:
- Fix a math goof in mmu_stress_test when running on a single-CPU system/VM.
- Forcefully override ARCH from x86_64 to x86 to play nice with specifying
ARCH=x86_64 on the command line.
- Extend a bunch of nested VMX to validate nested SVM as well.
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test to
verify KVM can save/restore nested VMX state when L1 is using 5-level
paging, but L2 is not.
- Clean up the guest paging code in anticipation of sharing the core logic for
nested EPT and nested NPT.
KVM x86 misc changes for 6.19:
- Fix an async #PF bug where KVM would clear the completion queue when the
guest transitioned in and out of paging mode, e.g. when handling an SMI and
then returning to paged mode via RSM.
- Fix a bug where TDX would effectively corrupt user-return MSR values if the
TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
- Leave the user-return notifier used to restore MSRs registered when
disabling virtualization, and instead pin kvm.ko. Restoring host MSRs via
IPI callback is either pointless (clean reboot) or dangerous (forced reboot)
since KVM has no idea what code it's interrupting.
- Use the checked version of {get,put}_user(), as Linus wants to kill them
off, and they're measurably faster on modern CPUs due to the unchecked
versions containing an LFENCE.
- Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
timers can result in a hard lockup in the host.
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NPT corrections.
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
behind CONFIG_CPU_MITIGATIONS.
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fastpath as
the only reason they were handled in the faspath was to paper of a bug in
the core #MC code that has long since been fixed.
- Add emulator support for AVX MOV instructions to play nice with emulated
devices whose PCI BARs guest drivers like to access with large multi-byte
instructions.
KVM guest_memfd changes for 6.19:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of
rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd
from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easer to develop and debug selftests like
those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
KVM generic changes for 6.19:
- Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
irqfd cleanup.
- Fix a goof in the dirty ring documentation.
Suzuki notices that making the ICH_HCR_EL2_TDIR capability a system
one isn't a very good idea, should we end-up with CPUs that have
asymmetric TDIR support (somehow unlikely, but you never know what
level of stupidity vendors are up to). For this hypothetical setup,
making this an "EARLY_LOCAL_CPU_FEATURE" is a much better option.
This is actually consistent with what we already do with GICv5
legacy interface, so flip the capability over.
Reported-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Fixes: 2a28810cbb ("KVM: arm64: GICv3: Detect and work around the lack of ICV_DIR_EL1 trapping")
Link: https://lore.kernel.org/r/5df713d4-8b79-4456-8fd1-707ca89a61b6@arm.com
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://msgid.link/20251125160144.1086511-1-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Orange Pi RV is a SBC based on the StarFive VisionFive 2 board.
Orange Pi RV features:
- StarFive JH7110 SoC
- GbE port connected to JH7110 GMAC0 via YT8531 PHY
- 4x USB ports via VL805 PCIe USB controller connected to JH7110 pcie0
- M.2 M-key slot connected to JH7110 pcie1
- HDMI video output
- 3.5mm audio output
- Ampak AP6256 SDIO Wi-Fi/Bluetooth module on mmc0
- microSD slot on mmc1
- SPI NOR flash memory
- 24c02 EEPROM (read only by default)
Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Signed-off-by: E Shattow <e@freeshell.de>
[conor: amend comment to say what's missing]
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Some node in this file are not used by the upcoming VisionFive 2 Lite
board. Move them to the board dts to prepare for adding the new
VisionFive 2 Lite device tree.
Tested-by: Matthias Brugger <mbrugger@suse.com>
Signed-off-by: Hal Feng <hal.feng@starfivetech.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Add device tree bindings for the StarFive JH7110S SoC
and the VisionFive 2 Lite board equipped with it.
JH7110S SoC is an industrial SoC which can run at -40~85 degrees centigrade
and up to 1.25GHz. Its CPU cores and peripherals are the same as
those of the JH7110 SoC.
VisionFive 2 Lite boards have MicroSD card version (default) and eMMC
version, which are called "VisionFive 2 Lite" and "VisionFive 2 Lite eMMC"
respectively.
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: Matthias Brugger <mbrugger@suse.com>
Reviewed-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
Signed-off-by: Hal Feng <hal.feng@starfivetech.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
The SiFive and Canaan platforms are not being actively looked after at
this point, but fixes for them would be applied if/when the patches
appeared. Since they're now the only things in the RISC-V MISC SOC
SUPPORT, mark them as Odd Fixes. I don't believe this is a functional
change, it just represents what's actually happening - particularly
since the Canaan k230 never built up enough steam to get merged and the
new SiFive demo chips have been done in partnership with with other
companies, e.g. Eswin, and will reside in their directories instead.
Reviewed-by: Paul Walmsley <pjw@kernel.org>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
In fairness to my own employer, lumping it in as "misc" is not quite
accurate when they do pay me to look after the platform. Move the tree
link for it to its entry, rather than having the RISC-V MISC SOC SUPPORT
entry cover it.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
I don't use the main riscv patchwork for anything to do with SoCs,
remove them from here to avoid confusion.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
I apply the patches for StarFive devicetrees, add me to the entry along
with my tree location etc. This is not a functional change, as this info
was in the "RISC-V MISC" entry but I'd rather not have the duplication
of entries covering the StarFive directory.
Acked-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
MediaTek soc driver updates
This adds socinfo entries for MT8189 Kompanio 540, an extra entry
for a variant of MT8391 (AV/AZA) Genio 720 SoC, and support for
the PMIC Wrapper (by adding a compatible string) in MT8189.
* tag 'mtk-soc-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/mediatek/linux:
dt-bindings: soc: mediatek: pwrap: Add compatible for MT8189 SoC
soc: mediatek: mtk-socinfo: Add entry for MT8391AV/AZA Genio 720
soc: mediatek: mtk-socinfo: Add extra entry for MT8189
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reset controller updates for v6.19
* Add support for LAN969x, eic770 and RZ/G3S reset controllers,
for the RZ/G3S USB-PHY reset controller, and for the remaining
TH1520 reset controllers.
* Drop legacy reset control lookup code.
* Include linux/bits.h from linux/reset.h to make it self-contained.
* tag 'reset-for-v6.19' of https://git.pengutronix.de/git/pza/linux:
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
STM32 Firewall bus for v6.19, round 1
Highlights:
----------
The STM32MP21x platforms have a slightly different RIFSC. Add support
for these platforms.
Also, the RIF is a complex firewall framework which can be tricky
to debug. To facilitate the latter, add a debugfs entry that can
be used to display the whole RIFSC firewall configuration at runtime.
* tag 'stm32-bus-firewall-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32:
bus: rifsc: add debugfs entry to dump the firewall configuration
dt-bindings: bus: add stm32mp21 RIFSC compatible
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Some additional sane defaults for the oldish rk3368 soc.
* tag 'v6.19-rockchip-drivers1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
soc: rockchip: grf: Add select correct PWM implementation on RK3368
soc: rockchip: grf: Set pwm2/xin32k pad default to xin32k for rk3368
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Qualcomm driver updates for v6.19
Support for hardware-keymanager v1 support for wrapped keys is introduce
in the ICE driver.
Support for the new Kaanapali mobile platform is added to last-level
cache controller, pd-mapper, and UBWC drivers.
UBWC driver gains support for the Monaco and Glymur platforms.
The PMIC GLINK driver is extended to handle the differences found in
targets where the related firmware runs on the SoCCP.
Support for running on targets without initialized SMEM is provided, by
reworking the SMEM driver to differentiate between "not yet probed" and
"probed but there was no SMEM". An unwanted WARN_ON() that triggered if
clients asked for a SMEM item beyond the currently running system's
limit, was removed, to allow new use cases to gracefully fail on old
targets.
The Qualcomm socinfo driver is extended with support for version 20
through 23 and support for providing version information about more than
32 remote processors. Identifiers for QCS6490 and SM8850 are also added.
Additionally, a number of smaller bug fixes and cleanups in PBS, OCMEM,
GSBI, TZMEM, and MDT-loader are included.
* tag 'qcom-drivers-for-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (31 commits)
soc: qcom: mdt_loader: rename 'firmware' parameter of qcom_mdt_load()
soc: qcom: mdt_loader: merge __qcom_mdt_load() and qcom_mdt_load_no_init()
soc: qcom: socinfo: Add reserve field to support future extension
soc: qcom: socinfo: Add support for new fields in revision 20
dt-bindings: firmware: qcom,scm: Document SCM on Kaanapali SOC
soc: qcom: socinfo: add support to extract more than 32 image versions
soc: qcom: smem: drop the WARN_ON() on SMEM item validation
soc: qcom: ubwc: Add config for Kaanapali
soc: qcom: socinfo: Add SoC ID for QCS6490
dt-bindings: arm: qcom,ids: Add SoC ID for QCS6490
soc: qcom: ice: Add HWKM v1 support for wrapped keys
soc: qcom: smem: better track SMEM uninitialized state
err.h: add INIT_ERR_PTR() macro
soc: qcom: smem: fix hwspinlock resource leak in probe error paths
dt-bindings: soc: qcom,aoss-qmp: Document the Glymur AOSS side channel
dt-bindings: soc: qcom,aoss-qmp: Document the Kaanapali AOSS channel
soc: qcom: ubwc: Add QCS8300 UBWC cfg
dt-bindings: firmware: qcom,scm: Document Glymur scm
soc: qcom: socinfo: Add SM8850 SoC ID
dt-bindings: arm: qcom,ids: Add SoC ID for SM8850
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
ti-sysc: allow OMAP2 and OMAP4 timers to be reserved on AM33xx
* tag 'omap-for-v6.19/drivers-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/khilman/linux-omap:
ti-sysc: allow OMAP2 and OMAP4 timers to be reserved on AM33xx
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
TI SoC driver updates for v6.19
- ti_sci: Add Partial-IO poweroff support and sys_off handler integration
- ti_sci: Gate IO isolation programming on firmware capability flag
- ti_sci: cleanup by replacing ifdeffery in PM ops with pm_sleep_ptr() macro
* tag 'ti-driver-soc-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux:
firmware: ti_sci: Partial-IO support
firmware: ti_sci: Support transfers without response
firmware: ti_sci: Set IO Isolation only if the firmware is capable
firmware: ti_sci: Replace ifdeffery by pm_sleep_ptr() macro
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
i.MX drivers update for 6.19:
- A series from Peng Fan to to improve i.MX SCU firmware drivers
* tag 'imx-drivers-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
firmware: imx: scu: Use devm_mutex_init
firmware: imx: scu: Suppress bind attrs
firmware: imx: scu: Update error code
firmware: imx: scu-irq: Remove unused export of imx_scu_enable_general_irq_channel
firmware: imx: scu-irq: Set mu_resource_id before get handle
firmware: imx: scu-irq: Init workqueue before request mbox channel
firmware: imx: scu-irq: Free mailbox client on failure at imx_scu_enable_general_irq_channel()
firmware: imx: scu-irq: fix OF node leak in
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Allwinner driver changes for 6.19
Just one cleanup change that is part of tree wide cleanup of redundant
pm_runtime_mark_last_busy() calls.
* tag 'sunxi-drivers-for-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
bus: sunxi-rsb: Remove redundant pm_runtime_mark_last_busy() calls
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
soc/tegra: Changes for v6.19-rc1
A couple of small fixes across the board: ACPI support on FUSE no longer
exposes duplicate SoC information, speedo IDs for Tegra210 are updated,
some comments see typo fixes or kerneldoc additions. Finally, support
for USB wake events is added on Tegra234, which allow these systems to
resume from suspend on USB activity.
* tag 'tegra-for-6.19-soc' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
soc/tegra: pmc: Add USB wake events for Tegra234
soc/tegra: pmc: Document tegra_pmc.syscore field
soc/tegra: pmc: Don't fail if "aotag" is not present
soc/tegra: fuse: speedo-tegra210: Add SoC speedo 2
soc/tegra: fuse: speedo-tegra210: Update speedo IDs
soc/tegra: Resolve a spelling error in the tegra194-cbb.c
soc/tegra: fuse: Do not register SoC device on ACPI boot
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
syscore: Changes for v6.19-rc1
Add a parameter to syscore operations to allow passing contextual data,
which in turn enables refactoring of drivers to make them independent of
global data. This initially only contains the API changes along with the
updates for existing drivers. Subsequent work will make use of this to
improve drivers.
* tag 'tegra-for-6.19-syscore' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
syscore: Pass context data to callbacks
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
amba: Fixes for v6.19-rc1
Fix a device leak. Could go into v6.18 as a fix, but since this problem
has existed for a long time and nobody has reported it before it doesn't
seem critical enough and sufficient to get it into 6.19 and then
backported.
* tag 'tegra-for-6.19-core' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
amba: tegra-ahb: Fix device leak on SMMU enable
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Samsung SoC drivers for v6.19
1. ChipID driver: Add support for identifying Exynos8890 and Exynos9610.
2. PMU driver: Allow specifying list of valid registers for the custom
regmap used on Google GS101 SoC. The PMU (Power Management Unit) on
that SoC uses more complex access to registers than simple MMIO and
invalid registers trigger aborts halting the system.
3. Few minor cleanups.
4. Several new bindings for compatible devices.
* tag 'samsung-drivers-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
dt-bindings: soc: samsung: exynos-pmu: allow mipi-phy subnode for Exynos7870 PMU
soc: samsung: exynos-chipid: use a local dev variable
dt-bindings: soc: samsung: exynos-sysreg: add gs101 hsi0 and misc compatibles
dt-bindings: soc: samsung: exynos-sysreg: add power-domains
soc: samsung: gs101-pmu: implement access tables for read and write
soc: samsung: exynos-pmu: move some gs101 related code into new file
soc: samsung: exynos-pmu: allow specifying read & write access tables for secure regmap
dt-bindings: samsung: exynos-sysreg: add exynos7870 sysregs
soc: samsung: exynos-chipid: add exynos8890 SoC support
dt-bindings: hwinfo: samsung,exynos-chipid: add exynos8890-chipid compatible
dt-bindings: soc: samsung: exynos-pmu: add exynos8890 compatible
soc: samsung: exynos-pmu: Annotate online/offline functions with __must_hold
soc: samsung: exynos-chipid: Add exynos9610 SoC support
dt-bindings: hwinfo: samsung,exynos-chipid: add exynos9610 compatible
dt-bindings: soc: samsung: exynos-sysreg: Add Exynos990 PERIC0/1 compatibles
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
In order to work around the existence of a vmap symbol in libpcap, the
UML makefile unconditionally redefines vmap to kernel_vmap. However,
this not only affects the actual vmap symbol, but also anything else
named vmap, including a number of struct members in DRM.
This would not be too much of a problem, since all uses are also
updated, except we now have Rust DRM bindings, which expect the
corresponding Rust structs to have 'vmap' names. Since the redefinition
applies in bindgen, but not to Rust code, we end up with errors such as:
error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap`
--> rust/kernel/drm/gem/mod.rs:210:9
Since libpcap support was removed in commit 12b8e7e69a ("um: Remove
obsolete pcap driver"), remove the, now unnecessary, define as well.
We also take this opportunity to update the comment.
Signed-off-by: David Gow <davidgow@google.com>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://patch.msgid.link/20251122083213.3996586-1-davidgow@google.com
Fixes: 12b8e7e69a ("um: Remove obsolete pcap driver")
[adjust commmit message a bit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Memory controller drivers for v6.19
1. Tegra drivers: Several cleanups (dev_err_probe(), error messages).
2. Renesas RPC IF: Add system suspend support.
* tag 'memory-controller-drv-6.19-2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl:
memory: tegra186-emc: Fix missing put_bpmp
memory: renesas-rpc-if: Add suspend/resume support
memory: tegra30-emc: Add the SoC model prefix to functions
memory: tegra20-emc: Add the SoC model prefix to functions
memory: tegra186-emc: Add the SoC model prefix to functions
memory: tegra124-emc: Add the SoC model prefix to functions
memory: tegra124-emc: Simplify and handle deferred probe with dev_err_probe()
memory: tegra186-emc: Simplify and handle deferred probe with dev_err_probe()
memory: tegra20-emc: Simplify and handle deferred probe with dev_err_probe()
memory: tegra30-emc: Simplify and handle deferred probe with dev_err_probe()
memory: tegra30-emc: Do not print error on icc_node_create() failure
memory: tegra20-emc: Do not print error on icc_node_create() failure
memory: tegra186-emc: Do not print error on icc_node_create() failure
memory: tegra124-emc: Do not print error on icc_node_create() failure
memory: tegra124-emc: Simplify return of emc_init()
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Renesas driver updates for v6.19
- Keep the WDTRSTCR.RESBAR2S bit in the default state on R-Car Gen4.
* tag 'renesas-drivers-for-v6.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel:
soc: renesas: rcar-rst: Keep RESBAR2S in default state
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
In kvm_vcpu_on_spin(), the loop counter 'i' is incorrectly written to
last_boosted_vcpu instead of the actual vCPU index 'idx'. This causes
last_boosted_vcpu to store the loop iteration count rather than the
vCPU index, leading to incorrect round-robin behavior in subsequent
directed yield operations.
Fix this by using 'idx' instead of 'i' in the assignment.
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20251110033232.12538-7-kernellwp@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
New boards: QNAP TS233 (2-bay variant of the RK3568 NAS series) and
Asus Tinkerboard 3 + 3S.
Additional peripherals enabled on 100ASK DshanPi A1, Orange Pi 3B,
Indiedroid Nova, QNAP-TSx33 series + LED states on Radxa boards,
power-domains for the previously added RK3368 display components.
* tag 'v6.19-rockchip-dts64-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: (22 commits)
arm64: dts: rockchip: enable RTC for 100ASK DshanPi A1
arm64: dts: rockchip: enable USB for 100ASK DshanPi A1
arm64: dts: rockchip: enable button for 100ASK DshanPi A1
arm64: dts: rockchip: add mmc aliases for 100ASK DshanPi A1
arm64: dts: rockchip: remove mmc max-frequency for 100ASK DshanPi A1
arm64: dts: rockchip: Enable i2c2 on Orange Pi 3B
arm64: dts: rockchip: Use default-state for power LED for Radxa boards
arm64: dts: rockchip: fix PCIe 3.3V regulator voltage on 9Tripod X3568 v4
arm64: dts: rockchip: Add power-domain to RK3368 VOP controller
arm64: dts: rockchip: Add power-domain to RK3368 DSI controller
arm64: dts: rockchip: Add host wake pin for wifi on Indiedroid Nova
arm64: dts: rockchip: Correct pinctrl for pcie for Indiedroid Nova
arm64: dts: rockchip: Define regulator for pcie2x1l2 on Indiedroid Nova
arm64: dts: rockchip: Add clk32k_in for Indiedroid Nova
arm64: dts: rockchip: Add Asus Tinker Board 3 and 3S device tree
dt-bindings: arm: rockchip: Add Asus Tinker Board 3/3S
dt-bindings: arm: rockchip: merge Asus Tinker and Tinker S
arm64: dts: rockchip: add QNAP TS233 devicetree
dt-bindings: arm: rockchip: add TS233 to RK3568-based QNAP NAS devices
arm64: dts: rockchip: move common qnap tsx33 parts to dtsi
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The various "syscon" nodes in SC9860 are only referenced by clock
provider nodes in a 1:1 relationship, and nothing else references the
"syscon" nodes. There's no apparent reason for this split. The 2 nodes
can simply be merged into 1 node. The clock driver has supported using
either "reg" or "sprd,syscon" to access registers from the start, so
there shouldn't be any compatibility issues.
With this, DT schema warnings for missing a specific compatible with
"syscon" and non-MMIO devices on "simple-bus" are fixed.
Reviewed-by: Chunyan Zhang <zhang.lyra@gmail.com>
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20251124210031.767382-2-robh@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Patch series "mm: swap: small fixes and comment cleanups", v2.
This series provides a few small fixes and cleanups for the swap code.
The first patch fixes a memory leak in an error path that was recently
introduced. The subsequent patches include minor logic adjustments and
the removal of redundant comments.
This patch (of 5):
setup_clusters() could leak 'cluster_info' memory if an error occurred on
a path that did not jump to the 'err_free' label.
This patch simplifies the error handling by removing the goto label and
instead calling free_cluster_info() on all error exit paths.
The new logic is safe, as free_cluster_info() already handles NULL pointer
inputs.
Link: https://lkml.kernel.org/r/20251031065011.40863-1-youngjun.park@lge.com
Link: https://lkml.kernel.org/r/20251031065011.40863-2-youngjun.park@lge.com
Fixes: 07adc4cf1e ("mm, swap: implement dynamic allocation of swap table")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When the page size exceeds 4KB, if bd_wb_limit is set to a value that is
not aligned with the page size, it will cause a numerical wrap-around
issue for bd_wb_limit. For example, when the page size is set to 16KB and
bd_wb_limit is set to 3, after one write-back operation, the value of
bd_wb_limit will become -1. More seriously, since bd_wb_limit is an
unsigned number, its value may become as large as 2^64 - 1.
The core reason for this problem is that the unit of bd_wb_limit is 4KB.
For example, when a write-back occurs on a system with a page size of
16KB, 4 needs to be subtracted from bd_wb_limit. This operation takes
place in the zram_account_writeback_submit function.
This patch fixes the issue by limiting bd_wb_limit to be an integer
multiple of PAGE_SIZE / 4096.
Link: https://lkml.kernel.org/r/tencent_5936CFE72BAB2BA76887BB69DCC1B5E67C05@qq.com
Fixes: 1d69a3f8ae ("zram: idle writeback fixes and cleanup")
Signed-off-by: Yuwen Chen <ywen.chen@foxmail.com>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
commit 97f0b13452 ("tracing: add trace event for
memory-failure") introduces the selection of RAS in memory-failure. This
commit is just a tracing feature; in reality, there is no dependency
between memory-failure and RAS. RAS increases the size of the bzImage
image by 8k, which is very valuable for embedded devices.
Move the memory-failure traceing code from ras_event.h to
memory-failure.h and remove the selection of RAS.
Link: https://lkml.kernel.org/r/20251119095943.67125-1-xieyuanbin1@huawei.com
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "make vmalloc gfp flags usage more apparent", v4.
We should do a better job at enforcing gfp flags for vmalloc. Right now,
we have a kernel-doc for __vmalloc_node_range(), and hope callers pass in
supported flags. If a caller were to pass in an unsupported flag, we may
BUG, silently clear it, or completely ignore it.
If we are more proactive about enforcing gfp flags, we can making sure
callers know when they may be asking for unsupported behavior.
This patchset lets vmalloc control the incoming gfp flags, and cleans up
some hard to read gfp code.
This patch (of 4):
Vmalloc explicitly supports a list of flags, but we never enforce them.
vmalloc has been trying to handle unsupported flags by clearing and
setting flags wherever necessary. This is messy and makes the code harder
to understand, when we could simply check for a supported input
immediately instead.
Define a helper mask and function telling callers they have passed in
invalid flags, and clear those unsupported vmalloc flags.
Link: https://lkml.kernel.org/r/20251121094405.40628-1-vishal.moola@gmail.com
Link: https://lkml.kernel.org/r/20251121094405.40628-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Acked-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "memcg: cleanup the memcg stats interfaces".
The memcg stats are safe against irq (and nmi) context and thus does not
require disabling irqs. However for some stats which are also maintained
at node level, it is using irq unsafe interface and thus requiring the
users to still disables irqs or use interfaces which explicitly disables
irqs. Let's move memcg code to use irq safe node level stats function
which is already optimized for architectures with HAVE_CMPXCHG_LOCAL (all
major ones), so there will not be any performance penalty for its usage.
This patch (of 4):
The memcg stats are safe against irq (and nmi) context and thus does not
require disabling irqs. However some code paths for memcg stats also
update the node level stats and use irq unsafe interface and thus require
the users to disable irqs. However node level stats, on architectures
with HAVE_CMPXCHG_LOCAL (all major ones), has interface which does not
require irq disabling. Let's move memcg stats code to start using that
interface for node level stats.
Link: https://lkml.kernel.org/r/20251110232008.1352063-1-shakeel.butt@linux.dev
Link: https://lkml.kernel.org/r/20251110232008.1352063-2-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
First, writeback bdev ->bitmap bits are set only from one context, as we
can have only one single task performing writeback, so we cannot race with
anything else. Remove retry path.
Second, we always check ZRAM_WB flag to distinguish writtenback slots, so
we should not confuse 0 bdev block index and 0 handle. We can use first
bdev block (0 bit) for writeback as well.
While at it, give functions slightly more accurate names, as we don't
alloc/free anything there, we reserve a block for async writeback or
release the block.
Link: https://lkml.kernel.org/r/20251122074029.3948921-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Richard Chang <richardycc@google.com>
Cc: Yuwen Chen <ywen.chen@foxmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zram: introduce writeback bio batching", v6.
As writeback is becoming more and more common the longstanding limitations
of zram writeback throughput are becoming more visible. Introduce
writeback bio batching so that multiple writeback bios can be processed
simultaneously.
This patch (of 6):
As was stated in a comment [1] a single page writeback IO is not
efficient, but it works. It's time to address this throughput limitation
as writeback becomes used more often. Introduce batched (multiple) bio
writeback support to take advantage of parallel requests processing and
better requests scheduling.
Approach used in this patch doesn't use a dedicated kthread like in [2],
or blk-plug like in [3]. Dedicated kthread adds complexity, which can be
avoided. Apart from that not all zram setups use writeback, so having
numerous per-device kthreads (on systems that create multiple zram
devices) hanging around is not the most optimal thing to do. blk-plug, on
the other hand, works best when request are sequential, which doesn't
particularly fit zram writebck IO patterns: zram writeback IO patterns are
expected to be random, due to how bdev block reservation/release are
handled. blk-plug approach also works in cycles: idle IO, when zram sets
up requests in a batch, is followed by bursts of IO, when zram submits the
entire batch.
Instead we use a batch of requests and submit new bio as soon as one of
the in-flight requests completes.
For the time being the writeback batch size (maximum number of in-flight
bio requests) is set to 32 for all devices. A follow up patch adds a
writeback_batch_size device attribute, so the batch size becomes run-time
configurable.
Link: https://lkml.kernel.org/r/20251122074029.3948921-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20251122074029.3948921-2-senozhatsky@chromium.org
Link: https://lore.kernel.org/all/20181203024045.153534-6-minchan@kernel.org/ [1]
Link: https://lore.kernel.org/all/20250731064949.1690732-1-richardycc@google.com/ [2]
Link: https://lore.kernel.org/all/tencent_78FC2C4FE16BA1EBAF0897DB60FCD675ED05@qq.com/ [3]
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Co-developed-by: Yuwen Chen <ywen.chen@foxmail.com>
Co-developed-by: Richard Chang <richardycc@google.com>
Suggested-by: Minchan Kim <minchan@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The maintenance of the folio->_deferred_list is intricate because it's
reused in a local list.
Here are some peculiarities:
1) When a folio is removed from its split queue and added to a local
on-stack list in deferred_split_scan(), the ->split_queue_len isn't
updated, leading to an inconsistency between it and the actual
number of folios in the split queue.
2) When the folio is split via split_folio() later, it's removed from
the local list while holding the split queue lock. At this time,
the lock is not needed as it is not protecting anything.
3) To handle the race condition with a third-party freeing or migrating
the preceding folio, we must ensure there's always one safe (with
raised refcount) folio before by delaying its folio_put(). More
details can be found in commit e66f3185fa ("mm/thp: fix deferred
split queue not partially_mapped"). It's rather tricky.
We can use the folio_batch infrastructure to handle this clearly. In this
case, ->split_queue_len will be consistent with the real number of folios
in the split queue. If list_empty(&folio->_deferred_list) returns false,
it's clear the folio must be in its split queue (not in a local list
anymore).
In the future, we will reparent LRU folios during memcg offline to
eliminate dying memory cgroups, which requires reparenting the split queue
to its parent first. So this patch prepares for using
folio_split_queue_lock_irqsave() as the memcg may change then.
Link: https://lkml.kernel.org/r/59cb6b6fb5ffcff9d23b81890b252960139ad8e7.1762762324.git.zhengqi.arch@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nico Pache <npache@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel maintains leaf page table entries which contain either:
The kernel maintains leaf page table entries which contain either:
- Nothing ('none' entries)
- Present entries*
- Everything else that will cause a fault which the kernel handles
* Present entries are either entries the hardware can navigate without page
fault or special cases like NUMA hint protnone or PMD with cleared
present bit which contain hardware-valid entries modulo the present bit.
In the 'everything else' group we include swap entries, but we also
include a number of other things such as migration entries, device private
entries and marker entries.
Unfortunately this 'everything else' group expresses everything through a
swp_entry_t type, and these entries are referred to swap entries even
though they may well not contain a... swap entry.
This is compounded by the rather mind-boggling concept of a non-swap swap
entry (checked via non_swap_entry()) and the means by which we twist and
turn to satisfy this.
This patch lays the foundation for reducing this confusion.
We refer to 'everything else' as a 'software-define leaf entry' or
'softleaf'. for short And in fact we scoop up the 'none' entries into
this concept also so we are left with:
- Present entries.
- Softleaf entries (which may be empty).
This allows for radical simplification across the board - one can simply
convert any leaf page table entry to a leaf entry via softleaf_from_pte().
If the entry is present, we return an empty leaf entry, so it is assumed
the caller is aware that they must differentiate between the two
categories of page table entries, checking for the former via
pte_present().
As a result, we can eliminate a number of places where we would otherwise
need to use predicates to see if we can proceed with leaf page table entry
conversion and instead just go ahead and do it unconditionally.
We do so where we can, adjusting surrounding logic as necessary to
integrate the new softleaf_t logic as far as seems reasonable at this
stage.
We typedef swp_entry_t to softleaf_t for the time being until the
conversion can be complete, meaning everything remains compatible
regardless of which type is used. We will eventually remove swp_entry_t
when the conversion is complete.
We introduce a new header file to keep things clear - leafops.h - this
imports swapops.h so can direct replace swapops imports without issue, and
we do so in all the files that require it.
Additionally, add new leafops.h file to core mm maintainers entry.
Link: https://lkml.kernel.org/r/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: remove is_swap_[pte, pmd]() + non-swap entries,
introduce leaf entries", v3.
There's an established convention in the kernel that we treat leaf page
tables (so far at the PTE, PMD level) as containing 'swap entries' should
they be neither empty (i.e. p**_none() evaluating true) nor present (i.e.
p**_present() evaluating true).
However, at the same time we also have helper predicates - is_swap_pte(),
is_swap_pmd() - which are inconsistently used.
This is problematic, as it is logical to assume that should somebody wish
to operate upon a page table swap entry they should first check to see if
it is in fact one.
It also implies that perhaps, in future, we might introduce a non-present,
none page table entry that is not a swap entry.
This series resolves this issue by systematically eliminating all use of
the is_swap_pte() and is swap_pmd() predicates so we retain only the
convention that should a leaf page table entry be neither none nor present
it is a swap entry.
We also have the further issue that 'swap entry' is unfortunately a really
rather overloaded term and in fact refers to both entries for swap and for
other information such as migration entries, page table markers, and
device private entries.
We therefore have the rather 'unique' concept of a 'non-swap' swap entry.
This series therefore introduces the concept of 'software leaf entries',
of type softleaf_t, to eliminate this confusion.
A software leaf entry in this sense is any page table entry which is
non-present, and represented by the softleaf_t type. That is - page table
leaf entries which are software-controlled by the kernel.
This includes 'none' or empty entries, which are simply represented by an
zero leaf entry value.
In order to maintain compatibility as we transition the kernel to this new
type, we simply typedef swp_entry_t to softleaf_t.
We introduce a number of predicates and helpers to interact with software
leaf entries in include/linux/leafops.h which, as it imports swapops.h,
can be treated as a drop-in replacement for swapops.h wherever leaf entry
helpers are used.
Since softleaf_from_[pte, pmd]() treats present entries as they were
empty/none leaf entries, this allows for a great deal of simplification of
code throughout the code base, which this series utilises a great deal.
We additionally change from swap entry to software leaf entry handling
where it makes sense to and eliminate functions from swapops.h where
software leaf entries obviate the need for the functions.
This patch (of 16):
PTE markers were previously only concerned with UFFD-specific logic - that
is, PTE entries with the UFFD WP marker set or those marked via
UFFDIO_POISON.
However since the introduction of guard markers in commit 7c53dfbdb0
("mm: add PTE_MARKER_GUARD PTE marker"), this has no longer been the case.
Issues have been avoided as guard regions are not permitted in conjunction
with UFFD, but it still leaves very confusing logic in place, most notably
the misleading and poorly named pte_none_mostly() and
huge_pte_none_mostly().
This predicate returns true for PTE entries that ought to be treated as
none, but only in certain circumstances, and on the assumption we are
dealing with H/W poison markers or UFFD WP markers.
This patch removes these functions and makes each invocation of these
functions instead explicitly check what it needs to check.
As part of this effort it introduces is_uffd_pte_marker() to explicitly
determine if a marker in fact is used as part of UFFD or not.
In the HMM logic we note that the only time we would need to check for a
fault is in the case of a UFFD WP marker, otherwise we simply encounter a
fault error (VM_FAULT_HWPOISON for H/W poisoned marker, VM_FAULT_SIGSEGV
for a guard marker), so only check for the UFFD WP case.
While we're here we also refactor code to make it easier to understand.
[akpm@linux-foundation.org: fix comment typo, per Mike]
Link: https://lkml.kernel.org/r/cover.1762812360.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/c38625fd9a1c1f1cf64ae8a248858e45b3dcdf11.1762812360.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mathew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
uniform_split_supported() and non_uniform_split_supported() share
significantly similar logic.
The only functional difference is that uniform_split_supported() includes
an additional check on the requested @new_order.
The reason for this check comes from the following two aspects:
* some file system or swap cache just supports order-0 folio
* the behavioral difference between uniform/non-uniform split
The behavioral difference between uniform split and non-uniform:
* uniform split splits folio directly to @new_order
* non-uniform split creates after-split folios with orders from
folio_order(folio) - 1 to new_order.
This means for non-uniform split or !new_order split we should check the
file system and swap cache respectively.
This commit unifies the logic and merge the two functions into a single
combined helper, removing redundant code and simplifying the split
support checking mechanism.
Link: https://lkml.kernel.org/r/20251106034155.21398-3-richard.weiyang@gmail.com
Fixes: c010d47f10 ("mm: thp: split huge page to any lower order pages")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/huge_memory: Define split_type and consolidate split
support checks", v3.
This two-patch series focuses on improving code clarity and removing
redundancy in the huge memory handling logic related to folio splitting.
The series is based on an original proposal to merge two significantly
identical functions that check folio split support[1]. During this
process, we found an opportunity to improve readability by explicitly
defining the split types.
Patch 1: define split_type and use it
Patch 2: merge uniform_split_supported() and non_uniform_split_supported()
This patch (of 2):
We currently handle two distinct types of large folio splitting:
* uniform split
* non-uniform split
Differentiating between these types using a simple boolean variable is not
obvious and can harm code readability.
This commit introduces enum split_type to explicitly define these two
types. Replacing the existing boolean variable with this enumeration
significantly improves code clarity and expressiveness when dealing with
folio splitting logic.
No functional change is expected.
[akpm@linux-foundation.org: tweak layout, per David]
Link: https://lkml.kernel.org/r/20251106034155.21398-1-richard.weiyang@gmail.com
Link: https://lkml.kernel.org/r/20251106034155.21398-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: "David Hildenbrand (Red Hat)" <david@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
dmirror_device_init() calls device_initialize() which sets the device
reference count to 1, but fails to call put_device() when error occurs
after dev_set_name() or cdev_device_add() failures. This results in
memory leaks of struct device objects. Additionally,
dmirror_device_remove() lacks the final put_device() call to properly
release the device reference.
Found by code review.
Link: https://lkml.kernel.org/r/20251108115346.6368-1-make24@iscas.ac.cn
Fixes: 6a760f58c7 ("mm/hmm/test: use char dev with struct device to get device node")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Cc: Haoxiang Li <make24@iscas.ac.cn>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Mika Penttilä <mpenttil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Optimize folio split in memory failure", v5.
This patchset optimizes folio split operations in memory failure code by
always splitting a folio to min_order_for_split() to minimize unusable
pages, even if min_order_for_split() is non zero and memory failure code
would take the failed path eventually for a successfully split folio.
This means instead of making the entire original folio unusable memory
failure code would only make its after-split folio, which has order of
min_order_for_split() and contains HWPoison page, unusable.
For soft offline case, since the original folio is still accessible, no
split is performed if the folio cannot be split to order-0 to prevent
potential performance loss.
In addition, add split_huge_page_to_order() to improve code readability
and fix kernel-doc comment format for folio_split() and other related
functions.
Background
==========
This patchset is a follow-up of "[PATCH v3] mm/huge_memory: do not change
split_huge_page*() target order silently."[1] and [PATCH v4]
mm/huge_memory: preserve PG_has_hwpoisoned if a folio is split to >0
order[2], since both are separated out as hotfixes. It improves how
memory failure code handles large block size(LBS) folios with
min_order_for_split() > 0. By splitting a large folio containing HW
poisoned pages to min_order_for_split(), the after-split folios without HW
poisoned pages could be freed for reuse. To achieve this, folio split
code needs to set has_hwpoisoned on after-split folios containing HW
poisoned pages and it is done in the hotfix in [2].
This patchset includes:
1. A patch adds split_huge_page_to_order(),
2. Patch 2 and Patch 3 of "[PATCH v2 0/3] Do not change split folio target
order"[3],
This patch (of 3):
When the caller does not supply a list to
split_huge_page_to_list_to_order(), use split_huge_page_to_order()
instead.
Link: https://lkml.kernel.org/r/20251031162001.670503-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20251031162001.670503-2-ziy@nvidia.com
Link: https://lore.kernel.org/all/20251017013630.139907-1-ziy@nvidia.com/ [1]
Link: https://lore.kernel.org/all/20251023030521.473097-1-ziy@nvidia.com/ [2]
Link: https://lore.kernel.org/all/20251016033452.125479-1-ziy@nvidia.com/ [3]
Signed-off-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Luis Chamberalin <mcgrof@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Pankaj Raghav <kernel@pankajraghav.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MIGRATE_VMA_SELECT_COMPOUND will be used to select THP pages during
migrate_vma_setup() and MIGRATE_PFN_COMPOUND will make migrating device
pages as compound pages during device pfn migration.
migrate_device code paths go through the collect, setup and finalize
phases of migration.
The entries in src and dst arrays passed to these functions still remain
at a PAGE_SIZE granularity. When a compound page is passed, the first
entry has the PFN along with MIGRATE_PFN_COMPOUND and other flags set
(MIGRATE_PFN_MIGRATE, MIGRATE_PFN_VALID), the remaining entries
(HPAGE_PMD_NR - 1) are filled with 0's. This representation allows for
the compound page to be split into smaller page sizes.
migrate_vma_collect_hole(), migrate_vma_collect_pmd() are now THP page
aware. Two new helper functions migrate_vma_collect_huge_pmd() and
migrate_vma_insert_huge_pmd_page() have been added.
migrate_vma_collect_huge_pmd() can collect THP pages, but if for some
reason this fails, there is fallback support to split the folio and
migrate it.
migrate_vma_insert_huge_pmd_page() closely follows the logic of
migrate_vma_insert_page()
Support for splitting pages as needed for migration will follow in later
patches in this series.
Link: https://lkml.kernel.org/r/20251001065707.920170-8-balbirs@nvidia.com
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: support device-private THP", v7.
This patch series introduces support for Transparent Huge Page (THP)
migration in zone device-private memory. The implementation enables
efficient migration of large folios between system memory and
device-private memory
Background
Current zone device-private memory implementation only supports PAGE_SIZE
granularity, leading to:
- Increased TLB pressure
- Inefficient migration between CPU and device memory
This series extends the existing zone device-private infrastructure to
support THP, leading to:
- Reduced page table overhead
- Improved memory bandwidth utilization
- Seamless fallback to base pages when needed
In my local testing (using lib/test_hmm) and a throughput test, the series
shows a 350% improvement in data transfer throughput and a 80% improvement
in latency
These patches build on the earlier posts by Ralph Campbell [1]
Two new flags are added in vma_migration to select and mark compound
pages. migrate_vma_setup(), migrate_vma_pages() and
migrate_vma_finalize() support migration of these pages when
MIGRATE_VMA_SELECT_COMPOUND is passed in as arguments.
The series also adds zone device awareness to (m)THP pages along with
fault handling of large zone device private pages. page vma walk and the
rmap code is also zone device aware. Support has also been added for
folios that might need to be split in the middle of migration (when the
src and dst do not agree on MIGRATE_PFN_COMPOUND), that occurs when src
side of the migration can migrate large pages, but the destination has not
been able to allocate large pages. The code supported and used
folio_split() when migrating THP pages, this is used when
MIGRATE_VMA_SELECT_COMPOUND is not passed as an argument to
migrate_vma_setup().
The test infrastructure lib/test_hmm.c has been enhanced to support THP
migration. A new ioctl to emulate failure of large page allocations has
been added to test the folio split code path. hmm-tests.c has new test
cases for huge page migration and to test the folio split path. A new
throughput test has been added as well.
The nouveau dmem code has been enhanced to use the new THP migration
capability.
mTHP support:
The patches hard code, HPAGE_PMD_NR in a few places, but the code has been
kept generic to support various order sizes. With additional refactoring
of the code support of different order sizes should be possible.
The future plan is to post enhancements to support mTHP with a rough
design as follows:
1. Add the notion of allowable thp orders to the HMM based test driver
2. For non PMD based THP paths in migrate_device.c, check to see if
a suitable order is found and supported by the driver
3. Iterate across orders to check the highest supported order for migration
4. Migrate and finalize
The mTHP patches can be built on top of this series, the key design
elements that need to be worked out are infrastructure and driver support
for multiple ordered pages and their migration.
HMM support for large folios was added in 10b9feee2d ("mm/hmm:
populate PFNs from PMD swap entry").
This patch (of 16)
Add routines to support allocation of large order zone device folios and
helper functions for zone device folios, to check if a folio is device
private and helpers for setting zone device data.
When large folios are used, the existing page_free() callback in pgmap is
called when the folio is freed, this is true for both PAGE_SIZE and higher
order pages.
Zone device private large folios do not support deferred split and scan
like normal THP folios.
Link: https://lkml.kernel.org/r/20251001065707.920170-1-balbirs@nvidia.com
Link: https://lkml.kernel.org/r/20251001065707.920170-2-balbirs@nvidia.com
Link: https://lore.kernel.org/linux-mm/20201106005147.20113-1-rcampbell@nvidia.com/ [1]
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new test case that makes an interrupt pending on a vcpu,
activates it, do the priority drop, and then get *another* vcpu
to do the deactivation.
Special care is taken not to trigger an exit in the process, so
that we are sure that the active interrupt is in an LR. Joy.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-48-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Good news: our GIC emulation is not completely broken, and we can
activate as many interrupts as we want.
Bump the test to cover all the SGIs, all the allowed PPIs, and
31 SPIs. Yes, 31, because we have 31 available priorities, and the
test is not happy with having two interrupts with the same priority.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-46-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
FEAT_NV2 is pretty terrible for anything that tries to enforce immediate
effects, and writing to ICH_HCR_EL2 in the hope to disable a maintenance
interrupt is vain. This only hits memory, and the guest hasn't cleared
anything -- the MI will fire.
For example, running the vgic_irq test under NV results in about 800
maintenance interrupts being actually handled by the L1 guest,
when none were expected.
As a cheap workaround, read back ICH_MISR_EL2 after writing 0 to
ICH_HCR_EL2. This is very cheap on real HW, and causes a trap to
the host in NV, giving it the opportunity to retire the pending MI.
With this, the above test runs to completion without any MI being
actually handled.
Yes, this is really poor...
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-37-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Pretty much like the rest of the LR handling, deactivation of an
L2 interrupt gets reflected in the L1 LRs, and therefore must be
propagated into the L1 shadow state if the interrupt is HW-bound.
Instead of directly handling the active state (which looks a bit
off as it ignores locking and L1->L0 HW propagation), use the new
deactivation primitive to perform the deactivation and deal with
the required maintenance.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-36-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
The current approach to nested GICv3 support is to not do anything
while L2 is running, wait a transition from L2 to L1 to resync
LRs, VMCR and HCR, and only then evaluate the state to decide
whether to generate a maintenance interrupt.
This doesn't provide a good quality of emulation, and it would be
far preferable to find out early that we need to perform a switch.
Move the LRs/VMCR and HCR resync into vgic_v3_sync_nested(), so
that we have most of the state available. As we turning the vgic
off at this stage to avoid a screaming host MI, add a new helper
vgic_v3_flush_nested() that switches the vgic on again. The MI can
then be directly injected as required.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-35-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
CPUs lacking TDIR always trap ICV_DIR_EL1, no matter what, since
we have ICH_HCR_EL2.TC set permanently. For these CPUs, it is
useless to use a broadcast kick on SPI injection, as the sole
purpose of this is to set TDIR.
We can therefore skip this on these CPUs, which are challenged
enough not to be burdened by extra IPIs. As a consequence,
permanently set the TDIR bit in the shadow state to notify the
fast-path emulation code of the exit reason.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-34-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Even when we have either an LR overflow or SPIs in flight, it is
extremely likely that the interrupt being deactivated is still in
the LRs, and that going all the way back to the the generic trap
handling code is a waste of time.
Instead, try and deactivate in place when possible, and only if
this fails, perform a full exit.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-33-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
SPIs are specially annpying, as they can be activated on a CPU and
deactivated on another. WHich means that when an SPI is in flight
anywhere, all CPUs need to have their TDIR trap bit set.
This translates into broadcasting an IPI across all CPUs to make sure
they set their trap bit, The number of in-flight SPIs is kept in
an atomic variable so that CPUs can turn the trap bit off as soon
as possible.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-32-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Deactivation via ICV_DIR_EL1 is both relatively straightforward
(we have the interrupt that needs deactivation) and really awkward.
The main issue is that the interrupt may either be in an LR on
another CPU, or ourside of any LR.
In the former case, we process the deactivation is if ot was
a write to GICD_CACTIVERn, which is already implemented as a big
hammer IPI'ing all vcpus. In the latter case, we just perform
a normal deactivation, similar to what we do for EOImode==0.
Another annoying aspect is that we need to tell the CPU owning
the interrupt that its ap_list needs laudering. We use a brand new
vcpu request to that effect.
Note that this doesn't address deactivation via the GICV MMIO view,
which will be taken care of in a later change.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-29-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Now that we can identify interrupts that have not made it into the LRs,
it becomes relatively easy to use EOIcount to walk the overflow list.
What is a bit odd is that we compute a fake LR for the original
state of the interrupt, clear the active bit, and feed into the existing
logic for processing. In a way, this is what would have happened if
the interrupt was in an LR.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-28-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Now that we always reconfigure the vgic HCR register on entry,
the "enable" part of kvm_vgic_vcpu_enable() is pretty useless.
Removing the enable bits from these functions makes it plain that
they are just about computing the reset state. Just rename the
functions accordingly.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-23-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
We currently don't use the maintenance interrupt very much, apart
from EOI on level interrupts, and for LR underflow in limited cases.
However, as we are moving toward a setup where active interrupts
can live outside of the LRs, we need to use the MIs in a more
diverse set of cases.
Add a new helper that produces a digest of the ap_list, and use
that summary to set the various control bits as required.
This slightly changes the way v2 SGIs are handled, as they used to
count for more than one interrupt, but not anymore.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-22-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
We currently save/restore the VMCR register in a pretty lazy way
(on load/put, consistently with what we do with the APRs).
However, we are going to need the group-enable bits that are backed
by VMCR on each entry (so that we can avoid injecting interrupts for
disabled groups).
Move the synchronisation from put to sync, which results in some minor
churn in the nVHE hypercalls to simplify things.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-21-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Not programming GICH_HCR while no LRs are populated is a bit
of an issue, as we otherwise don't see any maintenance interrupt
when the guest interacts with the LRs.
Decouple the two and always program the control register, even when
we don't have to touch the LRs.
This is very similar to what we are already doing for GICv3.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-17-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
EOIcount is how the virtual CPU interface signals that the guest
is deactivating interrupts outside of the LRs when EOImode==0.
We therefore need to preserve that information so that we can find
out what actually needs deactivating, just like we already do on
GICv3.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-16-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Not programming ICH_HCR_EL2 while no LRs are populated is a bit
of an issue, as we otherwise don't see any maintenance interrupt
when the guest interacts with the LRs.
Decouple the two and always program the control register, even when
we don't have to touch the LRs.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-13-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Despite LPIs not having an active state, *virtual* LPIs do have
one, which gets cleared on EOI. So far, so good.
However, this leads to a small problem: when an active LPI is not
in the LRs, that EOImode==0 and that the guest EOIs it, EOIcount
doesn't get bumped up. Which means that in these condition, the
LPI would stay active forever.
Clearly, we can't have that. So if we spot an active LPI, we drop
that state. It's pretty pointless anyway, and only serves as a way
to trip SW over.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-11-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
A long time ago, an unsuspecting architect forgot to add a trap
bit for ICV_DIR_EL1 in ICH_HCR_EL2. Which was unfortunate, but
what's a bit of spec between friends? Thankfully, this was fixed
in a later revision, and ARM "deprecates" the lack of trapping
ability.
Unfortuantely, a few (billion) CPUs went out with that defect,
anything ARMv8.0 from ARM, give or take. And on these CPUs,
you can't trap DIR on its own, full stop.
As the next best thing, we can trap everything in the common group,
which is a tad expensive, but hey ho, that's what you get. You can
otherwise recycle the HW in the neaby bin.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-7-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
As we are about to start trapping a bunch of extra things, augment
the pKVM trap description with all the registers trapped by ICH_HCR_EL2.TC,
making them legal instead of resulting in a UNDEF injection in the guest.
While we're at it, ensure that pKVM captures the vgic model so that it
can be checked by the emulation code.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-6-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
The trap bits are currently only set to manage CPU errata. However,
we are about to make use of them for purposes beyond beating broken
CPUs into submission.
For this purpose, turn these errata-driven bits into a patched-in
constant that is merged with the KVM-driven value at the point of
programming the ICH_HCR_EL2 register, rather than being directly
stored with with the shadow value..
This allows the KVM code to distinguish between a trap being handled
for the purpose of an erratum workaround, or for KVM's own need.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-5-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Add support for FEAT_XNX to shadow stage-2 MMUs, being careful to only
evaluate XN[0] when the feature is actually exposed to the VM.
Restructure the layering of permissions in the fault handler to assume
pX and uX then restricting based on the guest's stage-2 afterwards.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251124190158.177318-4-oupton@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
FEAT_XNX adds support for encoding separate execute permissions for EL0
and EL1 at stage-2. Add support for this to the page table library,
hiding the unintuitive encoding scheme behind generic pX and uX
permission flags.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://msgid.link/20251124190158.177318-3-oupton@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
Most implementations cache the combined result of two-stage translation,
but some, like Andes cores, use split TLBs that store VS-stage and
G-stage entries separately.
On such systems, when a VCPU migrates to another CPU, an additional
HFENCE.VVMA is required to avoid using stale VS-stage entries, which
could otherwise cause guest faults.
Introduce a static key to identify CPUs with split two-stage TLBs.
When enabled, KVM issues an extra HFENCE.VVMA on VCPU migration to
prevent stale VS-stage mappings.
Signed-off-by: Hui Min Mina Chou <minachou@andestech.com>
Signed-off-by: Ben Zong-You Xie <ben717@andestech.com>
Reviewed-by: Radim Krčmář <rkrcmar@ventanamicro.com>
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Link: https://lore.kernel.org/r/20251117084555.157642-1-minachou@andestech.com
Signed-off-by: Anup Patel <anup@brainfault.org>
When executing HLV* instructions at the HS mode, a guest page fault
may occur when a g-stage page table migration between triggering the
virtual instruction exception and executing the HLV* instruction.
This may be a corner case, and one simpler way to handle this is to
re-execute the instruction where the virtual instruction exception
occurred, and the guest page fault will be automatically handled.
Fixes: b91f0e4cb8 ("RISC-V: KVM: Factor-out instruction emulation into separate sources")
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20251121133543.46822-1-fangyu.yu@linux.alibaba.com
Signed-off-by: Anup Patel <anup@brainfault.org>
There is already support of enabling dirty log gradually in small chunks
for x86 in commit 3c9bd4006b ("KVM: x86: enable dirty log gradually in
small chunks") and c862626 ("KVM: arm64: Support enabling dirty log
gradually in small chunks"). This adds support for riscv.
x86 and arm64 writes protect both huge pages and normal pages now, so
riscv protect also protects both huge pages and normal pages.
On a nested virtualization setup (RISC-V KVM running inside a QEMU VM
on an [Intel® Core™ i5-12500H] host), I did some tests with a 2G Linux
VM using different backing page sizes. The time taken for
memory_global_dirty_log_start in the L2 QEMU is listed below:
Page Size Before After Optimization
4K 4490.23ms 31.94ms
2M 48.97ms 45.46ms
1G 28.40ms 30.93ms
Signed-off-by: Quan Zhou <zhouquan@iscas.ac.cn>
Signed-off-by: Dong Yang <dayss1224@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20251103062825.9084-1-dayss1224@gmail.com
Signed-off-by: Anup Patel <anup@brainfault.org>
This adds the two PWM-controlled fans of the HiFive Unmatched board to
the device tree.
Signed-off-by: René Rebe <rene@exactco.de>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Commit a52ddb98a6 ("memory: tegra186-emc: Simplify and handle deferred
probe with dev_err_probe()") accidently dropped a call to 'put_bpmp' to
release a handle to the BPMP when getting the EMC clock fails. Fix this
by restoring the 'goto put_bpmp' if devm_clk_get() fails.
Fixes: a52ddb98a6 ("memory: tegra186-emc: Simplify and handle deferred probe with dev_err_probe()")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://patch.msgid.link/20251106190550.1776974-1-jonathanh@nvidia.com
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Rockchip support for basic camera interface (CIF) and Synopsis DW-DP
driver, as well as the CEC extension to the DW-HDMI-QP driver.
* tag 'v6.19-rockchip-defconfig64-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
arm64: defconfig: enable rockchip camera interface
arm64: defconfig: Enable DW HDMI QP CEC support
arm64: defconfig: Enable Rockchip extensions for Synopsys DW DP
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Qualcomm Arm64 defconfig updates for v6.19
Enable config options for the hardare used across Fairphone 3, 4, and 5.
Then enable Novatek display panels founds on Xiaomi Pocophone F1, and
the SM8750 MTP, eUSB2 PHY found in SM8750, NSS clock controller found in
IPQ5424, the SX150x gpio expander used in QCS615 reference device, and
the support for UFS inline crypto.
* tag 'qcom-arm64-defconfig-for-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
arm64: defconfig: Enable SX150x GPIO expander driver
arm64: defconfig: Build NSS clock controller driver for IPQ5424
arm64: defconfig: Enable SCSI UFS Crypto and Block Inline encryption drivers
arm64: defconfig: Add M31 eUSB2 PHY config
arm64: defconfig: Enable configs for Fairphone 3, 4, 5 smartphones
arm64: defconfig: Enable two Novatek display panels for MTP8750 and Tianma
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Microchip AT91 defconfig updates for v6.19
This update includes:
- CONFIG_MMC_SPI is set to module for at91_dt_defconfig
* tag 'at91-defconfig-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/at91/linux:
ARM: at91: at91_dt_defconfig: set MMC_SPI to module
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
arm64: tegra: Default configuration changes for v6.19-rc1
Enable the new driver for the VRS PSEQ RTC found on Tegra234 and later.
* tag 'tegra-for-6.19-arm64-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
arm64: defconfig: Enable NVIDIA VRS PSEQ RTC
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
ARM: tegra: Default configuration changes for v6.19-rc1
Enable ext4 by default on Tegra to restore systems booting from MMC.
* tag 'tegra-for-6.19-arm-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
ARM: tegra: Enable EXT4 for Tegra
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
MediaTek defconfig updates
As MediaTek boards with UFS appeared some time ago, this adds a
single commit enabling the MediaTek UFS driver, allowing those
boards to boot over UFS as primary storage.
* tag 'mtk-defconfig-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/mediatek/linux:
arm64: defconfig: Enable UFS support for MediaTek Genio 1200 EVK UFS board
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Renesas ARM defconfig updates for v6.19
- Enable support for the Renesas RZ/G3S and RZ/G3E thermal drivers,
and the RZ/T2H and RZ/N2H ADC drivers in the ARM64 defconfig,
- Refresh the ARM SH-Mobile defconfig for v6.18-rc1.
* tag 'renesas-arm-defconfig-for-v6.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel:
arm64: defconfig: Enable RZ/T2H / RZ/N2H ADC driver
ARM: shmobile: defconfig: Refresh for v6.18-rc1
arm64: defconfig: Enable the Renesas RZ/G3E thermal driver
arm64: defconfig: Enable Renesas RZ/G3S thermal driver
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Initial Anlogic Platform Support
Add bindings for the serial and timer peripherals, and a basic soc dtsi
for the Anlogic dr1v90 SoC. The Milianke MLKPAI FS01 is the first board
for this SoC. Add myself as maintainer for this platform for the time
being.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'anlogic-initial-6.19-v2' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
MAINTAINERS: Setup support for Anlogic tree
riscv: defconfig: Enable Anlogic SoC
riscv: dts: anlogic: Add Milianke MLKPAI FS01 board
riscv: dts: Add initial Anlogic DR1V90 SoC device tree
riscv: Add Anlogic SoC famly Kconfig support
dt-bindings: serial: snps-dw-apb-uart: Add Anlogic DR1V90 uart
dt-bindings: timer: Add Anlogic DR1V90 ACLINT MTIMER
dt-bindings: riscv: Add Anlogic DR1V90
dt-bindings: riscv: Add Nuclei UX900 compatibles
dt-bindings: vendor-prefixes: Add Anlogic, Milianke and Nuclei
MediaTek mach ARM32 updates
This adds support for the MT6582 SoC and its SMP bringup code.
This SoC is found in old smartphones and tablets from various
manufacturers.
* tag 'mtk-arm32-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/mediatek/linux:
ARM: mediatek: add MT6582 smp bring up code
ARM: mediatek: add board_dt_compat entry for the MT6582 SoC
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
This patch series introduces platform support for Black Sesame Technologies
(BST) C1200 SoC and CDCU1.0 ADAS 4C2G board. BST is a leading automotive-grade
computing SoC provider focusing on intelligent driving, computer vision, and AI
capabilities for ADAS and autonomous driving applications. You can find more
information about the SoC and related boards at: https://bst.ai
This series provides the foundational platform enablement including device tree
bindings, SoC and board device trees, platform configuration, and maintainer
information. MMC/SDHCI driver support will be submitted in a separate patch series.
* bst/newsoc:
MAINTAINERS: add Black Sesame Technologies (BST) ARM SoC support
arm64: defconfig: enable BST platform support
arm64: dts: bst: add support for Black Sesame Technologies C1200 CDCU1.0 board
arm64: Kconfig: add ARCH_BST for Black Sesame Technologies SoCs
dt-bindings: arm: add Black Sesame Technologies (bst) SoC
dt-bindings: vendor-prefixes: Add Black Sesame Technologies Co., Ltd.
Link: https://lore.kernel.org/all/20251016120558.2390960-1-yangzh0906@thundersoft.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Add a MAINTAINERS entry for Black Sesame Technologies (BST) ARM SoC
support. This entry covers device tree bindings, drivers, and board files
for BST SoCs, and platform support.
Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Enable support for Black Sesame Technologies (BST) platform
in the ARM64 defconfig:
- CONFIG_ARCH_BST: Enable BST SoC platform support
Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Add device tree support for the Black Sesame Technologies (BST) C1200
CDCU1.0 ADAS 4C2G platform. This platform is based on the BST C1200 SoC
family.
The changes include:
- Adding a new BST device tree directory
- Adding Makefile entries to build the BST platform device trees
- Adding the device tree for the BST C1200 CDCU1.0 ADAS 4C2G board
This board features a quad-core Cortex-A78 CPU, and various peripherals
including UART, and interrupt controller.
Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Add ARCH_BST configuration option to enable support for Black Sesame
Technologies SoC family. BST produces automotive-grade system-on-chips
for intelligent driving, focusing on computer vision and AI capabilities.
The BST C1200 family includes SoCs for ADAS and autonomous driving
applications.
Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Add device tree bindings for Black Sesame Technologies Arm SoC,
it consists several SoC models like C1200, etc.
Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Black Sesame Technologies Co., Ltd.s a leading automotive-grade
computing SoC and SoC-based intelligent vehicle solution provider.
Link: https://bst.ai/.
Signed-off-by: Albert Yang <yangzh0906@thundersoft.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
TI K3 device tree updates for v6.19 part2
Late fixes and cleanups:
* Fix build warnings for unapplied overlays for PHYTEC, SA67 and certain TI EVM
* Fix pinmux of SD regulator control line on J721e SK
* Correct unit address of cbass_wakeup node for AM62L
* tag 'ti-k3-dt-for-v6.19-part2' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux:
arm64: dts: ti: k3-am62l: Fix unit address of cbass_wakeup
arm64: dts: ti: k3-j721e-sk: Fix pinmux for pin Y1 used by power regulator
arm64: dts: ti: Add missing applied DT overlay targets
arm64: dts: ti: sa67: add build time dtb for overlays
arm64: dts: ti: Enable build testing of PHYTEC board overlays
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
RISC-V Devicetrees for v6.19
Sophgo:
For CV18xx serials:
Add top syscon device related DTS change, the top system
controller provides register access to configure some
misc modules, such as usb2 phy and a dma multiplexer.
For SG2042:
There are two changes. The first one is to add DTS
definition for PCIe controllers for SoC SG2042 and
boards such as Pioneerbox/EVB_V1/EVB_V2 uses SG2042.
The second one is to add DTS to support SPI-NOR flash
controllers for this SoC and the same for related boards.
Signed-off-by: Chen Wang <unicorn_wang@outlook.com>
* tag 'riscv-sophgo-dt-for-v6.19' of https://github.com/sophgo/linux:
riscv: dts: sophgo: Enable SPI NOR node for SG2042_EVB_V2
riscv: dts: sophgo: Enable SPI NOR node for SG2042_EVB_V1
riscv: dts: sophgo: Enable SPI NOR node for PioneerBox
riscv: dts: sophgo: Add SPI NOR node for SG2042
riscv: dts: sophgo: Add USB support for cv18xx
riscv: dts: sophgo: Add syscon node for cv18xx
dt-bindings: soc: sophgo: add TOP syscon for CV18XX/SG200X series SoC
riscv: sophgo: dts: enable PCIe for SG2042_EVB_V2.0
riscv: sophgo: dts: enable PCIe for SG2042_EVB_V1.X
riscv: sophgo: dts: enable PCIe for PioneerBox
riscv: sophgo: dts: add PCIe controllers for SG2042
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
STM32 DT for v6.19, round 1
Highlights:
-----------
- MPU:
- STM32MP13:
- Add and enable the ARM SMC watchdog to use IWDG1 in the secure
world.
- STMP32MP15:
- Phytec SOM: Fix STMPE811 touchscreen
- LXA: drop unnecessary vusb_d/a-supply as already defined by
"phy-supply" and "vdda1v8-supply".
- STM32MP23:
- Use the RIFSC as an access controler (firewall) as it is done
for STM32MP25 and STM32MP23.
- STM32MP25:
- Add OSPI memory region name.
- Add I/O synchronization properties to satisfy RGMII
specification.
* tag 'stm32-dt-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/atorgue/stm32:
arm64: dts: st: set RIFSC as an access controller on stm32mp21x platforms
ARM: dts: stm32: add the IWDG2 interrupt line in stm32mp131.dtsi
ARM: dts: stm32: enable the ARM SMC watchdog node in stm32mp135f-dk
ARM: dts: stm32: add the ARM SMC watchdog in stm32mp131.dtsi
ARM: dts: stm32: add iwdg1 node in stm32mp131.dtsi
arm64: dts: st: Add I/O sync to eth pinctrl in stm32mp25-pinctrl.dtsi
arm64: dts: st: Add memory-region-names property for stm32mp257f-ev1
ARM: dts: stm32: lxa: drop unnecessary vusb_d/a-supply
ARM: dts: stm32: stm32mp157c-phycore: Fix STMPE811 touchscreen node properties
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
New boards: 9Tripod X3568, 100ASK DShanPi A1, LinkEase EasePi R1,
FriendlyElec NanoPi R76S
Interesting archeological addition: RK3368 (2015) gets display
output afterall.
New peripherals: vicap on px30 and rk356x, PCIe Gen2x1 on RK3528,
use actual clock-ids for SCMI clocks - not hardcoded numbers,
CQE support for the eMMC on RK3588.
As well as a number of enablements for individual boards.
For example enablement for the now usable NPU.
* tag 'v6.19-rockchip-dts64-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip: (43 commits)
arm64: dts: rockchip: add vicap node to rk356x
arm64: dts: rockchip: add the vip node to px30
arm64: dts: rockchip: fixes audio for 100ASK DshanPi A1
arm64: dts: rockchip: fixes vcc3v3_s0 supply for 100ASK DshanPi A1
arm64: dts: rockchip: fixes ethernet for 100ASK DshanPi A1
arm64: dts: rockchip: fixes regulator for 100ASK DshanPi A1
arm64: dts: rockchip: correct assigned-clock-rates spelling on 2 boards
arm64: dts: rockchip: clean up devicetree for 9Tripod X3568 v4
arm64: dts: rockchip: Enable USB-C DP Alt for Indiedroid Nova
arm64: dts: rockchip: add eMMC CQE support for rk3588
arm64: dts: rockchip: enable HDMI audio on Rock 5 ITX
arm64: dts: rockchip: Add eeprom vcc-supply for Radxa ROCK 3C
arm64: dts: rockchip: Add eeprom vcc-supply for Radxa ROCK 5A
arm64: dts: rockchip: Move the EEPROM to correct I2C bus on Radxa ROCK 5A
arm64: dts: rockchip: use SCMI clock id for gpu clock on rk356x
arm64: dts: rockchip: Remove sdmmc max-frequency on RK3588S EVB1 board
arm64: dts: rockchip: Remove sdmmc max-frequency for Radxa ROCK 5 ITX/5B/5B+/5T
arm64: dts: rockchip: Switch microSD card detect to gpio on Radxa ROCK 5 ITX/5C
arm64: dts: rockchip: Add devicetree for the 9Tripod X3568 v4
dt-bindings: arm: rockchip: Add 9Tripod X3568 series
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
A number of cleanups for older socs.
* tag 'v6.19-rockchip-dts32-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
ARM: dts: rockchip: move edp assigned-clocks to edp node on rk3288
ARM: dts: rockchip: Add spi_flash label to rk3288-veyron
ARM: dts: rockchip: Remove mshc aliases from RK3288
ARM: dts: rockchip: Adapt tps65910 nodes on RK3066 boards
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Commit 23db6eed72bd ("MAINTAINERS: Add Jonathan Cameron to drivers/cache
and add lib/cache_maint.c + header") intends to add a file entry pointing
to the cache_coherency.h file, but messes up to name the right path.
Update the entry to the intended file.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Hydra Home Agent is a device used to maintain cache coherency. Add support
for explicit cache maintenance operations using it. A system has multiple
of these agents. Whilst only one agent is responsible for a given cache
line, interleave means that for a range operation, responsibility for the
cache lines making up the range will typically be spread across multiple
instances.
Put this driver on a new Kconfig menu under drivers/cache. The short
description as memory hotplug like operations is intended to cover
the somewhat complex set of cases where this unit applies and differentiate
it clearly from typical non coherent DMA flows.
Co-developed-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Yushan Wang <wangyushan12@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
The next patch will add a new type of cache maintenance driver responsible
for flushing deeper than is necessary for non coherent DMA (current
use case of drivers/cache drivers), as needed when performing operations
such as memory hotplug and security unlocking of persistent memory. The two
types of operation are similar enough to share a drivers/cache directory
and MAINTAINERS but are otherwise currently unrelated.
To avoid confusion have two separate menus. Each has dependencies that are
implemented by making them boolean symbols, here CACHEMAINT_FOR_DMA
which is dependent on RISCV as all driver are currently for platforms of
that architecture. Set new symbol default to y to avoid breaking existing
configs. This has no affect on actual code built, just visibility of the
menu.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Seems unfair to inflict the cache-coherency drivers on Conor with out also
stepping up as a second maintainer for drivers/cache.
Include the library support for cache-coherency maintenance drivers to the
existing entry.
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
The generic CPU cache maintenance framework provides a way to register
drivers for devices implementing the underlying support for
cpu_cache_has_invalidate_memregion(). Enable it for arm64 by selecting
GENERIC_CPU_CACHE_MAINTENANCE which provides the implementation for,
and in turn selects, ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION.
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
ARCH_HAS_CPU_CACHE_INVALIDATE_MEMREGION provides the mechanism for
invalidating certain memory regions in a cache-incoherent manner. Currently
this is used by NVDIMM and CXL memory drivers in cases where it is
necessary to flush all data from caches by physical address range.
The operations in question are effectively memory hotplug, where stale
data might otherwise remain in the caches.
This is separate from the invalidates done to enable use of non-coherent
DMA masters, primarily in terms of when it is needed (not related to DMA
mappings) and how deep the flush must push data. The flushes done for
non-coherent DMA only need to reach the Point of Coherence of a single host
(which is often nearer CPUs and DMA masters than the physical storage).
This operation must push the data out of non architectural caches
(memory-side caches, write buffers etc) and typically all the way to the
memory device.
In some architectures these operations are supported by system components
that may become available only later in boot as they are either present
on a discoverable bus, or via a firmware description of an MMIO interface
(e.g. ACPI DSDT). Provide a framework to handle this case.
Architectures can opt in for this support via
CONFIG_GENERIC_CPU_CACHE_MAINTENANCE
Add a registration framework. Each driver provides an ops structure and
the first op is Write Back and Invalidate by PA Range. The driver may
over invalidate.
For systems that can perform this operation asynchronously an optional
completion check operation is also provided. If present that must be called
to ensure that the action has finished. This provides a considerable
performance advantage if multiple agents are involved in the maintenance
operation.
When multiple agents are present in the system each should register with
this framework and the core code will issue the invalidate to all of them
before checking for completion on each. This is done to avoid need for
filtering in the core code which can become complex when interleave,
potentially across different cache coherency hardware is going on, so it
is easier to tell everyone and let those who don't care do nothing.
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Call paths leading to __virt_pg_map() are currently:
(a) virt_pg_map() -> virt_arch_pg_map() -> __virt_pg_map()
(b) virt_map_level() -> __virt_pg_map()
For (a), calls to virt_pg_map() from kvm_util.c make sure they update
vm->vpages_mapped, but other callers do not. Move the sparsebit_set()
call into virt_pg_map() to make sure all callers are captured.
For (b), call sparsebit_set_num() from virt_map_level().
It's tempting to have a single the call inside __virt_pg_map(), however:
- The call path in (a) is not x86-specific, while (b) is. Moving the
call into __virt_pg_map() would require doing something similar for
other archs implementing virt_pg_map().
- Future changes will reusue __virt_pg_map() for nested PTEs, which should
not update vm->vpages_mapped, i.e. a triple underscore version that does
not update vm->vpages_mapped would need to be provided.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251021074736.1324328-12-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
Qualcomm Arm64 DeviceTree updates for v6.19
Introduce support for the Redxa Dragon Q6A development board, the Huawei
MateBoot E 2019, the Asus ZenFone 2 Laser/Selfie, the MSM8937 platform
and the Xiaomo Redmi 3S device based on it.
SoC dtsi files for Agatti, Hamoa, Kodiak, Monaco, Purwa, and Talos, are
renamed in order to better facilitate the addition of new boards on the
various SKUs of these.
Cooling maps are introduced for the CPU cores in IPQ5424, and the
network subsystem clock controller is added.
On Lemans, RTC is enabled, the EVK fan controller is described and a
camera mezzanine overlay is introduced.
Touchscreen support is added to the BQ Aquaris M5, and the touchscreen
from Samsung Galaxy Core Prime is moved to the common platform to
benefit the other devices sharing common definitions.
On Agatti two more UARTs are described, as well as APR and the related audio
services, and the LPASS LPI pin controller. The RB1 board gets HDMI
autio playback support.
On Kodiak-based targets, Fairphone FP5 gains definitions of the UW camera
actuator, regulator for the ToF sensor, and haptic module. The SHIFT
SHIFTphone 8 gains RGB and flash LEDs, and Venus support. The Rb3Gen2
development board gets QUP firmware path defined, to support dynamic
loading of the serial engine firmware. Kodiak also gains Coresight
devices for AOSS and QDSS blocks.
Display support is added for the Talos platform, and enabled on the Ride
board. Talos also gains the definitions to scale DDR and L3
interconnects.
On SC8280XP, the camera privacy indicator on Lenovo Thinkpad X13s is
connected to the camera stack. Off-by-one GPI DMA channels are
corrected.
The SDM845-based LG and OnePlus custom defined rmtfs guard pages are
replaced with the inline-support for guard pages.
SDX75 DWC3 node is flattened and marked for USB role switching.
On SM8550, the camera subsystem and the S5K3M5 camera sensor is
introduced for the QRD, and an overlay for the "Rear Camera Card" for
the Hardware Development Kit (HDK) is introduced.
USB support is introduce for the SM8750 platform, and enabled in the MTP
and QRD devices.
On Hamoa, like on other devices the Asus Zenbook A14 definition of the
eDP panel is reworked to support both LCD and OLED configurations. WiFi
and Bluetooth is also enabled on the A14. The CRD gains support for
controlling charge limits.
The refgen regulator supplying DSI is defined and wired up on a variety
of platforms.
* tag 'qcom-arm64-for-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (138 commits)
arm64: dts: qcom: sdx75: Add missing usb-role-switch property
arm64: dts: qcom: sdx75: Flatten usb controller node
arm64: dts: qcom: HAMOA-IOT-SOM: Unreserve GPIOs blocking SPI11 access
arm64: dts: qcom: qrb2210-rb1: Fix UART3 wakeup IRQ storm
Revert "arm64: dts: qcom: sc7280: Increase config size to 256MB for ECAM feature"
arm64: dts: qcom: kodiak: add coresight nodes
arm64: dts: qcom: sdm845-oneplus: Describe TE gpio
arm64: dts: qcom: sdm845-oneplus: Implement panel sleep pinctrl
arm64: dts: qcom: sdm845-oneplus: Group panel pinctrl
arm64: dts: qcom: sdm845-oneplus: Update compatbible and add DDIC supplies
arm64: dts: qcom: qcs6490-rb3gen2: Rename vph-pwr regulator node
arm64: dts: qcom: qcm6490-fairphone-fp5: Add UW cam actuator
arm64: dts: qcom: qcm6490-fairphone-fp5: Enable CCI pull-up
arm64: dts: qcom: sm8750: Add USB support for SM8750 QRD platform
arm64: dts: qcom: sm8750: Add USB support for SM8750 MTP platform
arm64: dts: qcom: sm8750: Add USB support to SM8750 SoCs
arm64: dts: qcom: rename x1p42100 to purwa
arm64: dts: qcom: rename sc7280 to kodiak
arm64: dts: qcom: rename qcm2290 to agatti
arm64: dts: qcom: add gpu_zap_shader label
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Qualcomm Arm32 DeviceTree updates for v6.19
In addition to a variety of cleanups and reordering of nodes, four GSBIs
are added to the MSM8960 platform.
On the MSM8226-based Samsung Galaxy Grand 2, a simple framebuffer is
defined.
* tag 'qcom-arm32-for-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
ARM: dts: qcom: msm8226-samsung-ms013g: add simple-framebuffer
ARM: dts: qcom: msm8960: rename msmgpio node to tlmm
ARM: dts: qcom: msm8960: add I2C nodes for gsbi1 and gsbi8
ARM: dts: qcom: msm8960: add I2C nodes for gsbi10 and gsbi12
ARM: dts: qcom: msm8960: inline qcom-msm8960-pins.dtsi
ARM: dts: qcom: msm8960: reorder nodes and properties
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
TI K3 device tree updates for v6.19
Generic fixes and cleanups:
* Multiple SoCs: Disable CPSW in SoC files and
enable them in board files for better board-level control
* Replace rgmii-rxid with rgmii-id for CPSW ports across multiple boards
New Boards/SoM:
* AM62L SoC and basic support for EVM
* Toradex Aquila AM69 board support
* Kontron SMARC-sAM67 module and ADS2 carrier board support
Platform wide:
* Define possible system states amd wakeup-source (AM62/AM62A/AM62P)
SoC/EVM specific changes:
AM62:
* Add RNG node
* Add OLDI support
AM62P:
* Move audio_refclk to common main dtsi (k3-am62p-j722s-common-main)
* Fix memory ranges for GPU
AM62D2:
* Enable PMIC support on EVM
* Misc fixes
AM64:
* Add DMA support for TSCADC on EVM
AM69:
* Add Aquila board support with Clover variant
J722S:
* Fix audio refclk source in main dtsi
* Explicitly use PLL1_HSDIV6 audio refclk for EVM
J784S4/J742S2:
* Add bootph-all tag to support PCIe boot
Variscite VAR-SOM-AM62P:
* Add support for ADS7846 touchscreen
* Add support for WM8904 audio codec
* tag 'ti-k3-dt-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux: (42 commits)
arm64: dts: ti: k3-am62l: add initial reference board file
arm64: dts: ti: k3-am62l: add initial infrastructure
dt-bindings: arm: ti: Add binding for AM62L SoCs
arm64: dts: ti: am69-aquila: Add Clover
arm64: dts: ti: Add Aquila AM69 Support
dt-bindings: arm: ti: add Toradex Aquila AM69
arm64: dts: ti: k3-j721s2: disable "mcu_cpsw" in SoC file and enable in board files
arm64: dts: ti: k3-j721e: disable "mcu_cpsw" in SoC file and enable it in board file
arm64: dts: ti: k3-j7200: disable "mcu_cpsw" in SoC file and enable in board file
arm64: dts: ti: k3-am65: disable "mcu_cpsw" in SoC file and enable in board file
arm64: dts: ti: k3-am62: disable "cpsw3g" in SoC file and enable in board file
arm64: dts: ti: k3-am62p5-sk: Set wakeup-source system-states
arm64: dts: ti: k3-am62a7-sk: Set wakeup-source system-states
arm64: dts: ti: k3-am62-lp-sk: Set wakeup-source system-states
arm64: dts: ti: k3-am62p: Define possible system states
arm64: dts: ti: k3-am62a: Define possible system states
arm64: dts: ti: k3-am62: Define possible system states
arm64: dts: ti: k3-am62p-j722s-common-main: move audio_refclk here
arm64: dts: ti: k3-*: Replace rgmii-rxid with rgmii-id for CPSW ports
arm64: dts: ti: k3-am642-tqma64xxl: add boot phase tags
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
i.MX arm64 device tree changes for 6.19:
- New board support: Protonic PRT8ML, Toradex SMARC iMX95, Skov Rev.C
HDMI, i.MX 95 Verdin Evaluation KitPHYTEC phyBOARD-Segin-i.MX91 board,
Skov i.MX8MP variant
- A series from Alexander Stein to clean up and improve imx95-tqma9596sa
board support
- Add MicIn routing support for mba8mx boards
- A couple of patch sets from Frank Li to clean up dt-schema warnings
and add more device support for imx8dxl and imx8qxp boards
- A series from Ioana Ciornei to add FPGA based GPIO controller and SFP+
cages for layerscape boards
- A change from Jan Petrous to add GMAC Ethernet for S32G2 EVB, RDB2 and
S32G3 RDB3 boards
- A series from Markus Niebel to improve imx95-tqma9596sa board support
- A couple of changes from Martin Kepplinger-Novaković to enable cpuidle
cooling device support for imx8mp
- A series from Max Krummenacher to clean up todo and add thermal
support for imx8-apalis board
- A series from Primoz Fiser to add USB vbus regulators, jtag and
pwm-fan overlay for imx93-phyboard
- A couple of series from Richard Zhu to add supports-clkreq property
and vpcie3v3aux regulator for PCIe M.2 device
- A series from Stefano Radaelli to add WiFi, BT, PMIC, WM8904 audio,
and ADS7846 touchscreen support for imx93-var-som
- A series from Tim Harvey to make some cleanups for imx8mm-venice
boards
- A change from Xu Yang to add DDR Perf Monitor support for i.MX94
- Other small and random changes
* tag 'imx-dt64-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux: (122 commits)
arm64: dts: freescale: add Toradex SMARC iMX95
arm64: dts: freescale: tqma9352: Add vcc-supply for spi-nor
arm64: dts: mb-smarc-2: Add MicIn routing
arm64: dts: mba8xx: Add MicIn routing
arm64: dts: mba8mx: Add MicIn routing
arm64: dts: imx8mp: make 'dsp' node depend on 'aips5'
arm64: dts: imx8mp: convert 'aips5' to 'aipstz5'
arm64: dts: imx8mp-skov: add Rev.C HDMI support
arm64: dts: imx8mp: Add missing LED enumerators for DH electronics i.MX8M Plus DHCOM on PDK2
arm64: dts: freescale: Add GMAC Ethernet for S32G2 EVB and RDB2 and S32G3 RDB3
arm64: dts: imx8qm-apalis: add pwm used by the backlight
arm64: dts: imx95-tqma9596sa-mb-smarc-2: add aliases for SPI
arm64: dts: imx95-tqma9596sa-mb-smarc-2: remove superfluous line
arm64: dts: imx95-tqma9596sa-mb-smarc-2: mark LPUART1 as reserved
arm64: dts: imx95-tqma9596sa-mb-smarc-2: Add MicIn routing
arm64: dts: imx95-tqma9596sa: add EEPROM pagesize
arm64: dts: imx95-tqma9596sa: whitespace fixes
arm64: dts: imx95-tqma9596sa: add gpio bus recovery for i2c
arm64: dts: imx95-tqma9596sa: remove superfluous pinmux for usdhci
arm64: dts: imx95-tqma9596sa: remove superfluous pinmux for i2c
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
i.MX ARM device tree changes for 6.19:
- A bunch of dt-schema warning cleanup patches from Frank Li
- A couple of imx6dl-yapp4 board update from Michal Vokáč to enable
pwm-beeper and model the RGB LED as a single multi-led part
- Enable PMIC RTC on imx53-qsrb board
- Correct rtc compatible for imx6q-evi board
- Add sy7636 support for e70k02 board
- Replace license text comment with SPDX identifier for imx53-usbarmory
board
- Add I2S audio support for imx28-amarula-rmm board
* tag 'imx-dt-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux: (29 commits)
ARM: dts: imx6qdl: make VAR-SOM SoM SoC-agnostic
ARM: dts: imx6dl-yapp4: Model the RGB LED as a single multi-led part
ARM: dts: imx6dl-yapp43: Enable pwm-beeper on boards with speaker
ARM: dts: imx: e70k02: add sy7636
ARM: dts: imx28-amarula-rmm: add I2S audio
ARM: dts: imx: add vdd-supply and vddio-supply for fsl,mpl3115
ARM: dts: imx7ulp: remove bias-pull-up
ARM: dts: remove undocumented clock-names for ov5642
ARM: dts: add device_type for memory node
ARM: dts: Add bus type for parallel ov5640
ARM: dts: imx6q-cm-fx6.dts: add supplies for wm8731
ARM: dts: imx6qdl-skov-cpu fix typo interrupt
ARM: dts: imx: remove redundant linux,phandle
ARM: dts: imx6ull-dhcom-pdk2: rename power-supply to vcc-supply for touchscreen
ARM: dts: imx: add power-supply for lcd panel
ARM: dts: imx6qdl-nitrogen6_max: rename i2c<n>mux to i2c
ARM: dts: imx6ull-phytec-tauri: remove extra space before jedec,spi-nor
ARM: dts: imx6q-utilite-pro: add missing required property for pci
ARM: dts: imx6-tbs2910: rename ir_recv to ir-receiver
ARM: dts: imx6: remove pinctrl-name if pinctrl-0 doesn't exist
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Allwinner device tree changes for 6.19
The A523 family gains support for I2S and SPDIF audio interfaces, as
well as the GMAC200 Ethernet controller.
The H616 gains support for the NAND controller.
* tag 'sunxi-dt-for-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
arm64: dts: allwinner: a523: Add SPDIF TX pin on PB and PI pins
arm64: dts: allwinner: a523: Add I2S2 pins on PI pin group
arm64: dts: allwinner: a523: Add device nodes for I2S controllers
arm64: dts: allwinner: a523: Add device node for SPDIF block
arm64: dts: allwinner: a523: Add DMA controller device nodes
dt-bindings: dma: allwinner,sun50i-a64-dma: Add compatibles for A523
arm64: dts: allwinner: h616: add NAND controller
arm64: dts: allwinner: t527: orangepi-4a: Enable Ethernet port
arm64: dts: allwinner: t527: avaota-a1: enable second Ethernet port
arm64: dts: allwinner: a527: cubie-a5e: Enable second Ethernet port
arm64: dts: allwinner: a523: Add GMAC200 ethernet controller
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
arm64: tegra: Device tree changes for v6.19-rc1
This contains a bunch of additions and improvements for older devices.
Tegra210 devices now have empty reserved-memory nodes to improve inter-
operability with certain bootloaders. These chips now also support more
multimedia engines. A new variant of the Jetson Nano is also added.
Jetson TX2 sees some improvements. PCI endpoint mode is improved for
Tegra234 so that reset interrupts are properly routed.
A new RTC device is added starting with Orin.
Rounding things off is a flurry of small fixes for DT validation and USB
OTG mode.
* tag 'tegra-for-6.19-arm64-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux: (25 commits)
arm64: tegra: Remove OTG ID GPIO from Jetson TX2 NX
arm64: tegra: Set USB Micro-B port to OTG mode on P3450
arm64: tegra: Add NVJPG node for Tegra210 platforms
arm64: tegra: Add Tegra210 NVJPG power-domain node
arm64: tegra: Add interrupts for Tegra234 USB wake events
arm64: tegra: Add reserved-memory node for P2180
arm64: tegra: Add reserved-memory node for P3450
arm64: tegra: Enable NVDEC and NVENC on Tegra210
arm64: tegra: Fix APB DMA controller node name
arm64: tegra: Add default GIC address cells on Tegra210
arm64: tegra: Add default GIC address cells on Tegra194
arm64: tegra: Add default GIC address cells on Tegra186
arm64: tegra: Add default GIC address cells on Tegra132
arm64: tegra: Add OPP tables on Tegra210
arm64: tegra: Add interconnect properties for Tegra210
arm64: tegra: Add ACTMON on Tegra210
arm64: tegra: Add device-tree node for NVVRS RTC
arm64: tegra: Move avdd-dsi-csi-supply into CSI node
arm64: tegra: Drop redundant clock and reset names from TSEC node
arm64: tegra: Move HDA into the correct bus
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
ARM: tegra: Device tree changes for v6.19-rc1
Add more host1x devices on Tegra114 and Tegra124, as well as CSI for
Tegra20 and Tegra30. Support for the Xiaomi Mi Pad is also added.
* tag 'tegra-for-6.19-arm-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
ARM: tegra: Add device-tree for Xiaomi Mi Pad (A0101)
ARM: tegra: add CSI nodes for Tegra20 and Tegra30
ARM: tegra: Add missing HOST1X device nodes on Tegra124
ARM: tegra: Add missing HOST1X device nodes on Tegra114
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
dt-bindings: Changes for v6.19-rc1
Document various new IPs on older chips, as well as some existing
developer kits that were missing compatible strings. Add power domain
IDs on Tegra264 and wake-up support for the XUSB controller on Tegra234.
* tag 'tegra-for-6.19-dt-bindings' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
dt-bindings: usb: Add wake-up support for Tegra234 XUSB host controller
dt-bindings: devfreq: tegra30-actmon: Add Tegra124 fallback for Tegra210
dt-bindings: display: tegra: Document Tegra20 and Tegra30 CSI
dt-bindings: display: tegra: document EPP, ISP, MPE and TSEC for Tegra114+
dt-bindings: arm: tegra: Document Jetson Nano Devkits
dt-bindings: power: Add power domain IDs for Tegra264
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
MediaTek ARM64 Device Tree updates
This adds support for new boards and variants based on different
already supported MediaTek SoCs, and improves support for current
boards.
In particular:
- New machines:
- MT7988 BananaPi R4 Pro eMMC and SD router board with support
for both Key-M and Key-E M.2 slots through DTB Overlays
- MT8370 Grinn GenioSBC-510 (GenioSOM-510 + GenioBoard Edge AI)
- MT8390 Grinn GenioSBC-700 (GenioSOM-700 + GenioBoard Edge AI)
- New variant: MT8395 MediaTek Genio 1200 EVK with UFS
...preparation for new SoCs (MT8196 Kompanio Ultra, a clone of the
MT6991 Dimensity 9400, and MT6878 Dimensity 7300) with the
addition of GCE/PIO definitions
...improvements for already supported SoCs and machines:
- MT7622/7981b/7986a/7988a gain support for reading SoC UUID from
eFuse, used to generate a persistent MAC address on boards that
don't have any factory-assigned addresses.
- MT7986 BananaPi R3 gets changes to its default fan PWM speed to
improve compatibility with cheaper fans (usually coming with the
heatsink+fan combos)
- The MT7981b OpenWRT One router sees general support improvements
with the enablement of its UART-0 console and correct pinmuxing
for the same, addition of reserved memory for Trusted Firmware A,
its SPI NOR Flash (for recovery system, WiFi eeprom data and ETH
MAC address from factory), and board LEDs.
- MT8365 gets support for its Mali G52 MC1 GPU, which gets enabled
in the MediaTek Genio 350 EVK board
...and a dt-bindings warning fix for MT8183 machines through trivial
changes to rename the audiosys and afe nodes to reflect bindings.
* tag 'mtk-dts64-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/mediatek/linux: (27 commits)
arm64: dts: mediatek: mt7981b-openwrt-one: Enable software leds
arm64: dts: mediatek: mt7981b-openwrt-one: Enable SPI NOR
arm64: dts: mediatek: mt7988a-bpi-r4pro: Add mmc overlays
arm64: dts: mediatek: mt7988a-bpi-r4-pro: Add PCIe overlays
arm64: dts: mediatek: mt7988: Add devicetree for BananaPi R4 Pro
arm64: dts: mediatek: mt7988: Disable 2.5G phy and enable at board layer
dt-bindings: arm: mediatek: add BPI-R4 Pro board
arm64: dts: mediatek: Add GCE header for MT8196
arm64: dts: mediatek: mt7981b: Add reserved memory for TF-A
arm64: dts: mediatek: mt7981b: Configure UART0 pinmux
arm64: dts: mediatek: mt8365-evk: Enable GPU support
arm64: dts: mediatek: mt8365: Add GPU support
arm64: dts: mediatek: mt8395-genio-1200-evk: Describe CPU supplies
arm64: dts: mediatek: Add MT6878 pinmux macro header file
arm64: dts: mediatek: mt7986-bpi-r3: Change fan PWM value for mid speed
arm64: dts: mediatek: mt8370-grinn-genio-510-sbc: Add Grinn GenioSBC-510
arm64: dts: mediatek: mt8390-genio-700-evk: Add Grinn GenioSBC-700
arm64: dts: mediatek: mt7988a: add 'soc-uuid' cell to efuse
arm64: dts: mediatek: mt7981b: add 'soc-uuid' cell to efuse
arm64: dts: mediatek: mt7986a: add 'soc-uuid' cell to efuse
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
MediaTek ARM32 Device Tree updates
This performs a cleanup of the MT6582 devicetrees and adds support
for secondary cores bringup on this SoC.
This also introduces basic support for a new machine, the MT6582
Alcatel "yarisxl" Pop C7 (OT-7041D) smartphone, with support for
booting into a initramfs with UART console output.
* tag 'mtk-dts32-for-v6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/mediatek/linux:
ARM: dts: mediatek: drop wrong syscon hifsys compatible for MT2701/7623
ARM: dts: mediatek: add basic support for Alcatel yarisxl board
dt-bindings: arm: mediatek: Add MT6582 yarisxl
ARM: dts: mediatek: mt6582: add enable-method property to cpus
ARM: dts: mediatek: mt6582: add clock-names property to uart nodes
ARM: dts: mediatek: mt6582: add mt6582 compatible to timer
ARM: dts: mediatek: mt6582: remove compatible property from root node
ARM: dts: mediatek: mt6582: sort nodes and properties
ARM: dts: mediatek: mt6582: move MMIO devices under soc node
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
T-HEAD Devicetrees for v6.19
Add PWM controlled fan and it's associated thermal management for the
Lichee Pi 4A board.
Enable additional ISA extenstions supported by the T-Head C910 cores:
Zfh, Ziccrse, XTheadvector.
Add reset controllers of more TH1520 subsystems: AP, AO, DSP, MISC, VI.
Signed-off-by: Drew Fustini <fustini@kernel.org>
* tag 'thead-dt-for-v6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/fustini/linux:
riscv: dts: thead: Add reset controllers of more subsystems for TH1520
riscv: dts: thead: Add PWM fan and thermal control
riscv: dts: thead: Add PWM controller node
riscv: dts: thead: add zfh for th1520
riscv: dts: thead: add ziccrse for th1520
riscv: dts: thead: add xtheadvector to the th1520 devicetree
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
First batch of ASPEED Arm devicetree changes for 6.19
Significant changes:
- The IBM Power11 FSI DTSIs have been rearranged to accommodate new systems
New platforms:
- IBM Balcones
The Balcones system is similar to Bonnell but with a POWER11 processor.
Like POWER10, the POWER11 is a dual-chip module, so a dual chip FSI
tree is needed.
- Meta Yosemite5
The Yosemite5 platform provides monitoring of voltages, power,
temperatures, and other critical parameters across the motherboard,
CXL board, E1.S expansion board, and NIC components.
Updated platforms:
- clemente (Meta): LEDs, shunt resistor configuration
- santabarbara (Meta): AMD APML, EEPROMs, LEDs, GPIO line names, MCTP for NICs
There are a scattering of one-off changes and devicetree cleanups for other
platforms as well.
* tag 'aspeed-6.19-devicetree-0' of https://git.kernel.org/pub/scm/linux/kernel/git/bmc/linux:
ARM: dts: aspeed: santabarbara: Add eeprom device node for PRoT module
ARM: dts: aspeed: santabarbara: Add AMD APML interface support
ARM: dts: aspeed: santabarbara: Add gpio line name
ARM: dts: aspeed: santabarbara: Add bmc_ready_noled Led
ARM: dts: aspeed: santabarbara: Enable MCTP for frontend NIC
ARM: dts: aspeed: santabarbara: Add sensor support for extension boards
ARM: dts: aspeed: santabarbara: Add blank lines between nodes for readability
ARM: dts: aspeed: yosemite5: Add Meta Yosemite5 BMC
dt-bindings: arm: aspeed: add Meta Yosemite5 board
ARM: dts: aspeed: clemente: Add HDD LED GPIO
ARM: dts: aspeed: Fix max31785 fan properties
ARM: dts: aspeed: Add Balcones system
dt-bindings: arm: aspeed: add IBM Bonnell board
dt-bindings: arm: aspeed: add IBM Balcones board
ARM: dts: aspeed: harma: Add MCTP I2C controller node
ARM: dts: aspeed: yosemite4: allocate ramoops for kernel panic
ARM: dts: aspeed: clemente: add shunt-resistor-micro-ohms for LM5066i
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
PXA1908 DT changes for 6.19
Rollup of hardware support which has accumulated since support for the
SoC and coreprimevelte board was merged. This most notably includes
eMMC, PMIC, backlight and touchscreen. A few QoL fixes are also
included.
* tag 'pxa1908-dt-for-6.19' of https://gitlab.com/pxa1908-mainline/linux:
arm64: dts: marvell: pxa1908: Add power domains
arm64: dts: marvell: samsung,coreprimevelte: Add USB connector
arm64: dts: marvell: samsung,coreprimevelte: Fill in memory node
arm64: dts: marvell: samsung,coreprimevelte: Drop some reserved memory
arm64: dts: marvell: pxa1908: Move ramoops to SoC dtsi
arm64: dts: marvell: samsung,coreprimevelte: Add vibrator
arm64: dts: marvell: pxa1908: Add PWMs
arm64: dts: marvell: samsung,coreprimevelte: Enable eMMC
arm64: dts: marvell: samsung,coreprimevelte: Correct CD GPIO
arm64: dts: marvell: samsung,coreprimevelte: Add backlight
arm64: dts: samsung,coreprimevelte: add SDIO
arm64: dts: samsung,coreprimevelte: add touchscreen
arm64: dts: samsung,coreprimevelte: add PMIC
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Tenstorrent device tree for v6.19
Add Tenstorrent as a vendor and enable support for the Blackhole SoC
in Blackhole P100 and P150 PCIe cards. The SoC contains four RISC-V
CPU tiles consisting of 4x SiFive X280 cores.
There is a virtual UART implemented in OpenSBI firmware that allows a
console program on the PCIe host to communicate through shared memory
with Linux running on the Blackhole card.
Link: https://github.com/tenstorrent/tt-bh-linux
Link: https://github.com/tenstorrent/opensbi/
Signed-off-by: Drew Fustini <fustini@kernel.org>
* tag 'tenstorrent-dt-for-v6.19' of https://github.com/tenstorrent/linux:
riscv: defconfig: Enable Tenstorrent SoCs
riscv: Kconfig.socs: Add ARCH_TENSTORRENT for Tenstorrent SoCs
riscv: dts: Add Tenstorrent Blackhole SoC PCIe cards
dt-bindings: interrupt-controller: Add Tenstorrent Blackhole compatible
dt-bindings: timers: Add Tenstorrent Blackhole compatible
dt-bindings: riscv: cpus: Add SiFive X280 compatible
dt-bindings: riscv: Add Tenstorrent Blackhole compatible
dt-bindings: vendor-prefixes: Add Tenstorrent AI ULC
The VSIE code currently checks that the BSCA struct fits within
a page, and returns a validity exception 0x003b if it doesn't.
The BSCA is pinned in memory rather than shadowed (see block
comment at end of kvm_s390_cpu_feat_init()), so enforcing the
CPU entries to be on the same pinned page makes sense.
Except those entries aren't going to be used below the guest,
and according to the definition of that validity exception only
the header of the BSCA (everything but the CPU entries) needs to
be within a page. Adjust the alignment check to account for that.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Setting KVM_CAP_S390_USER_OPEREXEC will forward all operation
exceptions to user space. This also includes the 0x0000 instructions
managed by KVM_CAP_S390_USER_INSTR0. It's helpful if user space wants
to emulate instructions which do not (yet) have an opcode.
While we're at it refine the documentation for
KVM_CAP_S390_USER_INSTR0.
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Fix the following warning with W=1:
arch/arm64/boot/dts/ti/k3-am62l.dtsi:101.30-112.5: Warning (simple_bus_reg): /bus@f0000/bus@43000000: simple-bus unit address format error, expected "a80000"
While at that, also remove extra space b/w label and node name.
Fixes: 5f016758b0 ("arm64: dts: ti: k3-am62l: add initial infrastructure")
Link: https://patch.msgid.link/20251120143419.223238-1-vigneshr@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The SoC pin Y1 is incorrectly defined in the WKUP Pinmux device-tree node
(pinctrl@4301c000) leading to the following silent failure:
pinctrl-single 4301c000.pinctrl: mux offset out of range: 0x1dc (0x178)
According to the datasheet for the J721E SoC [0], the pin Y1 belongs to the
MAIN Pinmux device-tree node (pinctrl@11c000). This is confirmed by the
address of the pinmux register for it on page 142 of the datasheet which is
0x00011C1DC.
Hence fix it.
[0]: https://www.ti.com/lit/ds/symlink/tda4vm.pdf
Fixes: 97b67cc102 ("arm64: dts: ti: k3-j721e-sk: Add DT nodes for power regulators")
Cc: stable@vger.kernel.org
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Yemike Abhilash Chandra <y-abhilashchandra@ti.com>
Link: https://patch.msgid.link/20251119160148.2752616-1-s-vadapalli@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
It's a requirement that DT overlays be applied at build time in order to
validate them as overlays are not validated on their own.
Add the missing TI overlays. Some of the TI overlays have the first part
needed (a "*-dtbs" variable), but not the second part adding the target to
dtb-y/dtb- variable.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
[vigneshr@ti.com: create new target for J721e GESI EVM]
Link: https://patch.msgid.link/20251120141936.190796-1-vigneshr@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Add a selftest that verifies KVM's ability to save and restore
nested state when the L1 guest is using 5-level paging and the L2
guest is using 4-level paging. Specifically, canonicality tests of
the VMCS12 host-state fields should accept 57-bit virtual addresses.
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20251028225827.2269128-5-jmattson@google.com
[sean: rename to vmx_nested_la57_state_test to prep nested_<test> namespace]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Currently, on Radxa boards, the power LED is turned on immediately
after power-up, independent of software control. The heartbeat LED and
other available LEDs are subsequently turned on by the initial
software, such as U-Boot, to indicate software is running.
However, the device tree description for this behavior is inconsistent
and fragmented, with definitions split between the main Linux DTS
files and separate U-Boot files (u-boot/arch/arm/dts/*-u-boot.dtsi).
This patch addresses the inconsistency for the power LED by using
default-state = "on" instead of linux,default-trigger = "default-on".
Signed-off-by: FUKAUMI Naoki <naoki@radxa.com>
Reviewed-by: Dragan Simic <dsimic@manjaro.org>
Link: https://patch.msgid.link/20251113124222.4691-2-naoki@radxa.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Add the pin definition for the host wake interrupt on the Indiedroid
Nova. This necessitates adding a node for the wifi controller to
properly define the interrupt. Additionally, we can consolidate both
pinctrl definitions under a wifi node to note their common functionality.
Signed-off-by: Chris Morgan <macromorgan@hotmail.com>
Link: https://patch.msgid.link/20251118223048.4531-5-macroalpha82@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Assert that we correctly merge VMAs containing VM_SOFTDIRTY flags now that
we correctly handle these as sticky.
In order to do so, we have to account for the fact the pagemap interface
checks soft dirty PTEs and additionally that newly merged VMAs are marked
VM_SOFTDIRTY.
We do this by using use unfaulted anon VMAs, establishing one and clearing
references on that one, before establishing another and merging the two
before checking that soft-dirty is propagated as expected.
We check that this functions correctly with mremap() and mprotect() as
sample cases, because VMA merge of adjacent newly mapped VMAs will
automatically be made soft-dirty due to existing logic which does so.
We are therefore exercising other means of merging VMAs.
Link: https://lkml.kernel.org/r/d5a0f735783fb4f30a604f570ede02ccc5e29be9.1763399675.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Andrey Vagin <avagin@gmail.com>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "make VM_SOFTDIRTY a sticky VMA flag", v2.
Currently we set VM_SOFTDIRTY when a new mapping is set up (whether by
establishing a new VMA, or via merge) as implemented in __mmap_complete()
and do_brk_flags().
However, when performing a merge of existing mappings such as when
performing mprotect(), we may lose the VM_SOFTDIRTY flag.
Now we have the concept of making VMA flags 'sticky', that is that they
both don't prevent merge and, importantly, are propagated to merged VMAs,
this seems a sensible alternative to the existing special-casing of
VM_SOFTDIRTY.
We additionally add a self-test that demonstrates that this logic behaves
as expected.
This patch (of 2):
Currently we set VM_SOFTDIRTY when a new mapping is set up (whether by
establishing a new VMA, or via merge) as implemented in __mmap_complete()
and do_brk_flags().
However, when performing a merge of existing mappings such as when
performing mprotect(), we may lose the VM_SOFTDIRTY flag.
This is because currently we simply ignore VM_SOFTDIRTY for the purposes
of merge, so one VMA may possess the flag and another not, and whichever
happens to be the target VMA will be the one upon which the merge is
performed which may or may not have VM_SOFTDIRTY set.
Now we have the concept of 'sticky' VMA flags, let's make VM_SOFTDIRTY one
which solves this issue.
Additionally update VMA userland tests to propagate changes.
[akpm@linux-foundation.org: update comments, per Lorenzo]
Link: https://lkml.kernel.org/r/0019e0b8-ee1e-4359-b5ee-94225cbe5588@lucifer.local
Link: https://lkml.kernel.org/r/cover.1763399675.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/955478b5170715c895d1ef3b7f68e0cd77f76868.1763399675.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Acked-by: Andrey Vagin <avagin@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: misc cleanups".
Yet another batch of misc cleanups and refactoring for DAMON code, tests,
and documents.
First two patches (1and 2) rename DAMOS core filters related code for
readability.
Three following patches (3-5) refactor page table walk callback functions
in DAMON, as suggested by Hugh and David, and I promised.
Next two patches (6 and 7) refactor DAMON core layer kunit test and sysfs
interface selftest to be simple and deduplicated.
Final two patches (8 and 9) fix up sphinx and grammatical errors on
documents.
This patch (of 9):
DAMOS filters handled by the core layer are called core filters, while
those handled by the ops layer are called ops filters. They share the
same type but are managed in different places since core filters are
evaluated before the ops filters. They also have different helper
functions that depend on their managed places.
The helper functions for ops filters have '_ops_' keyword on their name,
so it is easy to know they are for ops filters. Meanwhile, the helper
functions for core filters are not having the 'core' keyword on their
name. This makes it easy to be mistakenly used for ops filters. Actually
there was such a bug.
To avoid future mistakes from similar confusions, rename DAMOS core
filters helper functions to have a keyword 'core' on their names.
Link: https://lkml.kernel.org/r/20251112154114.66053-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251112154114.66053-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Bill Wendling <morbo@google.com>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon/tests: add more tests for online parameters commit".
A DAMON feature called parameters "commit" allows DAMON API callers and
ABI users to update nearly every DAMON parameter while DAMON is running.
This is being used for flexible DAMON use cases such as taking a snapshot
of the monitoring results with minimum overhead, or adjusting access-aware
system operations (DAMOS) for user-space driven auto-tuning or
investigations.
Compared to the usefulness of the feature and size of the implementation,
the test coverage is pretty small. Only the filter commit part has a
single test case, namely damos_test_commit_filter(). Actually, we found
and fixed a few bugs of the feature in the past. The single existing test
was also added to avoid reintroduction of a found bug.
Add more unit tests for the feature.
First four patches (1-4) refactor and extend the existing test for DAMOS
filter commit for multiple test cases.
Next three patches (5-7) add tests for DAMOS quota commit.
Next two patches (8 and 9) refactor damos_commit_dests() for ease of code
reading and test writing, and implement a new unit test of the function
that is being refactored in a test-friendly way.
Final two patches (10 and 11) further add new unit tests for
damos_commit() and damon_commit_target_regions().
This patch (of 11):
damos_test_commit_filter() is dynamically allocating test-purpose DAMOS
filters. Allocation failure checks are making the code longer,
complicated, and difficult to extend for more test cases. Refactor the
code to remove the dynamic allocation.
Link: https://lkml.kernel.org/r/20251111184415.141757-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251111184415.141757-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "vma_start_write_killable"", v2.
When we added the VMA lock, we made a major oversight in not adding a
killable variant. That can run us into trouble where a thread takes the
VMA lock for read (eg handling a page fault) and then goes out to lunch
for an hour (eg doing reclaim). Another thread tries to modify the VMA,
taking the mmap_lock for write, then attempts to lock the VMA for write.
That blocks on the first thread, and ensures that every other page fault
now tries to take the mmap_lock for read. Because everything's in an
uninterruptible sleep, we can't kill the task, which makes me angry.
This patchset just adds vma_start_write_killable() and converts one caller
to use it. Most users are somewhat tricky to convert, so expect follow-up
individual patches per call-site which need careful analysis to make sure
we've done proper cleanup.
This patch (of 2):
The vma can be held read-locked for a substantial period of time, eg if
memory allocation needs to go into reclaim. It's useful to be able to
send fatal signals to threads which are waiting for the write lock.
Link: https://lkml.kernel.org/r/20251110203204.1454057-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20251110203204.1454057-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Chris Li <chriscli@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We only need to keep the page table stable so we can perform this
operation under the VMA lock. PTE installation is stabilised via the PTE
lock.
One caveat is that, if we prepare vma->anon_vma we must hold the mmap read
lock. We can account for this by adapting the VMA locking logic to
explicitly check for this case and prevent a VMA lock from being acquired
should it be the case.
This check is safe, as while we might be raced on anon_vma installation,
this would simply make the check conservative, there's no way for us to
see an anon_vma and then for it to be cleared, as doing so requires the
mmap/VMA write lock.
We abstract the VMA lock validity logic to is_vma_lock_sufficient() for
this purpose, and add prepares_anon_vma() to abstract the anon_vma logic.
In order to do this we need to have a way of installing page tables
explicitly for an identified VMA, so we export walk_page_range_vma() in an
unsafe variant - walk_page_range_vma_unsafe() and use this should the VMA
read lock be taken.
We additionally update the comments in madvise_guard_install() to more
accurately reflect the cases in which the logic may be reattempted,
specifically THP huge pages being present.
Link: https://lkml.kernel.org/r/cca1edbd99cd1386ad20556d08ebdb356c45ef91.1762795245.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: perform guard region install/remove under VMA lock", v2.
There is no reason why can't perform guard region operations under the VMA
lock, as long we take proper precautions to ensure that we do so in a safe
manner.
This is fine, as VMA lock acquisition is always best-effort, so if we are
unable to do so, we can simply fall back to using the mmap read lock.
Doing so will reduce mmap lock contention for callers performing guard
region operations and help establish a precedent of trying to use the VMA
lock where possible.
As part of this change we perform a trivial rename of page walk functions
which bypass safety checks (i.e. whether or not mm_walk_ops->install_pte
is specified) in order that we can keep naming consistent with the mm
walk.
This is because we need to expose a VMA-specific walk that still allows us
to install PTE entries.
This patch (of 2):
Make it clear we're referencing an unsafe variant of this function
explicitly.
This is laying the foundation for exposing more such functions and
maintaining a consistent naming scheme.
As a part of this change, rename check_ops_valid() to check_ops_safe() for
consistency.
Link: https://lkml.kernel.org/r/cover.1762795245.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/c684d91464a438d6e31172c9450416a373f10649.1762795245.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: SeongJae Park <sj@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now we have established the VM_MAYBE_GUARD flag and added the capacity to
set it atomically, do so upon MADV_GUARD_INSTALL.
The places where this flag is used currently and matter are:
* VMA merge - performed under mmap/VMA write lock, therefore excluding
racing writes.
* /proc/$pid/smaps - can race the write, however this isn't meaningful
as the flag write is performed at the point of the guard region being
established, and thus an smaps reader can't reasonably expect to avoid
races. Due to atomicity, a reader will observe either the flag being
set or not. Therefore consistency will be maintained.
In all other cases the flag being set is irrelevant and atomicity
guarantees other flags will be read correctly.
Note that non-atomic updates of unrelated flags do not cause an issue with
this flag being set atomically, as writes of other flags are performed
under mmap/VMA write lock, and these atomic writes are performed under
mmap/VMA read lock, which excludes the write, avoiding RMW races.
Note that we do not encounter issues with KCSAN by adjusting this flag
atomically, as we are only updating a single bit in the flag bitmap and
therefore we do not need to annotate these changes.
We intentionally set this flag in advance of actually updating the page
tables, to ensure that any racing atomic read of this flag will only
return false prior to page tables being updated, to allow for
serialisation via page table locks.
Note that we set vma->anon_vma for anonymous mappings. This is because
the expectation for anonymous mappings is that an anon_vma is established
should they possess any page table mappings. This is also consistent with
what we were doing prior to this patch (unconditionally setting anon_vma
on guard region installation).
We also need to update retract_page_tables() to ensure that madvise(...,
MADV_COLLAPSE) doesn't incorrectly collapse file-backed ranges contain
guard regions.
This was previously guarded by anon_vma being set to catch MAP_PRIVATE
cases, but the introduction of VM_MAYBE_GUARD necessitates that we check
this flag instead.
We utilise vma_flag_test_atomic() to do so - we first perform an
optimistic check, then after the PTE page table lock is held, we can check
again safely, as upon guard marker install the flag is set atomically
prior to the page table lock being taken to actually apply it.
So if the initial check fails either:
* Page table retraction acquires page table lock prior to VM_MAYBE_GUARD
being set - guard marker installation will be blocked until page table
retraction is complete.
OR:
* Guard marker installation acquires page table lock after setting
VM_MAYBE_GUARD, which raced and didn't pick this up in the initial
optimistic check, blocking page table retraction until the guard regions
are installed - the second VM_MAYBE_GUARD check will prevent page table
retraction.
Either way we're safe.
We refactor the retraction checks into a single
file_backed_vma_is_retractable(), there doesn't seem to be any reason that
the checks were separated as before.
Note that VM_MAYBE_GUARD being set atomically remains correct as
vma_needs_copy() is invoked with the mmap and VMA write locks held,
excluding any race with madvise_guard_install().
Link: https://lkml.kernel.org/r/e9e9ce95b6ac17497de7f60fc110c7dd9e489e8d.1763460113.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It is useful to be able to designate that certain flags are 'sticky', that
is, if two VMAs are merged one with a flag of this nature and one without,
the merged VMA sets this flag.
As a result we ignore these flags for the purposes of determining VMA flag
differences between VMAs being considered for merge.
This patch therefore updates the VMA merge logic to perform this action,
with flags possessing this property being described in the VM_STICKY
bitmap.
Those flags which ought to be ignored for the purposes of VMA merge are
described in the VM_IGNORE_MERGE bitmap, which the VMA merge logic is also
updated to use.
As part of this change we place VM_SOFTDIRTY in VM_IGNORE_MERGE as it
already had this behaviour, alongside VM_STICKY as sticky flags by
implication must not disallow merge.
Ultimately it seems that we should make VM_SOFTDIRTY a sticky flag in its
own right, but this change is out of scope for this series.
The only sticky flag designated as such is VM_MAYBE_GUARD, so as a result
of this change, once the VMA flag is set upon guard region installation,
VMAs with guard ranges will now not have their merge behaviour impacted as
a result and can be freely merged with other VMAs without VM_MAYBE_GUARD
set.
Also update the comments for vma_modify_flags() to directly reference
sticky flags now we have established the concept.
We also update the VMA userland tests to account for the changes.
Link: https://lkml.kernel.org/r/22ad5269f7669d62afb42ce0c79bad70b994c58d.1763460113.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "introduce VM_MAYBE_GUARD and make it sticky", v4.
Currently, guard regions are not visible to users except through
/proc/$pid/pagemap, with no explicit visibility at the VMA level.
This makes the feature less useful, as it isn't entirely apparent which
VMAs may have these entries present, especially when performing actions
which walk through memory regions such as those performed by CRIU.
This series addresses this issue by introducing the VM_MAYBE_GUARD flag
which fulfils this role, updating the smaps logic to display an entry for
these.
The semantics of this flag are that a guard region MAY be present if set
(we cannot be sure, as we can't efficiently track whether an
MADV_GUARD_REMOVE finally removes all the guard regions in a VMA) - but if
not set the VMA definitely does NOT have any guard regions present.
It's problematic to establish this flag without further action, because
that means that VMAs with guard regions in them become non-mergeable with
adjacent VMAs for no especially good reason.
To work around this, this series also introduces the concept of 'sticky'
VMA flags - that is flags which:
a. if set in one VMA and not in another still permit those VMAs to be
merged (if otherwise compatible).
b. When they are merged, the resultant VMA must have the flag set.
The VMA logic is updated to propagate these flags correctly.
Additionally, VM_MAYBE_GUARD being an explicit VMA flag allows us to solve
an issue with file-backed guard regions - previously these established an
anon_vma object for file-backed mappings solely to have vma_needs_copy()
correctly propagate guard region mappings to child processes.
We introduce a new flag alias VM_COPY_ON_FORK (which currently only
specifies VM_MAYBE_GUARD) and update vma_needs_copy() to check explicitly
for this flag and to copy page tables if it is present, which resolves
this issue.
Additionally, we add the ability for allow-listed VMA flags to be
atomically writable with only mmap/VMA read locks held.
The only flag we allow so far is VM_MAYBE_GUARD, which we carefully ensure
does not cause any races by being allowed to do so.
This allows us to maintain guard region installation as a read-locked
operation and not endure the overhead of obtaining a write lock here.
Finally we introduce extensive VMA userland tests to assert that the
sticky VMA logic behaves correctly as well as guard region self tests to
assert that smaps visibility is correctly implemented.
This patch (of 9):
Currently, if a user needs to determine if guard regions are present in a
range, they have to scan all VMAs (or have knowledge of which ones might
have guard regions).
Since commit 8e2f2aeb8b ("fs/proc/task_mmu: add guard region bit to
pagemap") and the related commit a516403787 ("fs/proc: extend the
PAGEMAP_SCAN ioctl to report guard regions"), users can use either
/proc/$pid/pagemap or the PAGEMAP_SCAN functionality to perform this
operation at a virtual address level.
This is not ideal, and it gives no visibility at a /proc/$pid/smaps level
that guard regions exist in ranges.
This patch remedies the situation by establishing a new VMA flag,
VM_MAYBE_GUARD, to indicate that a VMA may contain guard regions (it is
uncertain because we cannot reasonably determine whether a
MADV_GUARD_REMOVE call has removed all of the guard regions in a VMA, and
additionally VMAs may change across merge/split).
We utilise 0x800 for this flag which makes it available to 32-bit
architectures also, a flag that was previously used by VM_DENYWRITE, which
was removed in commit 8d0920bde5 ("mm: remove VM_DENYWRITE") and hasn't
bee reused yet.
We also update the smaps logic and documentation to identify these VMAs.
Another major use of this functionality is that we can use it to identify
that we ought to copy page tables on fork.
We do not actually implement usage of this flag in mm/madvise.c yet as we
need to allow some VMA flags to be applied atomically under mmap/VMA read
lock in order to avoid the need to acquire a write lock for this purpose.
Link: https://lkml.kernel.org/r/cover.1763460113.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/cf8ef821eba29b6c5b5e138fffe95d6dcabdedb9.1763460113.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Following the extraction of sysfs code, this patch moves the sysctl
interface implementation into a dedicated file to further improve code
organization and maintainability of the hugetlb subsystem.
The following components are moved to mm/hugetlb_sysctl.c:
- proc_hugetlb_doulongvec_minmax()
- hugetlb_sysctl_handler_common()
- hugetlb_sysctl_handler()
- hugetlb_mempolicy_sysctl_handler() (CONFIG_NUMA)
- hugetlb_overcommit_handler()
- hugetlb_table[] sysctl table definition
- hugetlb_sysctl_init()
The hugetlb_internal.h header file is updated to declare the sysctl
initialization function with proper #ifdef guards for configurations
without CONFIG_SYSCTL support.
The Makefile is updated to compile hugetlb_sysctl.o when CONFIG_HUGETLBFS
is enabled. This refactoring reduces the size of hugetlb.c and logically
separates the sysctl interface from core hugetlb management code.
MAINTAINERS is updated to add new file hugetlb_sysctl.c.
No functional changes are introduced; all code is moved as-is from
hugetlb.c with consistent formatting.
Link: https://lkml.kernel.org/r/5bbee7ab5be71d0bb1aebec38642d7e83526bb7a.1762398359.git.zhuhui@kylinos.cn
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/hugetlb: refactor sysfs/sysctl interfaces", v5.
hugetlb.c has grown significantly and become difficult to maintain. This
patch series extracts the sysfs and sysctl interface code into separate
dedicated files to improve code organization.
The refactoring includes:
- Patch 1: Extract sysfs interface into mm/hugetlb_sysfs.c
- Patch 2: Extract sysctl interface into mm/hugetlb_sysctl.c
No functional changes are introduced in this series. The code is moved
as-is, with only minor formatting adjustments for code style consistency.
This should make future maintenance and enhancements to the hugetlb
subsystem easier.
Testing: The patch series has been compile-tested and maintains the same
functionality as the original code.
This patch (of 2):
Currently, hugetlb.c contains both core management logic and sysfs
interface implementations, making it difficult to maintain. This patch
extracts the sysfs-related code into a dedicated file to improve code
organization.
The following components are moved to mm/hugetlb_sysfs.c:
- sysfs attribute definitions and handlers
- sysfs kobject management functions
- NUMA per-node hstate attribute registration
Several inline helper functions and macros are moved to
mm/hugetlb_internal.h:
- hstate_is_gigantic_no_runtime()
- next_node_allowed()
- get_valid_node_allowed()
- hstate_next_node_to_alloc()
- hstate_next_node_to_free()
- for_each_node_mask_to_alloc/to_free macros
To support code sharing, these functions are changed from static to
exported symbols:
- remove_hugetlb_folio()
- add_hugetlb_folio()
- init_new_hugetlb_folio()
- prep_and_add_allocated_folios()
- demote_pool_huge_page()
- __nr_hugepages_store_common()
The Makefile is updated to compile hugetlb_sysfs.o when CONFIG_HUGETLBFS
is enabled. This maintains all existing functionality while improving
maintainability by separating concerns.
MAINTAINERS is updated to add new file hugetlb_sysfs.c.
Link: https://lkml.kernel.org/r/cover.1762398359.git.zhuhui@kylinos.cn
Link: https://lkml.kernel.org/r/656a03dff7e2bb20e24e841ede81fdca01d21410.1762398359.git.zhuhui@kylinos.cn
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Cc: David Hildenbrand <david@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "some cleanups for pageout()", v2.
Since we no longer attempt to write back filesystem folios in pageout(),
and only tmpfs/shmem folios and anonymous swapcache folios can be written
back, we can remove the redundant folio_test_private() related logic to
simplify the logic of pageout(), as tmpfs/shmem and swapcache folios do
not use the PG_private flag.
This patch (of 2):
The folio_test_private() check in pageout() was introduced by commit
ce91b575332b ("orphaned pagecache memleak fix") in 2005 (checked from a
history tree[1]). As the commit message mentioned, it was to address the
issue where reiserfs pagecache may be truncated while still pinned. To
further explain, the truncation removes the page->mapping, but the page is
still listed in the VM queues because it still has buffers.
In 2008, commit a2b345642f ("Fix dirty page accounting leak with ext3
data=journal") seems to be dealing with a similar issue, where the page
becomes dirty after truncation, and it provides a very useful call stack:
truncate_complete_page()
cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
do_invalidatepage()
ext3_invalidatepage()
journal_invalidatepage()
journal_unmap_buffer()
__dispose_buffer()
__journal_unfile_buffer()
__journal_temp_unlink_buffer()
mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
In this commit a2b345642f, we forcefully clear the page's dirty flag
during truncation (in truncate_complete_page()).
Now it seems this was just a peculiar usage specific to reiserfs. Maybe
reiserfs had some extra refcount on these pages, which caused them to pass
the is_page_cache_freeable() check.
With the fix provided by commit a2b345642f and reiserfs being removed
in 2024 by commit fb6f20ecb1 ("reiserfs: The last commit"), such a case
is unlikely to occur again. So let's remove the redundant
folio_test_private() checks and related buffer_head release logic, and
just leave a warning here to catch such a bug.
[akpm@linux-foundation.org: redo comment, per David]
Link: https://lkml.kernel.org/r/17d1b293-e393-4989-a357-7eea74b3c805@redhat.com
[baolin.wang@linux.alibaba.com: remove comment and WARNing, per Hugh and others]
Link: https://lkml.kernel.org/r/392a9ca3-31ac-4447-bd44-3c656d63e4ca@linux.alibaba.com
Link: https://lkml.kernel.org/r/cover.1758166683.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/9ef0e560dc83650bc538eb5dcd1594e112c1369f.1758166683.git.baolin.wang@linux.alibaba.com
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git [1]
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The TS233 is a 2 bay NAS similar to the TS433. Architecture-wise it really
seems to be the same minus the additional PCIe connected components the
TS433 has.
So it just uses two of the SoCs SATA ports and the SoC's gigabit ethernet.
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20251112214206.423244-6-heiko@sntech.de
The NAS series based around the rk3568 contains a number of models with
1-4 drives, that reuse most of the board structure.
Therefore move the shared parts to a dtsi, to be included by the devices.
As the smallest device is the 1-bay TS133, keep everything > slot1 in
the individual devicetree.
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20251112214206.423244-4-heiko@sntech.de
The MCU's eeprom contains the unit's serial and a number of slots for
mac-addresses. As the MCU seems to be used in different devices, up to
8 mac addresses can live there and the unused slots are actually
initialized with empty mac-address strings like 00:00:00:00:05:09 .
Interestingly on the TS-433, the PCIe ethernet adapter brings its own
memory to hold its mac, and the gmac0 is supposed to get its mac from
the second mac-slot, while the first one stays empty.
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Link: https://patch.msgid.link/20251112214206.423244-3-heiko@sntech.de
Some users of KVM have emulated devices (typically added to private
forks of QEMU) that execute AVX instructions on PCI BARs. Whenever
the guest OS tries to do that, an illegal instruction exception or
emulation failure is triggered.
Add the Avx flag to move instructions:
- (66) 0f 10 - MOVUPS/MOVUPD from memory
- (66) 0f 11 - MOVUPS/MOVUPD to memory
- 66 0f 6f - MOVDQA from memory
- 66 0f 7f - MOVDQA to memory
- f3 0f 6f - MOVDQU from memory
- f3 0f 7f - MOVDQU to memory
- (66) 0f 28 - MOVAPS/MOVAPD from memory
- (66) 0f 29 - MOVAPS/MOVAPD to memory
- (66) 0f 2b - MOVNTPS/MOVNTPD to memory
- 66 0f e7 - MOVNTDQ to memory
- 66 0f 38 2a - MOVNTDQA to memory
Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/kvm/BD108C42-0382-4B17-B601-434A4BD038E7@fb.com/T/
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-11-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add myself as the maintainer of the Anlogic DR1V90 SoC tree, including
the corresponding DTS and DT bindings paths for Anlogic RISC-V-based
SoCs. I don't really want to look after this platform, but am due to
irritation of the vendor's behaviour towards the contributor of support.
Hence, Odd Fixes as the status.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
After all the changes done in the previous patches, the only thing
left to support AVX MOV instructions is to expand the VEX prefix into
the appropriate REX, 66/F3/F2 and map prefixes. Three-operand
instructions are not supported.
The Avx bit in this case is not cleared, in fact it is used as the
sign that the instruction does support VEX encoding. Until it is
added to any instruction, however, the only functional change is
to change some not-implemented instructions to #UD if they correspond
to a VEX prefix with an invalid map.
Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-10-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Restructure how to represent and interpret REX fields, preparing
for handling of both REX2 and VEX.
REX uses the upper four bits of a single byte as a fixed identifier,
and the lower four bits containing the data. VEX and REX2 extends this so
that the first byte identifies the prefix and the rest encode additional
bits; and while VEX only has the same four data bits as REX, eight zero
bits are a valid value for the data bits of REX2. So, stop storing the
REX byte as-is. Instead, store only the low bits of the REX prefix and
track separately whether a REX-like prefix was used.
No functional changes intended.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Message-ID: <20251110180131.28264-11-chang.seok.bae@intel.com>
[Extracted from APX series; removed bitfields and REX2-specific default. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-9-pbonzini@redhat.com
[sean: name REX_{BXRW} enum "rex_bits"]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Prepare struct operand for hosting AVX registers. Remove the
existing, incomplete code that placed the Avx flag in the operand
alignment field, and repurpose the name for a separate bit that
indicates:
- after decode, whether an instruction supports the VEX prefix;
- before writeback, that the instruction did have the VEX prefix and
therefore 1) it can have op_bytes == 32; 2) t should clear high
bytes of XMM registers.
Right now the bit will never be set and the patch has no intended
functional change. However, this is actually more vexing than the
decoder changes itself, and therefore worth separating.
Co-developed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patch.msgid.link/20251114003633.60689-8-pbonzini@redhat.com
[sean: guard ymm[8-15] accesses with #ifdef CONFIG_X86_64]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove all duplicate handling of register operands, including picking
the right register class and fetching it, by extracting a new function
that can be used for both REG and MODRM operands.
Centralize setting op->orig_val = op->val in fetch_register_operand()
as well.
No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
Link: https://patch.msgid.link/20251114003633.60689-6-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
An irresistible microoptimization (changing accesses to Src2 to just an
AND :)) that also frees a bit for AVX in the low flags word. This makes
it closer to SSE since both of them can access XMM registers, pointlessly
shaving another clock cycle or two (maybe).
No functional change intended.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com
Link: https://patch.msgid.link/20251114003633.60689-3-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When a large VM, specifically one that holds a significant number of PTEs,
gets abruptly destroyed, the following warning is seen during the
page-table walk:
sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
Tainted: [O]=OOT_MODULE
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x3c/0xb8
dump_stack+0x18/0x30
resched_latency_warn+0x7c/0x88
sched_tick+0x1c4/0x268
update_process_times+0xa8/0xd8
tick_nohz_handler+0xc8/0x168
__hrtimer_run_queues+0x11c/0x338
hrtimer_interrupt+0x104/0x308
arch_timer_handler_phys+0x40/0x58
handle_percpu_devid_irq+0x8c/0x1b0
generic_handle_domain_irq+0x48/0x78
gic_handle_irq+0x1b8/0x408
call_on_irq_stack+0x24/0x30
do_interrupt_handler+0x54/0x78
el1_interrupt+0x44/0x88
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x84/0x88
stage2_free_walker+0x30/0xa0 (P)
__kvm_pgtable_walk+0x11c/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
kvm_pgtable_walk+0xc4/0x140
kvm_pgtable_stage2_destroy+0x5c/0xf0
kvm_free_stage2_pgd+0x6c/0xe8
kvm_uninit_stage2_mmu+0x24/0x48
kvm_arch_flush_shadow_all+0x80/0xa0
kvm_mmu_notifier_release+0x38/0x78
__mmu_notifier_release+0x15c/0x250
exit_mmap+0x68/0x400
__mmput+0x38/0x1c8
mmput+0x30/0x68
exit_mm+0xd4/0x198
do_exit+0x1a4/0xb00
do_group_exit+0x8c/0x120
get_signal+0x6d4/0x778
do_signal+0x90/0x718
do_notify_resume+0x70/0x170
el0_svc+0x74/0xd8
el0t_64_sync_handler+0x60/0xc8
el0t_64_sync+0x1b0/0x1b8
The warning is seen majorly on the host kernels that are configured
not to force-preempt, such as CONFIG_PREEMPT_NONE=y. To avoid this,
instead of walking the entire page-table in one go, split it into
smaller ranges, by checking for cond_resched() between each range.
Since the path is executed during VM destruction, after the
page-table structure is unlinked from the KVM MMU, relying on
cond_resched_rwlock_write() isn't necessary.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Link: https://msgid.link/20251113052452.975081-4-rananta@google.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
Split kvm_pgtable_stage2_destroy() into two:
- kvm_pgtable_stage2_destroy_range(), that performs the
page-table walk and free the entries over a range of addresses.
- kvm_pgtable_stage2_destroy_pgd(), that frees the PGD.
This refactoring enables subsequent patches to free large page-tables
in chunks, calling cond_resched() between each chunk, to yield the
CPU as necessary.
Existing callers of kvm_pgtable_stage2_destroy(), that probably cannot
take advantage of this (such as nVMHE), will continue to function as is.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Suggested-by: Oliver Upton <oupton@kernel.org>
Link: https://msgid.link/20251113052452.975081-3-rananta@google.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
A subsequent change to the way KVM frees stage-2s will invoke the free
walker on sub-ranges of the VM's IPA space, meaning there's potential
for only partially visiting a table's PTEs.
Split the leaf and table visitors and only drop references on a table
when the page count reaches 1, implying there are no valid PTEs that
need to be visited. Invalidate the table PTE to avoid traversing the
stale reference.
Link: https://msgid.link/20251113052452.975081-2-rananta@google.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
vgic_lpi_stress sends MAPTI and MAPC commands during guest GIC setup to
map interrupt events to ITT entries and collection IDs to
redistributors, respectively.
We have no guarantee that the ITS will finish handling these mapping
commands before the selftest calls KVM_SIGNAL_MSI to inject LPIs to the
guest. If LPIs are injected before ITS mapping completes, the ITS cannot
properly pass the interrupt on to the redistributor.
Fix by adding a SYNC command to the selftests ITS library, then calling
SYNC after ITS mapping to ensure mapping completes before signal_lpi()
writes to GITS_TRANSLATER.
Signed-off-by: Maximilian Dittgen <mdittgen@amazon.de>
Link: https://msgid.link/20251119135744.68552-2-mdittgen@amazon.de
Signed-off-by: Oliver Upton <oupton@kernel.org>
The selftests GIC library and tests assume that the
GICR_TYPER.Processor_number associated with a given CPU is the same as
the CPU's selftest index.
Since this assumption is not guaranteed by specification, add an assert
in gicv3_cpu_init() that validates this is true.
Signed-off-by: Maximilian Dittgen <mdittgen@amazon.de>
Link: https://msgid.link/20251119135744.68552-1-mdittgen@amazon.de
Signed-off-by: Oliver Upton <oupton@kernel.org>
Add selftests to cbo.c to verify Zicbop extension behavior, and split
the previous `--sigill` mode into two options so they can be tested
independently.
The test checks:
- That hwprobe correctly reports Zicbop presence and block size.
- That prefetch instructions execute without exception on valid and NULL
addresses when Zicbop is present.
Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Link: https://patch.msgid.link/20251118162436.15485-3-zihong.plct@isrc.iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>
- Add `RISCV_HWPROBE_EXT_ZICBOP` to report the presence of the
Zicbop extension.
- Add `RISCV_HWPROBE_KEY_ZICBOP_BLOCK_SIZE` to expose the block
size (in bytes) when Zicbop is supported.
- Update hwprobe.rst to document the new extension bit and block
size key, following the existing Zicbom/Zicboz style.
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Yao Zihong <zihong.plct@isrc.iscas.ac.cn>
Link: https://patch.msgid.link/20251118162436.15485-2-zihong.plct@isrc.iscas.ac.cn
[pjw@kernel.org: updated to apply]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
The vector regset uses the maximum possible vlen value to estimate the
.n field. But not all the hardwares support the maximum vlen. Linux
might wastes time to prepare a large memory buffer(about 2^6 pages) for
the vector regset.
The regset can only copy vector registers when the process are using
vector. Add .active callback and determine the n field of vector regset
in riscv_v_setup_ctx_cache() doesn't affect the ptrace syscall and
coredump. It can avoid oversized allocations and better matches real
hardware limits.
Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com>
Reviewed-by: Greentime Hu <greentime.hu@sifive.com>
Reviewed-by: Andy Chiu <andybnac@gmail.com>
Tested-by: Andy Chiu <andybnac@gmail.com>
Link: https://patch.msgid.link/20251013091318.467864-2-yongxuan.wang@sifive.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Use riscv_has_extension_likely() to check for RISCV_ISA_EXT_ZBB,
replacing the use of asm goto with ALTERNATIVE.
The "likely" variant is used to match the behavior of the original
implementation using ALTERNATIVE("j %l[no_zbb]", "nop", ...).
While we're at it, also remove bogus comment about Zbb being likely
available. We have to choose between "likely" and "unlikely" due to
limitations of the asm goto feature, but that does not mean we should
put a bad comment on why we pick "likely" over "unlikely".
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Link: https://patch.msgid.link/20251020-riscv-altn-helper-wip-v4-2-ef941c87669a@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Use riscv_has_extension_unlikely() to check for RISCV_ISA_EXT_SVVPTC,
replacing the use of asm goto with ALTERNATIVE.
The "unlikely" variant is used to match the behavior of the original
implementation using ALTERNATIVE("nop", "j %l[svvptc]", ...).
Note that this makes the check for RISCV_ISA_EXT_SVVPTC a runtime one if
RISCV_ALTERNATIVE=n, but it should still be worthwhile to do so given
that TLB flushes are relatively slow.
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Link: https://patch.msgid.link/20251020-riscv-altn-helper-wip-v4-1-ef941c87669a@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>
The core kernel already supports parallel bringup of secondary
CPUs (aka HOTPLUG_PARALLEL). The x86 and MIPS architectures
already use HOTPLUG_PARALLEL and ARM is also moving toward it.
On RISC-V, there is no arch specific global data accessed in the
RISC-V secondary CPU bringup path so enabling HOTPLUG_PARALLEL for
RISC-V would only require:
1) Providing RISC-V specific arch_cpuhp_kick_ap_alive()
2) Calling cpuhp_ap_sync_alive() from smp_callin()
This patch is tested natively with OpenSBI on QEMU RV64 virt machine
with 64 cores and also tested with KVM RISC-V guest with 32 VCPUs.
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Reviewed-by: Atish Patra <atishp@rivosinc.com>
Link: https://patch.msgid.link/20250905122512.71684-1-apatel@ventanamicro.com
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Move KVM's swapping of PKRU outside of the fastpath loop, as there is no
KVM code anywhere in the fastpath that accesses guest/userspace memory,
i.e. that can consume protection keys.
As documented by commit 1be0e61c1f ("KVM, pkeys: save/restore PKRU when
guest/host switches"), KVM just needs to ensure the host's PKRU is loaded
when KVM (or the kernel at-large) may access userspace memory. And at the
time of commit 1be0e61c1f, KVM didn't have a fastpath, and PKU was
strictly contained to VMX, i.e. there was no reason to swap PKRU outside
of vmx_vcpu_run().
Over time, the "need" to swap PKRU close to VM-Enter was likely falsely
solidified by the association with XFEATUREs in commit 37486135d3
("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c"),
and XFEATURE swapping was in turn moved close to VM-Enter/VM-Exit as a
KVM hack-a-fix ution for an #MC handler bug by commit 1811d979c7
("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
Deferring the PKRU loads shaves ~40 cycles off the fastpath for Intel,
and ~60 cycles for AMD. E.g. using INVD in KVM-Unit-Test's vmexit.c,
with extra hacks to enable CR4.PKE and PKRU=(-1u & ~0x3), latency numbers
for AMD Turin go from ~1560 => ~1500, and for Intel Emerald Rapids, go
from ~810 => ~770.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Link: https://patch.msgid.link/20251118222328.2265758-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move KVM's swapping of XFEATURE masks, i.e. XCR0 and XSS, out of the
fastpath loop now that the guts of the #MC handler runs in task context,
i.e. won't invoke schedule() with preemption disabled and clobber state
(or crash the kernel) due to trying to context switch XSTATE with a mix
of host and guest state.
For all intents and purposes, this reverts commit 1811d979c7 ("x86/kvm:
move kvm_load/put_guest_xcr0 into atomic context"), which papered over an
egregious bug/flaw in the #MC handler where it would do schedule() even
though IRQs are disabled. E.g. the call stack from the commit:
kvm_load_guest_xcr0
...
kvm_x86_ops->run(vcpu)
vmx_vcpu_run
vmx_complete_atomic_exit
kvm_machine_check
do_machine_check
do_memory_failure
memory_failure
lock_page
Commit 1811d979c7 "fixed" the immediate issue of XRSTORS exploding, but
completely ignored that scheduling out a vCPU task while IRQs and
preemption is wildly broken. Thankfully, commit 5567d11c21 ("x86/mce:
Send #MC singal from task work") (somewhat incidentally?) fixed that flaw
by pushing the meat of the work to the user-return path, i.e. to task
context.
KVM has also hardened itself against #MC goofs by moving #MC forwarding to
kvm_x86_ops.handle_exit_irqoff(), i.e. out of the fastpath. While that's
by no means a robust fix, restoring as much state as possible before
handling the #MC will hopefully provide some measure of protection in the
event that #MC handling goes off the rails again.
Note, KVM always intercepts XCR0 writes for vCPUs without protected state,
e.g. there's no risk of consuming a stale XCR0 when determining if a PKRU
update is needed; kvm_load_host_xfeatures() only reads, and never writes,
vcpu->arch.xcr0.
Deferring the XCR0 and XSS loads shaves ~300 cycles off the fastpath for
Intel, and ~500 cycles for AMD. E.g. using INVD in KVM-Unit-Test's
vmexit.c, which an extra hack to enable CR4.OXSAVE, latency numbers for
AMD Turin go from ~2000 => 1500, and for Intel Emerald Rapids, go from
~1300 => ~1000.
Cc: Jon Kohler <jon@nutanix.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Link: https://patch.msgid.link/20251118222328.2265758-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Handle Machine Checks (#MC) that happen on VM-Enter (VMX or TDX) outside
of KVM's fastpath so that as much host state as possible is re-loaded
before invoking the kernel's #MC handler. The only requirement is that
KVM invokes the #MC handler before enabling IRQs (and even that could
_probably_ be related to handling #MCs before enabling preemption).
Waiting to handle #MCs until "more" host state is loaded hardens KVM
against flaws in the #MC handler, which has historically been quite
brittle. E.g. prior to commit 5567d11c21 ("x86/mce: Send #MC singal from
task work"), the #MC code could trigger a schedule() with IRQs and
preemption disabled. That led to a KVM hack-a-fix in commit 1811d979c7
("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
Note, vmx_handle_exit_irqoff() is common to VMX and TDX guests.
Cc: Tony Lindgren <tony.lindgren@linux.intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Jon Kohler <jon@nutanix.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Link: https://patch.msgid.link/20251118222328.2265758-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Handle Machine Checks (#MC) that happen in the guest (by forwarding them
to the host) outside of KVM's fastpath so that as much host state as
possible is re-loaded before invoking the kernel's #MC handler. The only
requirement is that KVM invokes the #MC handler before enabling IRQs (and
even that could _probably_ be relaxed to handling #MCs before enabling
preemption).
Waiting to handle #MCs until "more" host state is loaded hardens KVM
against flaws in the #MC handler, which has historically been quite
brittle. E.g. prior to commit 5567d11c21 ("x86/mce: Send #MC singal from
task work"), the #MC code could trigger a schedule() with IRQs and
preemption disabled. That led to a KVM hack-a-fix in commit 1811d979c7
("x86/kvm: move kvm_load/put_guest_xcr0 into atomic context").
Note, except for #MCs on VM-Enter, VMX already handles #MCs outside of the
fastpath.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Link: https://patch.msgid.link/20251118222328.2265758-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Currently the tracking of the need to flush L1D for L1TF is tracked by
two bits: one per-CPU and one per-vCPU.
The per-vCPU bit is always set when the vCPU shows up on a core, so
there is no interesting state that's truly per-vCPU. Indeed, this is a
requirement, since L1D is a part of the physical CPU.
So simplify this by combining the two bits.
The vCPU bit was being written from preemption-enabled regions. To play
nice with those cases, wrap all calls from KVM and use a raw write so that
request a flush with preemption enabled doesn't trigger what would
effectively be DEBUG_PREEMPT false positives. Preemption doesn't need to
be disabled, as kvm_arch_vcpu_load() will mark the new CPU as needing a
flush if the vCPU task is migrated, or if userspace runs the vCPU on a
different task.
Signed-off-by: Brendan Jackman <jackmanb@google.com>
[sean: put raw write in KVM instead of in a hardirq.h variant]
Link: https://patch.msgid.link/20251113233746.1703361-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disable support for flushing the L1 data cache to mitigate L1TF if CPU
mitigations are disabled for the entire kernel. KVM's mitigation of L1TF
is in no way special enough to justify ignoring CONFIG_CPU_MITIGATIONS=n.
Deliberately use CPU_MITIGATIONS instead of the more precise
MITIGATION_L1TF, as MITIGATION_L1TF only controls the default behavior,
i.e. CONFIG_MITIGATION_L1TF=n doesn't completely disable L1TF mitigations
in the kernel.
Keep the vmentry_l1d_flush module param to avoid breaking existing setups,
and leverage the .set path to alert the user to the fact that
vmentry_l1d_flush will be ignored. Don't bother validating the incoming
value; if an admin misconfigures vmentry_l1d_flush, the fact that the bad
configuration won't be detected when running with CONFIG_CPU_MITIGATIONS=n
is likely the least of their worries.
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Now that VMX encodes its own sequence for clearing CPU buffers, move
VM_CLEAR_CPU_BUFFERS into SVM to minimize the chances of KVM botching a
mitigation in the future, e.g. using VM_CLEAR_CPU_BUFFERS instead of
checking multiple mitigation flags.
No functional change intended.
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251113233746.1703361-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rework the handling of the MMIO Stale Data mitigation to clear CPU buffers
immediately prior to VM-Enter, i.e. in the same location that KVM emits a
VERW for unconditional (at runtime) clearing. Co-locating the code and
using a single ALTERNATIVES_2 makes it more obvious how VMX mitigates the
various vulnerabilities.
Deliberately order the alternatives as:
0. Do nothing
1. Clear if vCPU can access MMIO
2. Clear always
since the last alternative wins in ALTERNATIVES_2(), i.e. so that KVM will
honor the strictest mitigation (always clear CPU buffers) if multiple
mitigations are selected. E.g. even if the kernel chooses to mitigate
MMIO Stale Data via X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO, another mitigation
may enable X86_FEATURE_CLEAR_CPU_BUF_VM, and that other thing needs to win.
Note, decoupling the MMIO mitigation from the L1TF mitigation also fixes
a mostly-benign flaw where KVM wouldn't do any clearing/flushing if the
L1TF mitigation is configured to conditionally flush the L1D, and the MMIO
mitigation but not any other "clear CPU buffers" mitigation is enabled.
For that specific scenario, KVM would skip clearing CPU buffers for the
MMIO mitigation even though the kernel requested a clear on every VM-Enter.
Note #2, the flaw goes back to the introduction of the MDS mitigation. The
MDS mitigation was inadvertently fixed by commit 43fb862de8 ("KVM/VMX:
Move VERW closer to VMentry for MDS mitigation"), but previous kernels
that flush CPU buffers in vmx_vcpu_enter_exit() are affected (though it's
unlikely the flaw is meaningfully exploitable even older kernels).
Fixes: 650b68a062 ("x86/kvm/vmx: Add MDS protection when L1D Flush is not active")
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
TSA mitigation:
d8010d4ba4 ("x86/bugs: Add a Transient Scheduler Attacks mitigation")
introduced VM_CLEAR_CPU_BUFFERS for guests on AMD CPUs. Currently on Intel
CLEAR_CPU_BUFFERS is being used for guests which has a much broader scope
(kernel->user also).
Make mitigations on Intel consistent with TSA. This would help handling the
guest-only mitigations better in future.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
[sean: make CLEAR_CPU_BUF_VM mutually exclusive with the MMIO mitigation]
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When testing for VMLAUNCH vs. VMRESUME, use the copy of @flags from the
stack instead of first moving it to EBX, and then propagating
VMX_RUN_VMRESUME to RFLAGS.CF (because RBX is clobbered with the guest
value prior to the conditional branch to VMLAUNCH). Stashing information
in RFLAGS is gross, especially with the writer and reader being bifurcated
by yet more gnarly assembly code.
Opportunistically drop the SHIFT macros as they existed purely to allow
the VM-Enter flow to use Bit Test.
Suggested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move user_return_msrs allocation/free from vendor modules (kvm-intel.ko and
kvm-amd.ko) (un)loading time to kvm.ko's to make it less risky to access
user_return_msrs in kvm.ko. Tying the lifetime of user_return_msrs to
vendor modules makes every access to user_return_msrs prone to
use-after-free issues as vendor modules may be unloaded at any time.
Opportunistically turn the per-CPU variable into full structs, as there's
no practical difference between statically allocating the memory and
allocating it unconditionally during module_init().
Zero out kvm_nr_uret_msrs on vendor module exit to further minimize the
chances of consuming stale data, and WARN on vendor module load if KVM
thinks there are existing user-return MSRs.
Note! The user-return MSRs also need to be "destroyed" if
ops->hardware_setup() fails, as both SVM and VMX expect common KVM to
clean up (because common code, not vendor code, is responsible for
kvm_nr_uret_msrs).
Signed-off-by: Chao Gao <chao.gao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251108013601.902918-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
RESET_CONTROL_FLAGS_BIT_* macros use BIT(), but reset.h does not
include bits.h. This causes compilation errors when including
reset.h standalone.
Include bits.h to make reset.h self-contained.
Suggested-by: Troy Mitchell <troy.mitchell@linux.dev>
Reviewed-by: Troy Mitchell <troy.mitchell@linux.dev>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: Encrow Thorne <jyc0019@gmail.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
The devm_regmap_field_alloc() function never returns NULL, it returns
error pointers. Update the error checking to match.
Fixes: 58128aa88867 ("reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
TH1520 SoC is divided into several subsystems, shipping distinct reset
controllers with similar control logic. Let's make reset signal mapping
a data structure specific to one compatible to prepare for introduction
of more reset controllers in the future.
Signed-off-by: Yao Zi <ziyao@disroot.org>
Acked-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
TH1520 SoC is divided into several subsystems, most of them have
distinct reset controllers. Let's document reset controllers other than
the one for VO subsystem and IDs for their reset signals.
Signed-off-by: Yao Zi <ziyao@disroot.org>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Registers in control of TH1520_RESET_ID_{NPU,WDT0,WDT1} belong to AP
reset controller, not the VO one which is documented as
"thead,th1520-reset" and is the only reset controller supported for
TH1520 for now.
Let's remove the IDs, leaving them to be implemented by AP-subsystem
reset controller in the future.
Fixes: 30e7573bab ("dt-bindings: reset: Add T-HEAD TH1520 SoC Reset Controller")
Signed-off-by: Yao Zi <ziyao@disroot.org>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
There are no more users of this code. Let's remove the exported symbols
and the implementation from reset core.
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
[p.zabel@pengutronix.de: folded in 8e6ec20e-8965-4b42-99fc-0462269ff2f1@paulmck-laptop]
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
On the Renesas RZ/G3S SoC, the USB PHY block has an input signal called
PWRRDY. This signal is managed by the system controller and must be
de-asserted after powering on the area where USB PHY resides and asserted
before powering it off.
On power-on/resume the USB PWRRDY signal need to be de-asserted before
enabling clock and switching the module to normal state (through MSTOP
support). The power-on/resume configuration sequence must be:
1/ PWRRDY=0
2/ CLK_ON=1
3/ MSTOP=0
On power-off/suspend the configuration sequence should be:
1/ MSTOP=1
2/ CLK_ON=0
3/ PWRRDY=1
The CLK_ON and MSTOP functionalities are controlled by clock drivers.
The suspend/resume support will be handled by different patches.
After long discussions with the internal HW team, it has been confirmed
that the HW connection b/w USB PHY block, the USB channels, the system
controller, clock, MSTOP, PWRRDY signal is as follows:
┌──────────────────────────────┐
│ │◄── CPG_CLKON_USB.CLK0_ON
│ USB CH0 │
┌──────────────────────────┐ │┌───────────────────────────┐ │◄── CPG_CLKON_USB.CLK2_ON
│ ┌────────┐ ││host controller registers │ │
│ │ │ ││function controller registers│
│ │ PHY0 │◄──┤└───────────────────────────┘ │
│ USB PHY │ │ └────────────▲─────────────────┘
│ └────────┘ │
│ │ CPG_BUS_PERI_COM_MSTOP.MSTOP{6, 5}_ON
│┌──────────────┐ ┌────────┐
││USHPHY control│ │ │
││ registers │ │ PHY1 │ ┌──────────────────────────────┐
│└──────────────┘ │ │◄──┤ USB CH1 │
│ └────────┘ │┌───────────────────────────┐ │◄── CPG_CLKON_USB.CLK1_ON
└─▲───────▲─────────▲──────┘ ││ host controller registers │ │
│ │ │ │└───────────────────────────┘ │
│ │ │ └────────────▲─────────────────┘
│ │ │ │
│ │ │ CPG_BUS_PERI_COM_MSTOP.MSTOP7_ON
│PWRRDY │ │
│ │ CPG_CLK_ON_USB.CLK3_ON
│ │
│ CPG_BUS_PERI_COM_MSTOP.MSTOP4_ON
│
┌────┐
│SYSC│
└────┘
where:
- CPG_CLKON_USB.CLK.CLKX_ON is the register bit controlling the clock X
of different USB blocks, X in {0, 1, 2, 3}
- CPG_BUS_PERI_COM_MSTOP.MSTOPX_ON is the register bit controlling the
MSTOP of different USB blocks, X in {4, 5, 6, 7}
- USB PHY is the USB PHY block exposing 2 ports, port0 and port1, used
by the USB CH0, USB CH1
- SYSC is the system controller block controlling the PWRRDY signal
- USB CHx are individual USB block with host and function capabilities
(USB CH0 have both host and function capabilities, USB CH1 has only
host capabilities)
The USBPHY control registers are controlled though the
reset-rzg2l-usbphy-ctrl driver. The USB PHY ports are controlled by
phy_rcar_gen3_usb2 (drivers/phy/renesas/phy-rcar-gen3-usb2.c file). The
USB PHY ports request resets from the reset-rzg2l-usbphy-ctrl driver.
The connection b/w the system controller and the USB PHY CTRL driver is
implemented through the renesas,sysc-pwrrdy device tree property
proposed in this patch. This property specifies the register offset and the
bitmask required to control the PWRRDY signal.
Since the USB PHY CTRL driver needs to be probed before any other
USB-specific driver on RZ/G3S, control of PWRRDY is passed exclusively
to it. This guarantees the correct configuration sequence between clocks,
MSTOP bits, and the PWRRDY bit on probe/resume and remove/suspend. At the
same time, changes are kept minimal by avoiding modifications to the USB
PHY driver to also handle the PWRRDY itself.
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
The Renesas USB PHY hardware block needs to have the PWRRDY bit in the
system controller set before applying any other settings. The PWRRDY bit
must be controlled during power-on, power-off, and system suspend/resume
sequences as follows:
- during power-on/resume, it must be set to zero before enabling clocks and
modules
- during power-off/suspend, it must be set to one after disabling clocks
and modules
Add the renesas,sysc-pwrrdy device tree property, which allows the
reset-rzg2l-usbphy-ctrl driver to parse, map, and control the system
controller PWRRDY bit at the appropriate time. Along with it add a new
compatible for the RZ/G3S SoC.
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
LAN969x uses the same reset configuration as LAN966x, but we need to
allow compiling it when ARCH_LAN969X is selected.
A fallback compatible to LAN966x will be used.
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
Create a new fuse inode flag that indicates that the kernel should
implement various local filesystem behaviors instead of passing vfs
commands straight through to the fuse server and expecting the server to
do all the work. For example, this means that we'll use the kernel to
transform some ACL updates into mode changes, and later to do
enforcement of the immutable and append iflags.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
no_slb_preload cmdline can come useful in quickly disabling and/or
testing the performance impact of userspace slb preloads. Recently there
was a slb multi-hit issue due to slb preload cache which was very
difficult to triage. This cmdline option allows to quickly disable
preloads and verify if the issue exists in preload cache or somewhere
else. This can also be a useful option to see the effect of slb preloads
for any application workload e.g. number of slb faults with or w/o slb
preloads.
with slb_preload:
slb_faults (minimal initrd boot): 15
slb_faults (full systemd boot): 300
with no_slb_preload:
slb_faults (minimal initrd boot): 33
slb_faults (full systemd boot): 138180
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/de484b55c45d831bc2db63945f455153c89a9a65.1761834163.git.ritesh.list@gmail.com
Update the directMap page counters for Hash. Hash by default always uses
mmu_linear_psize only, for it's directMap. However, once the kernel has
booted and the dmesg log is wrapped over there is no way of knowing the
kernel linear pagesize with Hash mmu. Features like debug_page_alloc can
make mmu_linear_psize to be PAGE_SIZE instead of PMD / PUD mappings. It
would be easier if we have this info printed in proc meminfo similar to
Radix for debugging purposes.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/208e6f946d2ba9c1e2b8b4f665728abe5c891e7c.1761834163.git.ritesh.list@gmail.com
We get below errors when we try to enable debug logs in book3s64/hash_utils.c
This patch fixes these errors related to phys_addr_t printf format.
arch/powerpc/mm/book3s64/hash_utils.c: In function ‘htab_initialize’:
arch/powerpc/mm/book3s64/hash_utils.c:1401:21: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘phys_addr_t’ {aka ‘long long unsigned int’} [-Werror=format=]
1401 | DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
arch/powerpc/mm/book3s64/hash_utils.c:1401:21: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 3 has type ‘phys_addr_t’ {aka ‘long long unsigned int’} [-Werror=format=]
cc1: all warnings being treated as errors
make[6]: *** [../scripts/Makefile.build:287: arch/powerpc/mm/book3s64/hash_utils.o] Error 1
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/4873e9692fc4411099c9741005d218d5e734c345.1761834163.git.ritesh.list@gmail.com
HPTE format was changed since Power9 (ISA 3.0) onwards. While dumping
kernel hash page tables, nothing gets printed on powernv P9+. This patch
utilizes the helpers added in the patch tagged as fixes, to convert new
format to old format and dump the hptes. This fix is only needed for
native_find() (powernv), since pseries continues to work fine with the
old format.
Fixes: 6b243fcfb5 ("powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/4c2bb9e5b3cfbc0dd80b61b67cdd3ccfc632684c.1761834163.git.ritesh.list@gmail.com
On systems using the hash MMU, there is a software SLB preload cache that
mirrors the entries loaded into the hardware SLB buffer. This preload
cache is subject to periodic eviction — typically after every 256 context
switches — to remove old entry.
To optimize performance, the kernel skips switch_mmu_context() in
switch_mm_irqs_off() when the prev and next mm_struct are the same.
However, on hash MMU systems, this can lead to inconsistencies between
the hardware SLB and the software preload cache.
If an SLB entry for a process is evicted from the software cache on one
CPU, and the same process later runs on another CPU without executing
switch_mmu_context(), the hardware SLB may retain stale entries. If the
kernel then attempts to reload that entry, it can trigger an SLB
multi-hit error.
The following timeline shows how stale SLB entries are created and can
cause a multi-hit error when a process moves between CPUs without a
MMU context switch.
CPU 0 CPU 1
----- -----
Process P
exec swapper/1
load_elf_binary
begin_new_exc
activate_mm
switch_mm_irqs_off
switch_mmu_context
switch_slb
/*
* This invalidates all
* the entries in the HW
* and setup the new HW
* SLB entries as per the
* preload cache.
*/
context_switch
sched_migrate_task migrates process P to cpu-1
Process swapper/0 context switch (to process P)
(uses mm_struct of Process P) switch_mm_irqs_off()
switch_slb
load_slb++
/*
* load_slb becomes 0 here
* and we evict an entry from
* the preload cache with
* preload_age(). We still
* keep HW SLB and preload
* cache in sync, that is
* because all HW SLB entries
* anyways gets evicted in
* switch_slb during SLBIA.
* We then only add those
* entries back in HW SLB,
* which are currently
* present in preload_cache
* (after eviction).
*/
load_elf_binary continues...
setup_new_exec()
slb_setup_new_exec()
sched_switch event
sched_migrate_task migrates
process P to cpu-0
context_switch from swapper/0 to Process P
switch_mm_irqs_off()
/*
* Since both prev and next mm struct are same we don't call
* switch_mmu_context(). This will cause the HW SLB and SW preload
* cache to go out of sync in preload_new_slb_context. Because there
* was an SLB entry which was evicted from both HW and preload cache
* on cpu-1. Now later in preload_new_slb_context(), when we will try
* to add the same preload entry again, we will add this to the SW
* preload cache and then will add it to the HW SLB. Since on cpu-0
* this entry was never invalidated, hence adding this entry to the HW
* SLB will cause a SLB multi-hit error.
*/
load_elf_binary continues...
START_THREAD
start_thread
preload_new_slb_context
/*
* This tries to add a new EA to preload cache which was earlier
* evicted from both cpu-1 HW SLB and preload cache. This caused the
* HW SLB of cpu-0 to go out of sync with the SW preload cache. The
* reason for this was, that when we context switched back on CPU-0,
* we should have ideally called switch_mmu_context() which will
* bring the HW SLB entries on CPU-0 in sync with SW preload cache
* entries by setting up the mmu context properly. But we didn't do
* that since the prev mm_struct running on cpu-0 was same as the
* next mm_struct (which is true for swapper / kernel threads). So
* now when we try to add this new entry into the HW SLB of cpu-0,
* we hit a SLB multi-hit error.
*/
WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62
assert_slb_presence+0x2c/0x50(48 results) 02:47:29 [20157/42149]
Modules linked in:
CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12
VOLUNTARY
Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected)
0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
NIP: c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
REGS: c0000000497c77e0 TRAP: 0700 Not tainted (6.16.0-rc3-dirty)
MSR: 8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 28888482 XER: 00000000
CFAR: c0000000001543b0 IRQMASK: 3
<...>
NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
LR [c0000000001543b4] slb_insert_entry+0x124/0x390
Call Trace:
0x7fffceb5ffff (unreliable)
preload_new_slb_context+0x100/0x1a0
start_thread+0x26c/0x420
load_elf_binary+0x1b04/0x1c40
bprm_execve+0x358/0x680
do_execveat_common+0x1f8/0x240
sys_execve+0x58/0x70
system_call_exception+0x114/0x300
system_call_common+0x160/0x2c4
>From the above analysis, during early exec the hardware SLB is cleared,
and entries from the software preload cache are reloaded into hardware
by switch_slb. However, preload_new_slb_context and slb_setup_new_exec
also attempt to load some of the same entries, which can trigger a
multi-hit. In most cases, these additional preloads simply hit existing
entries and add nothing new. Removing these functions avoids redundant
preloads and eliminates the multi-hit issue. This patch removes these
two functions.
We tested process switching performance using the context_switch
benchmark on POWER9/hash, and observed no regression.
Without this patch: 129041 ops/sec
With this patch: 129341 ops/sec
We also measured SLB faults during boot, and the counts are essentially
the same with and without this patch.
SLB faults without this patch: 19727
SLB faults with this patch: 19786
Fixes: 5434ae7462 ("powerpc/64s/hash: Add a SLB preload cache")
cc: stable@vger.kernel.org
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/0ac694ae683494fe8cadbd911a1a5018d5d3c541.1761834163.git.ritesh.list@gmail.com
On 32-bit book3s with hash-MMUs, tlb_flush() was a no-op. This was
unnoticed because all uses until recently were for unmaps, and thus
handled by __tlb_remove_tlb_entry().
After commit 4a18419f71 ("mm/mprotect: use mmu_gather") in kernel 5.19,
tlb_gather_mmu() started being used for mprotect as well. This caused
mprotect to simply not work on these machines:
int *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
*ptr = 1; // force HPTE to be created
mprotect(ptr, 4096, PROT_READ);
*ptr = 2; // should segfault, but succeeds
Fixed by making tlb_flush() actually flush TLB pages. This finally
agrees with the behaviour of boot3s64's tlb_flush().
Fixes: 4a18419f71 ("mm/mprotect: use mmu_gather")
Cc: stable@vger.kernel.org
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Dave Vasilevsky <dave@vasilevsky.ca>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251116-vasi-mprotect-g3-v3-1-59a9bd33ba00@vasilevsky.ca
At this point there are very few call chains that might lead to
d_make_discardable() on a dentry that hadn't been made persistent:
calls of simple_unlink() and simple_rmdir() in configfs and
apparmorfs.
Both filesystems do pin (part of) their contents in dcache, but
they are currently playing very unusual games with that. Converting
them to more usual patterns might be possible, but it's definitely
going to be a long series of changes in both cases.
For now the easiest solution is to have both stop using simple_unlink()
and simple_rmdir() - that allows to make d_make_discardable() warn
when given a non-persistent dentry.
Rather than giving them full-blown private copies (with calls of
d_make_discardable() replaced with dput()), let's pull the parts of
simple_unlink() and simple_rmdir() that deal with timestamps and link
counts into separate helpers (__simple_unlink() and __simple_rmdir()
resp.) and have those used by configfs and apparmorfs.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
securityfs uses simple_recursive_removal(), but does not bother to mark
dentries persistent. This is the only place where it still happens; get
rid of that irregularity.
* use simple_{start,done}_creating() and d_make_persitent(); kill_litter_super()
use was already gone, since we empty the filesystem instance before it gets
shut down.
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Parallel to binderfs stuff:
* use simple_start_creating()/simple_done_creating()/d_make_persistent()
instead of manual inode_lock()/lookup_noperm()/d_instanitate()/inode_unlock().
* allocate inode first - simpler cleanup that way.
* use simple_recursive_removal() instead of open-coding it.
* switch to kill_anon_super()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
One instance per net-ns. There's a fixed subset (several files in root,
an optional symlink in root + initially empty /clients/) + per-client
subdirectory in /clients/. Clients can appear only after the filesystem
is there and they are all gone before it gets through ->kill_sb().
Fixed subset created in fill_super(), regular files by simple_fill_super(),
then a subdirectory and a symlink - manually. It is removed by
kill_litter_super().
Per-client subdirectories are created by nfsd_client_mkdir() (populated
with client-supplied list of files in them). Removed by nfsd_client_rmdir(),
which is simple_recursive_removal().
All dentries except for the ones from simple_fill_super() come from
* nfsd_mkdir() (subdirectory, dentry from simple_start_creating()).
Called from fill_super() (creates initially empty /clients)
and from nfsd_client_mkdir (creates a per-client subdirectory
in /clients).
* _nfsd_symlink() (symlink, dentry from simple_start_creating()), called
from fill_super().
* nfsdfs_create_files() (regulars, dentry from simple_start_creating()),
called only from nfsd_client_mkdir().
Turn d_instatiate() + inode_unlock() into d_make_persistent() + simple_done_creating()
in nfsd_mkdir(), _nfsd_symlink() and nfsdfs_create_files() and we are done.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just use d_make_persistent() + dput() (and fold the latter into
simple_finish_creating()) and that's it...
NOTE: pipe->dentry is a borrowed reference - it does not contribute
to dentry refcount.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
just have hypfs_create_file() do the usual simple_start_creating()/
d_make_persistent()/simple_done_creating() and that's it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
hypfs dentries end up with refcount 2 when they are not busy.
Refcount 1 is enough to keep them pinned, and going that way
allows to simplify things nicely:
* don't need to drop an extra reference before the
call of kill_litter_super() in ->kill_sb(); all we need
there is to reset the cleanup list - everything on it will
be taken out automatically.
* we can make use of simple_recursive_removal() on
tree rebuilds; just make sure that only children of root
end up in the cleanup list and hypfs_delete_tree() becomes
much simpler
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All files are regular; ep0 is there all along, other ep* may appear
and go away during the filesystem lifetime; all of those are guaranteed
to be gone by the time we umount it.
Object creation is in ffs_sb_create_file(), removals - at ->kill_sb()
time (for ep0) or by simple_remove_by_name() from ffs_epfiles_destroy()
(for the rest of them).
Switch ffs_sb_create_file() to simple_start_creating()/d_make_persistent()/
simple_done_creating() and that's it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No need to return dentry from ffs_sb_create_file() or keep it around
afterwards.
To avoid subtle issues with getting to ffs from epfiles in
ffs_epfiles_destroy(), pass the superblock as explicit argument.
Callers have it anyway.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ffs_epfile_open() can race with removal, ending up with file->private_data
pointing to freed object.
There is a total count of opened files on functionfs (both ep0 and
dynamic ones) and when it hits zero, dynamic files get removed.
Unfortunately, that removal can happen while another thread is
in ffs_epfile_open(), but has not incremented the count yet.
In that case open will succeed, leaving us with UAF on any subsequent
read() or write().
The root cause is that ffs->opened is misused; atomic_dec_and_test() vs.
atomic_add_return() is not a good idea, when object remains visible all
along.
To untangle that
* serialize openers on ffs->mutex (both for ep0 and for dynamic files)
* have dynamic ones use atomic_inc_not_zero() and fail if we had
zero ->opened; in that case the file we are opening is doomed.
* have the inodes of dynamic files marked on removal (from the
callback of simple_recursive_removal()) - clear ->i_private there.
* have open of dynamic ones verify they hadn't been already removed,
along with checking that state is FFS_ACTIVE.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... otherwise we just might free ffs with ffs->reset_work
still on queue. That needs to be done after ffs_data_reset() -
that's the cutoff point for configfs accesses (serialized
on gadget_info->lock), which is where the schedule_work()
would come from.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A reference is held by the superblock (it's dropped in ffs_kill_sb())
and filesystem will not get to ->kill_sb() while there are any opened
files, TYVM...
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ffs_data_closed() has a seriously confusing logics in it: in addition
to the normal "decrement a counter and do some work if it hits zero"
there's "... and if it has somehow become negative, do that" bit.
It's not a race, despite smelling rather fishy. What really happens
is that in addition to "call that on close of files there, to match
the increments of counter on opens" there's one call in ->kill_sb().
Counter starts at 0 and never goes negative over the lifetime of
filesystem (or we have much worse problems everywhere - ->release()
call of some file somehow unpaired with successful ->open() of the
same). At the filesystem shutdown it will be 0 or, again, we have
much worse problems - filesystem instance destroyed with files on it
still open. In other words, at that call and at that call alone
the decrement would go from 0 to -1, hitting that chunk (and not
hitting the "if it hits 0" part).
So that check is a weirdly spelled "called from ffs_kill_sb()".
Just expand the call in the latter and kill the misplaced chunk
in ffs_data_closed().
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Extend cpu_cache_invalidate_memregion() to support invalidating a
particular range of memory by introducing start and length parameters.
Control of types of invalidation is left for when use cases turn up. For
now everything is Clean and Invalidate.
Where the range is unknown, use the provided cpu_cache_invalidate_all()
helper to act as documentation of intent in a fashion that is clearer than
passing (0, -1) to cpu_cache_invalidate_memregion().
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Similar to the RK3288, the RK3368 has two different implementations of
the PWM block inside the SoC - the newer ones that we have a driver for
and that is used on every SoC and a previous variant that was likely
left as a fallback if the new one creates problems.
The devicetree is already set up for the new variant, so make sure
we actually use it - similar to the RK3288.
Signed-off-by: Heiko Stuebner <heiko.stuebner@cherry.de>
Link: https://patch.msgid.link/20251021074254.87065-4-heiko@sntech.de
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
The Programmable Real-time Unit and Industrial Communication Subsystem
Megabit (ICSSM) is a microcontroller subsystem in TI SoCs such as
AM57x, AM437x, and AM335x. It provides real-time processing
capabilities for industrial communication and custom peripheral interfaces.
Currently, EVMs based on AM57x, AM437x, and AM335x use the ICSSM driver
for PRU-based Ethernet functionality.
This patch enables TI_PRUSS and TI_PRUETH as a module for TI SoCs.
Signed-off-by: Parvathi Pudi <parvathi@couthit.com>
Link: https://lore.kernel.org/r/20251103125451.1679404-1-parvathi@couthit.com
Signed-off-by: Kevin Hilman <khilman@baylibre.com>
When advancing the target expiration for the guest's APIC timer in periodic
mode, set the expiration to "now" if the target expiration is in the past
(similar to what is done in update_target_expiration()). Blindly adding
the period to the previous target expiration can result in KVM generating
a practically unbounded number of hrtimer IRQs due to programming an
expired timer over and over. In extreme scenarios, e.g. if userspace
pauses/suspends a VM for an extended duration, this can even cause hard
lockups in the host.
Currently, the bug only affects Intel CPUs when using the hypervisor timer
(HV timer), a.k.a. the VMX preemption timer. Unlike the software timer,
a.k.a. hrtimer, which KVM keeps running even on exits to userspace, the
HV timer only runs while the guest is active. As a result, if the vCPU
does not run for an extended duration, there will be a huge gap between
the target expiration and the current time the vCPU resumes running.
Because the target expiration is incremented by only one period on each
timer expiration, this leads to a series of timer expirations occurring
rapidly after the vCPU/VM resumes.
More critically, when the vCPU first triggers a periodic HV timer
expiration after resuming, advancing the expiration by only one period
will result in a target expiration in the past. As a result, the delta
may be calculated as a negative value. When the delta is converted into
an absolute value (tscdeadline is an unsigned u64), the resulting value
can overflow what the HV timer is capable of programming. I.e. the large
value will exceed the VMX Preemption Timer's maximum bit width of
cpu_preemption_timer_multi + 32, and thus cause KVM to switch from the
HV timer to the software timer (hrtimers).
After switching to the software timer, periodic timer expiration callbacks
may be executed consecutively within a single clock interrupt handler,
because hrtimers honors KVM's request for an expiration in the past and
immediately re-invokes KVM's callback after reprogramming. And because
the interrupt handler runs with IRQs disabled, restarting KVM's hrtimer
over and over until the target expiration is advanced to "now" can result
in a hard lockup.
E.g. the following hard lockup was triggered in the host when running a
Windows VM (only relevant because it used the APIC timer in periodic mode)
after resuming the VM from a long suspend (in the host).
NMI watchdog: Watchdog detected hard LOCKUP on cpu 45
...
RIP: 0010:advance_periodic_target_expiration+0x4d/0x80 [kvm]
...
RSP: 0018:ff4f88f5d98d8ef0 EFLAGS: 00000046
RAX: fff0103f91be678e RBX: fff0103f91be678e RCX: 00843a7d9e127bcc
RDX: 0000000000000002 RSI: 0052ca4003697505 RDI: ff440d5bfbdbd500
RBP: ff440d5956f99200 R08: ff2ff2a42deb6a84 R09: 000000000002a6c0
R10: 0122d794016332b3 R11: 0000000000000000 R12: ff440db1af39cfc0
R13: ff440db1af39cfc0 R14: ffffffffc0d4a560 R15: ff440db1af39d0f8
FS: 00007f04a6ffd700(0000) GS:ff440db1af380000(0000) knlGS:000000e38a3b8000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000d5651feff8 CR3: 000000684e038002 CR4: 0000000000773ee0
PKRU: 55555554
Call Trace:
<IRQ>
apic_timer_fn+0x31/0x50 [kvm]
__hrtimer_run_queues+0x100/0x280
hrtimer_interrupt+0x100/0x210
? ttwu_do_wakeup+0x19/0x160
smp_apic_timer_interrupt+0x6a/0x130
apic_timer_interrupt+0xf/0x20
</IRQ>
Moreover, if the suspend duration of the virtual machine is not long enough
to trigger a hard lockup in this scenario, since commit 98c25ead5e
("KVM: VMX: Move preemption timer <=> hrtimer dance to common x86"), KVM
will continue using the software timer until the guest reprograms the APIC
timer in some way. Since the periodic timer does not require frequent APIC
timer register programming, the guest may continue to use the software
timer in perpetuity.
Fixes: d8f2f498d9 ("x86/kvm: fix LAPIC timer drift when guest uses periodic mode")
Cc: stable@vger.kernel.org
Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
[sean: massage comments and changelog]
Link: https://patch.msgid.link/20251113205114.1647493-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When restarting an hrtimer to emulate a the guest's APIC timer in periodic
mode, explicitly set the expiration using the target expiration computed
by advance_periodic_target_expiration() instead of adding the period to
the existing timer. This will allow making adjustments to the expiration,
e.g. to deal with expirations far in the past, without having to implement
the same logic in both advance_periodic_target_expiration() and
apic_timer_fn().
Cc: stable@vger.kernel.org
Signed-off-by: fuqiang wang <fuqiang.wng@gmail.com>
[sean: split to separate patch, write changelog]
Link: https://patch.msgid.link/20251113205114.1647493-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
WARN and don't restart the hrtimer if KVM's callback runs with the guest's
APIC timer in periodic mode but with a period of '0', as not advancing the
hrtimer's deadline would put the CPU into an infinite loop of hrtimer
events. Observing a period of '0' should be impossible, even when the
hrtimer is running on a different CPU than the vCPU, as KVM is supposed to
cancel the hrtimer before changing (or zeroing) the period, e.g. when
switching from periodic to one-shot.
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251113205114.1647493-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the normal, checked versions for get_user() and put_user() instead of
the double-underscore versions that omit range checks, as the checked
versions are actually measurably faster on modern CPUs (12%+ on Intel,
25%+ on AMD).
The performance hit on the unchecked versions is almost entirely due to
the added LFENCE on CPUs where LFENCE is serializing (which is effectively
all modern CPUs), which was added by commit 304ec1b050 ("x86/uaccess:
Use __uaccess_begin_nospec() and uaccess_try_nospec"). The small
optimizations done by commit b19b74bc99 ("x86/mm: Rework address range
check in get_user() and put_user()") likely shave a few cycles off, but
the bulk of the extra latency comes from the LFENCE.
Don't bother trying to open-code an equivalent for performance reasons, as
the loss of inlining (e.g. see commit ea6f043fc9 ("x86: Make __get_user()
generate an out-of-line call") is largely a non-factor (ignoring setups
where RET is something entirely different),
As measured across tens of millions of calls of guest PTE reads in
FNAME(walk_addr_generic):
__get_user() get_user() open-coded open-coded, no LFENCE
Intel (EMR) 75.1 67.6 75.3 65.5
AMD (Turin) 68.1 51.1 67.5 49.3
Note, Hyper-V MSR emulation is not a remotely hot path, but convert it
anyways for consistency, and because there is a general desire to remove
__{get,put}_user() entirely.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/all/CAHk-=wimh_3jM9Xe8Zx0rpuf8CPDu6DkRCGb44azk0Sz5yqSnw@mail.gmail.com
Cc: Borislav Petkov <bp@alien8.de>
Link: https://patch.msgid.link/20251106210206.221558-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add PCIe RC support on Orion O6 board.
The Orion O6 board includes multiple PCIe root complexes. The current
device tree configuration enables detection and basic operation of PCIe
endpoints on this platform.
GPIO and pinctrl subsystems for this platform are not yet ready for
upstream inclusion. Consequently, attributes such as reset-gpios and
pinctrl configurations are temporarily omitted from the PCIe node
definitions.
Endpoint detection and functionality are confirmed to be operational with
this basic configuration. The missing GPIO and pinctrl support will be
added incrementally in future patches as the dependent subsystems become
available upstream.
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Hans Zhang <hans.zhang@cixtech.com>
Link: https://lore.kernel.org/r/20251108140305.1120117-11-hans.zhang@cixtech.com
Signed-off-by: Peter Chen <peter.chen@cixtech.com>
damon_sysfs_test_add_targets() is assuming all dynamic memory allocation
in it will succeed. Those are indeed likely in the real use cases since
those allocations are too small to fail, but theoretically those could
fail. In the case, inappropriate memory access can happen. Fix it by
appropriately cleanup pre-allocated memory and skip the execution of the
remaining tests in the failure cases.
Link: https://lkml.kernel.org/r/20251101182021.74868-21-sj@kernel.org
Fixes: b8ee5575f7 ("mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org> [6.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_test_set_filters_default_reject() is assuming all dynamic memory
allocation in it will succeed. Those are indeed likely in the real use
cases since those allocations are too small to fail, but theoretically
those could fail. In the case, inappropriate memory access can happen.
Fix it by appropriately cleanup pre-allocated memory and skip the
execution of the remaining tests in the failure cases.
Link: https://lkml.kernel.org/r/20251101182021.74868-17-sj@kernel.org
Fixes: 094fb14913 ("mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org> [6.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_test_update_monitoring_result() is assuming all dynamic memory
allocation in it will succeed. Those are indeed likely in the real use
cases since those allocations are too small to fail, but theoretically
those could fail. In the case, inappropriate memory access can happen.
Fix it by appropriately cleanup pre-allocated memory and skip the
execution of the remaining tests in the failure cases.
Link: https://lkml.kernel.org/r/20251101182021.74868-12-sj@kernel.org
Fixes: f4c978b659 ("mm/damon/core-test: add a test for damon_update_monitoring_results()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org> [6.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_test_ops_registration() is assuming all dynamic memory allocation in
it will succeed. Those are indeed likely in the real use cases since
those allocations are too small to fail, but theoretically those could
fail. In the case, inappropriate memory access can happen. Fix it by
appropriately cleanup pre-allocated memory and skip the execution of the
remaining tests in the failure cases.
Link: https://lkml.kernel.org/r/20251101182021.74868-10-sj@kernel.org
Fixes: 4f540f5ab4 ("mm/damon/core-test: add a kunit test case for ops registration")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org> [5.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon/tests: fix memory bugs in kunit tests".
DAMON kunit tests were initially written assuming those will be run on
environments that are well controlled and therefore tolerant to transient
test failures and bugs in the test code itself. The user-mode linux based
manual run of the tests is one example of such an environment. And the
test code was written for adding more test coverage as fast as possible,
over making those safe and reliable.
As a result, the tests resulted in having a number of bugs including real
memory leaks, theoretical unhandled memory allocation failures, and unused
memory allocations. The allocation failures that are not handled well are
unlikely in the real world, since those allocations are too small to fail.
But in theory, it can happen and cause inappropriate memory access.
It is arguable if bugs in test code can really harm users. But, anyway
bugs are bugs that need to be fixed. Fix the bugs one by one. Also Cc
stable@ for the fixes of memory leak and unhandled memory allocation
failures. The unused memory allocations are only a matter of memory
efficiency, so not Cc-ing stable@.
The first patch fixes memory leaks in the test code for the DAMON core
layer.
Following fifteen, three, and one patches respectively fix unhandled
memory allocation failures in the test code for DAMON core layer, virtual
address space DAMON operation set, and DAMON sysfs interface, one by one
per test function.
Final two patches remove memory allocations that are correctly deallocated
at the end, but not really being used by any code.
This patch (of 22):
Kunit test function for damos_set_filters_default_reject() allocates two
'struct damos_filter' objects and not deallocates those, so that the
memory for the two objects are leaked for every time the test runs. Fix
this by deallocating those objects at the end of the test code.
Link: https://lkml.kernel.org/r/20251101182021.74868-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251101182021.74868-2-sj@kernel.org
Fixes: 094fb14913 ("mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Gow <davidgow@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org> [6.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Poison (or ECC) errors can be very common on a large size cluster. The
kernel MM currently does not handle ECC errors / poison on a memory region
that is not backed by struct pages. If a memory region mapped using
remap_pfn_range() for example, but not added to the kernel, MM will not
have associated struct pages. Add a new mechanism to handle memory
failure on such memory.
Make kernel MM expose a function to allow modules managing the device
memory to register the device memory SPA and the address space associated
it. MM maintains this information as an interval tree. On poison, MM can
search for the range that the poisoned PFN belong and use the
address_space to determine the mapping VMA.
In this implementation, kernel MM follows the following sequence that is
largely similar to the memory_failure() handler for struct page backed
memory:
1. memory_failure() is triggered on reception of a poison error. An
absence of struct page is detected and consequently
memory_failure_pfn() is executed.
2. memory_failure_pfn() collects the processes mapped to the PFN.
3. memory_failure_pfn() sends SIGBUS to all the processes mapping the
faulty PFN using kill_procs().
Note that there is one primary difference versus the handling of the
poison on struct pages, which is to skip unmapping to the faulty PFN.
This is done to handle the huge PFNMAP support added recently [1] that
enables VM_PFNMAP vmas to map at PMD or PUD level. A poison to a PFN
mapped in such as way would need breaking the PMD/PUD mapping into PTEs
that will get mirrored into the S2. This can greatly increase the cost of
table walks and have a major performance impact.
Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20251102184434.2406-3-ankita@nvidia.com
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Cc: Aniket Agashe <aniketa@nvidia.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Neo Jia <cjia@nvidia.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: Smita Koralahalli Channabasappa <smita.koralahallichannabasappa@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tarun Gupta <targupta@nvidia.com>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Cc: Vikram Sethi <vsethi@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Poison (or ECC) errors can be very common on a large size cluster. The
kernel MM currently handles ECC errors / poison only on memory page backed
by struct page. The handling is currently missing for the PFNMAP memory
that does not have struct pages. The series adds such support.
Implement a new ECC handling for memory without struct pages. Kernel MM
expose registration APIs to allow modules that are managing the device to
register its device memory region. MM then tracks such regions using
interval tree.
The mechanism is largely similar to that of ECC on pfn with struct pages.
If there is an ECC error on a pfn, all the mapping to it are identified
and a SIGBUS is sent to the user space processes owning those mappings.
Note that there is one primary difference versus the handling of the
poison on struct pages, which is to skip unmapping to the faulty PFN.
This is done to handle the huge PFNMAP support added recently [1] that
enables VM_PFNMAP vmas to map at PMD or PUD level. A poison to a PFN
mapped in such as way would need breaking the PMD/PUD mapping into PTEs
that will get mirrored into the S2. This can greatly increase the cost of
table walks and have a major performance impact.
nvgrace-gpu-vfio-pci module maps the device memory to user VA (Qemu) using
remap_pfn_range without being added to the kernel [2]. These device
memory PFNs are not backed by struct page. So make nvgrace-gpu-vfio-pci
module make use of the mechanism to get poison handling support on the
device memory.
This patch (of 3):
The GHES code allows calling of memory_failure() on the PFNs that pass the
pfn_valid() check. This contract is broken for the remapped PFNs which
fails the check and ghes_do_memory_failure() returns without triggering
memory_failure().
Update code to allow memory_failure() call on PFNs failing pfn_valid().
Link: https://lkml.kernel.org/r/20251102184434.2406-1-ankita@nvidia.com
Link: https://lkml.kernel.org/r/20251102184434.2406-2-ankita@nvidia.com
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: Aniket Agashe <aniketa@nvidia.com>
Cc: Ankit Agrawal <ankita@nvidia.com>
Cc: Borislav Betkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Neo Jia <cjia@nvidia.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Smita Koralahalli Channabasappa <smita.koralahallichannabasappa@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tarun Gupta <targupta@nvidia.com>
Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Cc: Vikram Sethi <vsethi@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make break_ksm() receive an address range and change break_ksm_pmd_entry()
to perform a range-walk and return the address of the first ksm page
found.
This change allows break_ksm() to skip unmapped regions instead of
iterating every page address. When unmerging large sparse VMAs, this
significantly reduces runtime.
In a benchmark unmerging a 32 TiB sparse virtual address space where only
one page was populated, the runtime dropped from 9 minutes to less then 5
seconds.
Link: https://lkml.kernel.org/r/20251105184912.186329-3-pedrodemargomes@gmail.com
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "ksm: perform a range-walk to jump over holes in break_ksm",
v4.
When unmerging an address range, unmerge_ksm_pages function walks every
page address in the specified range to locate ksm pages. This becomes
highly inefficient when scanning large virtual memory areas that contain
mostly unmapped regions, causing the process to get blocked for several
minutes.
This patch makes break_ksm, function called by unmerge_ksm_pages for every
page in an address range, perform a range walk, allowing it to skip over
entire unmapped holes in a VMA, avoiding unnecessary lookups.
As pointed out by David Hildenbrand in [1], unmerge_ksm_pages() is called
from:
* ksm_madvise() through madvise(MADV_UNMERGEABLE). There are not a lot
of users of that function.
* __ksm_del_vma() through ksm_del_vmas(). Effectively called when
disabling KSM for a process either through the sysctl or from s390x gmap
code when enabling storage keys for a VM.
Consider the following test program which creates a 32 TiB mapping in the
virtual address space but only populates a single page:
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>
/* 32 TiB */
const size_t size = 32ul * 1024 * 1024 * 1024 * 1024;
int main() {
char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0);
if (area == MAP_FAILED) {
perror("mmap() failed\n");
return -1;
}
/* Populate a single page such that we get an anon_vma. */
*area = 0;
/* Enable KSM. */
madvise(area, size, MADV_MERGEABLE);
madvise(area, size, MADV_UNMERGEABLE);
return 0;
}
Without this patch, this program takes 9 minutes to finish, while with
this patch it finishes in less then 5 seconds.
This patch (of 3):
This reverts commit e317a8d8b4 and changes
function break_ksm_pmd_entry() to use folios.
This reverts break_ksm() to use walk_page_range_vma() instead of
folio_walk_start().
Change break_ksm_pmd_entry() to call is_ksm_zero_pte() only if we know the
folio is present, and also rename variable ret to found. This will make
it easier to later modify break_ksm() to perform a proper range walk.
Link: https://lkml.kernel.org/r/20251105184912.186329-1-pedrodemargomes@gmail.com
Link: https://lkml.kernel.org/r/20251105184912.186329-2-pedrodemargomes@gmail.com
Link: https://lore.kernel.org/linux-mm/e0886fdf-d198-4130-bd9a-be276c59da37@redhat.com/ [1]
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The state of a memory block should be restricted to values specified in
the documentation of the memory hotplug API. However, since the state
field in the memory_block struct was defined as an unsigned long, this
restriction was not enforced at compile time.
With the introduction of the enum memory_block_state, it is now possible
to incorporate the desired semantics in the field declaration and enforce
these restrictions at compile time.
[akpm@linux-foundation.org: fix whitespace, per Randy]
Link: https://lkml.kernel.org/r/20251029195617.2210700-3-linux@israelbatista.dev.br
Signed-off-by: Israel Batista <linux@israelbatista.dev.br>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: Convert memory block states (MEM_*) macros to enums", v2.
The MEM_* constants indicating the state of a memory block are currently
defined as macros, meaning their definitions will be omitted from the
debuginfo on most kernel builds. This makes it harder for debuggers to
correctly map the block state at runtime, which can be quite useful when
analysing errors related to memory hot plugging and unplugging with tools
such as drgn.
Converting the constants to an enum ensures the correct information is
emitted by the compiler and available for the debugger, without needing to
hard-code them into the debugger and track their changes.
This patch series aims to replace the current macros with a newly created
enum named memory_block_state, while also taking advantage of the compile
time guarantees that we get when using enums.
The first patch does the conversion of the macros to an enum, while the
2nd and 3rd patches use this enum to clean up some type declarations and
make sure that only valid values are used.
This patch (of 3):
Converting the MEM_* constants from macros to an enum ensures that their
values will be correctly emitted in the debug symbols, making it easier to
trace the meaning of each value when debugging with tools such as drgn,
without the need to hard-code the values.
Since the values are mutually exclusive and they are not exposed directly
to userspace, I also dropped the misleading pattern (1<<X) that made it
look like they were combinable flags.
Link: https://lkml.kernel.org/r/20251029195617.2210700-1-linux@israelbatista.dev.br
Link: https://lkml.kernel.org/r/20251029195617.2210700-2-linux@israelbatista.dev.br
Signed-off-by: Israel Batista <linux@israelbatista.dev.br>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Swap devices are assumed to have similar accessing speed when swapon if no
priority is specified. It's unfair and doesn't make sense just because
one swap device is swapped on firstly, its priority will be higher than
the one swapped on later.
Here, set all swap devicess to have priority '-1' by default. With this
change, swap device with default priority will be selected round robin
when swapping out. This can improve the swapping efficiency a lot among
multiple swap devices with default priority.
Below are swapon output during the processes when high pressure
vm-scability test is being taken:
1) This is pre-commit a2468cc9bf, swap device is selectd one by one by
priority from high to low when one swap device is exhausted:
------------------------------------
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 16G -1
/dev/zram1 partition 16G 966.2M -2
/dev/zram2 partition 16G 0B -3
/dev/zram3 partition 16G 0B -4
2) This is behaviour with commit a2468cc9bf, on node, swap device
sharing the same node id is selected firstly until exhausted; while
on node no swap device sharing the node id it selects the one with
highest priority until exhaustd:
------------------------------------
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 15.7G -2
/dev/zram1 partition 16G 3.4G -3
/dev/zram2 partition 16G 3.4G -4
/dev/zram3 partition 16G 2.6G -5
3) After this patch applied, swap devices with default priority are selectd
round robin:
------------------------------------
[root@hp-dl385g10-03 block]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 6.6G -1
/dev/zram1 partition 16G 6.6G -1
/dev/zram2 partition 16G 6.6G -1
/dev/zram3 partition 16G 6.6G -1
With the change, about 18% efficiency promotion relative to node based
way as below. (Surely, the pre-commit a2468cc9bf way is the worst.)
vm-scability test:
==================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
one by one: node based: round robin:
System time: 1087.38 s 637.92 s 526.74 s (lower is better)
Sum Throughput: 2036.55 MB/s 3546.56 MB/s 4207.56 MB/s (higher is better)
Single process Throughput: 65.69 MB/s 114.40 MB/s 135.72 MB/s (high is better)
free latency: 15769409.48 us 10138455.99 us 6810119.01 us(lower is better)
Link: https://lkml.kernel.org/r/20251028034308.929550-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/swapfile.c: select swap devices of default priority round
robin", v5.
Currently, on system with multiple swap devices, swap allocation will
select one swap device according to priority. The swap device with the
highest priority will be chosen to allocate firstly.
People can specify a priority from 0 to 32767 when swapon a swap device,
or the system will set it from -2 then downwards by default. Meanwhile,
on NUMA system, the swap device with node_id will be considered first on
that NUMA node of the node_id.
In the current code, an array of plist, swap_avail_heads[nid], is used to
organize swap devices on each NUMA node. For each NUMA node, there is a
plist organizing all swap devices. The 'prio' value in the plist is the
negated value of the device's priority due to plist being sorted from low
to high. The swap device owning one node_id will be promoted to the front
position on that NUMA node, then other swap devices are put in order of
their default priority.
E.g I got a system with 8 NUMA nodes, and I setup 4 zram partition as
swap devices.
Current behaviour:
their priorities will be(note that -1 is skipped):
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 0B -2
/dev/zram1 partition 16G 0B -3
/dev/zram2 partition 16G 0B -4
/dev/zram3 partition 16G 0B -5
And their positions in the 8 swap_avail_lists[nid] will be:
swap_avail_lists[0]: /* node 0's available swap device list */
zram0 -> zram1 -> zram2 -> zram3
prio:1 prio:3 prio:4 prio:5
swap_avali_lists[1]: /* node 1's available swap device list */
zram1 -> zram0 -> zram2 -> zram3
prio:1 prio:2 prio:4 prio:5
swap_avail_lists[2]: /* node 2's available swap device list */
zram2 -> zram0 -> zram1 -> zram3
prio:1 prio:2 prio:3 prio:5
swap_avail_lists[3]: /* node 3's available swap device list */
zram3 -> zram0 -> zram1 -> zram2
prio:1 prio:2 prio:3 prio:4
swap_avail_lists[4-7]: /* node 4,5,6,7's available swap device list */
zram0 -> zram1 -> zram2 -> zram3
prio:2 prio:3 prio:4 prio:5
The adjustment for swap device with node_id intended to decrease the
pressure of lock contention for one swap device by taking different swap
device on different node. The adjustment was introduced in commit
a2468cc9bf ("swap: choose swap device according to numa node").
However, the adjustment is a little coarse-grained. On the node, the swap
device sharing the node's id will always be selected firstly by node's
CPUs until exhausted, then next one. And on other nodes where no swap
device shares its node id, swap device with priority '-2' will be selected
firstly until exhausted, then next with priority '-3'.
This is the swapon output during the process high pressure vm-scability
test is being taken. It's clearly showing zram0 is heavily exploited
until exhausted.
===================================
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 15.7G -2
/dev/zram1 partition 16G 3.4G -3
/dev/zram2 partition 16G 3.4G -4
/dev/zram3 partition 16G 2.6G -5
The node based strategy on selecting swap device is much better then the
old way one by one selecting swap device. However it is still
unreasonable because swap devices are assumed to have similar accessing
speed if no priority is specified when swapon. It's unfair and doesn't
make sense just because one swap device is swapped on firstly, its
priority will be higher than the one swapped on later.
So in this patchset, change is made to select the swap device round robin
if default priority. In code, the plist array swap_avail_heads[nid] is
replaced with a plist swap_avail_head which reverts commit a2468cc9bf.
Meanwhile, on top of the revert, further change is taken to make any
device w/o specified priority get the same default priority '-1'. Surely,
swap device with specified priority are always put foremost, this is not
impacted. If you care about their different accessing speed, then use
'swapon -p xx' to deploy priority for your swap devices.
New behaviour:
swap_avail_list: /* one global available swap device list */
zram0 -> zram1 -> zram2 -> zram3
prio:1 prio:1 prio:1 prio:1
This is the swapon output during the process high pressure vm-scability
being taken, all is selected round robin:
=======================================
[root@hp-dl385g10-03 linux]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 12.6G -1
/dev/zram1 partition 16G 12.6G -1
/dev/zram2 partition 16G 12.6G -1
/dev/zram3 partition 16G 12.6G -1
With the change, we can see about 18% efficiency promotion as below:
vm-scability test:
==================
Test with:
usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
Before: After:
System time: 637.92 s 526.74 s (lower is better)
Sum Throughput: 3546.56 MB/s 4207.56 MB/s (higher is better)
Single process Throughput: 114.40 MB/s 135.72 MB/s (higher is better)
free latency: 10138455.99 us 6810119.01 us (low is better)
This patch (of 2):
This reverts commit a2468cc9bf ("swap: choose swap device according to
numa node").
After this patch, the behaviour will change back to pre-commit
a2468cc9bf. Means the priority will be set from -1 then downwards by
default, and when swapping, it will exhault swap device one by one
according to priority from high to low. This is preparation work for
later change.
[root@hp-dl385g10-03 ~]# swapon
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 16G 16G -1
/dev/zram1 partition 16G 966.2M -2
/dev/zram2 partition 16G 0B -3
/dev/zram3 partition 16G 0B -4
Link: https://lkml.kernel.org/r/20251028034308.929550-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20251028034308.929550-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Suggested-by: Chris Li <chrisl@kernel.org>
Acked-by: Chris Li <chrisl@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There is no good way to remove DAMON targets in the middle of the existing
targets list. It restricts efficient and flexible DAMON use cases.
Improve the usability by implementing a new DAMON sysfs interface file,
namely obsolete_target, under each target directory. It is connected to
the obsolete field of parameters commit-source targets, so allows removing
arbitrary targets in the middle of existing targets list.
Note that the sysfs files are not automatically updated. For example,
let's suppose there are three targets in the running context, and a user
removes the third target using this feature. If the user writes 'commit'
to the kdamond 'state' file again, DAMON sysfs interface will again try to
remove the third target. But because there is no matching target in the
running context, the commit will fail. It is the user's responsibility to
understand resulting DAMON internal targets list change, and construct
sysfs files (using nr_targets and other sysfs files) to correctly
represent it.
Also note that this is arguably an improvement rather than a fix of broken
things.
Link: https://lkml.kernel.org/r/20251023012535.69625-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Bijan Tabatabai <bijan311@gmail.com>
Closes: https://github.com/damonitor/damo/issues/36
Reviewed-by: Bijan Tabatabai <bijan311@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON sysfs interface tests if given online parameters update request is
valid, by committing those using the DAMON kernel API, to a test-purpose
destination context. The test-purpose destination context is constructed
using damon_new_ctx(), so it has no target, no scheme.
If a source target has the obsolete field set, the test-purpose commit
will fail because damon_commit_targets() fails when there is a source
obsolete target that cannot find its matching destination target. DAMON
sysfs interface is not letting users set the field for now, so there is no
problem. However, the following commit will support that. Also there
could be similar future changes that making commit fails based on current
context structure.
Make the test purpose commit destination context similar to the current
running one, by committing the running one to the test purpose context,
before doing the real test-purpose commit.
Link: https://lkml.kernel.org/r/20251023012535.69625-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Bijan Tabatabai <bijan311@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: support pin-point targets removal".
DAMON maintains the targets in a list, and allows committing only an
entire list of targets having the new parameters. Targets having same
index on the lists are treated as matching source and destination
targets. If an existing target cannot find a matching one in the
sources list, the target is removed. This means that there is no way to
remove only a specific monitoring target in the middle of the current
targets list.
Such pin-point target removal is really needed in some use cases,
though. Monitoring access patterns on virtual address spaces of
processes that spawned from the same ancestor is one example. If a
process of the group is terminated, the user may want to remove the
matching DAMON target as soon as possible, to save in-kernel memory
usage for the unnecessary target data. The user may also want to do
that without turning DAMON off or removing unnecessary targets, to keep
the current monitoring results for other active processes.
Extend DAMON kernel API and sysfs ABI to support the pin-point removal
in the following way. For API, add a new damon_target field, namely
'obsolete'. If the field on parameters commit source target is set, it
means the matching destination target is obsolete. Then the parameters
commit logic removes the destination target from the existing targets
list. For sysfs ABI, add a new file under the target directory, namely
'obsolete_target'. It is connected with the 'obsolete' field of the
commit source targets, so internally using the new API.
Also add a selftest for the new feature. The related helper scripts for
manipulating the sysfs interface and dumping in-kernel DAMON status are
also extended for this. Note that the selftest part was initially
posted as an individual RFC series [1], but now merged into this one.
Bijan Tabatabai has originally reported this issue, and participated in
this solution design on a GitHub issue [1] for DAMON user-space tool.
This patch (of 9):
DAMON's monitoring targets parameters update function,
damon_commit_targets(), is not providing a way to remove a target in the
middle of the existing targets list. Extend the API by adding a field to
struct damon_target. If the field of a damon_commit_targets() source
target is set, it indicates the matching target on the existing targets
list is obsolete. damon_commit_targets() understands that and removes
those from the list, while respecting the index based matching for other
non-obsolete targets.
Link: https://lkml.kernel.org/r/20251023012535.69625-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251023012535.69625-2-sj@kernel.org
Link: https://github.com/damonitor/damo/issues/36 [1]
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Bijan Tabatabai <bijan311@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Allow to override defaults for shemem and tmpfs at config time. This is
consistent with how transparent hugepages can be configured.
Same results can be achieved with the existing
'transparent_hugepage_shmem' and 'transparent_hugepage_tmpfs' settings in
the kernel command line, but it is more convenient to define basic
settings at config time instead of changing kernel command line later.
Defaults for shmem and tmpfs were not changed. They are remained the same
as before: 'never' for both cases. Options 'deny' and 'force' are omitted
intentionally since these are special values and supposed to be used for
emergencies or testing and are not expected to be permanent ones.
Primary motivation for adding config option is to enable policy
enforcement at build time. In large-scale production environments (Meta's
for example), the kernel configuration is often maintained centrally close
to the kernel code itself and owned by the kernel engineers, while boot
parameters are managed independently (e.g. by provisioning systems). In
such setups, the kernel build defines the supported and expected behavior
in a single place, but there is no reliable or uniform control over the
kernel command line options.
A build-time default allows kernel integrators to enforce a predictable
hugepage policy for shmem/tmpfs on a base layer, ensuring reproducible
behavior and avoiding configuration drift caused by possible boot-time
differences.
In short, primary benefit is mostly operational: it provides a way to
codify preferred policy in the kernel configuration, which is versioned,
reviewed, and tested as part of the kernel build process, rather than
depending on potentially variable boot parameters.
[d@ilvokhin.com: v2]
Link: https://lkml.kernel.org/r/aQECPpjd-fU_TC79@shell.ilvokhin.com
Link: https://lkml.kernel.org/r/aPpv8sAa2sYgNu3L@shell.ilvokhin.com
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Kiryl Shutsemau <kas@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm, swap: misc cleanup and bugfix", v2.
A few cleanups and a bugfix that are either suitable after the swap table
phase I or found during code review.
Patch 1 is a bugfix and needs to be included in the stable branch, the
rest have no behavioral change.
This patch (of 5):
Since commit 1b7e90020e ("mm, swap: use percpu cluster as allocation
fast path"), swap allocation is protected by a local lock, which means we
can't do any sleeping calls during allocation.
However, the discard routine is not taken well care of. When the swap
allocator failed to find any usable cluster, it would look at the pending
discard cluster and try to issue some blocking discards. It may not
necessarily sleep, but the cond_resched at the bio layer indicates this is
wrong when combined with a local lock. And the bio GFP flag used for
discard bio is also wrong (not atomic).
It's arguable whether this synchronous discard is helpful at all. In most
cases, the async discard is good enough. And the swap allocator is doing
very differently at organizing the clusters since the recent change, so it
is very rare to see discard clusters piling up.
So far, no issues have been observed or reported with typical SSD setups
under months of high pressure. This issue was found during my code
review. But by hacking the kernel a bit: adding a mdelay(500) in the
async discard path, this issue will be observable with WARNING triggered
by the wrong GFP and cond_resched in the bio layer for debug builds.
So now let's apply a hotfix for this issue: remove the synchronous discard
in the swap allocation path. And when order 0 is failing with all cluster
list drained on all swap devices, try to do a discard following the swap
device priority list. If any discards released some cluster, try the
allocation again. This way, we can still avoid OOM due to swap failure if
the hardware is very slow and memory pressure is extremely high.
This may cause more fragmentation issues if the discarding hardware is
really slow. Ideally, we want to discard pending clusters before
continuing to iterate the fragment cluster lists. This can be implemented
in a cleaner way if we clean up the device list iteration part first.
Link: https://lkml.kernel.org/r/20251024-swap-clean-after-swap-table-p1-v2-0-a709469052e7@tencent.com
Link: https://lkml.kernel.org/r/20251024-swap-clean-after-swap-table-p1-v2-1-c5b0e1092927@tencent.com
Fixes: 1b7e90020e ("mm, swap: use percpu cluster as allocation fast path")
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Folio splitting requires both the folio's original order (@old_order) and
the new target order (@split_order).
In the current implementation, @old_order is repeatedly retrieved using
folio_order().
However, for every iteration after the first, the folio being split is the
result of the previous split, meaning its order is already known to be
equal to the previous iteration's @split_order.
This commit optimizes the logic:
* Instead of calling folio_order(), we now set @old_order directly to
the value of @split_order from the previous iteration.
This change avoids unnecessary function calls and simplifies the loop
setup.
Also it removes a check for non-existent case, since for uniform splitting
we only do split when @split_order == @new_order.
Link: https://lkml.kernel.org/r/20251021212142.25766-5-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: wang lian <lianux.mm@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The loop executed after a successful folio split currently has two
combined responsibilities:
* updating statistics for the new folios
* determining the folio for the next split iteration.
This commit refactors the logic to directly calculate and update folio
statistics, eliminating the need for the iteration step.
We can do this because all necessary information is already available:
* All resulting new folios have the same order, which is @split_order.
* The exact number of new folios can be calculated directly using
@old_order and @split_order.
* The folio for the subsequent split is simply the one containing
@split_at.
By leveraging this knowledge, we can achieve the stat update more cleanly
and efficiently without the looping logic.
Link: https://lkml.kernel.org/r/20251021212142.25766-4-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: wang lian <lianux.mm@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mariano Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Fix stale IOTLB entries for kernel address space", v7.
This proposes a fix for a security vulnerability related to IOMMU Shared
Virtual Addressing (SVA). In an SVA context, an IOMMU can cache kernel
page table entries. When a kernel page table page is freed and
reallocated for another purpose, the IOMMU might still hold stale,
incorrect entries. This can be exploited to cause a use-after-free or
write-after-free condition, potentially leading to privilege escalation or
data corruption.
This solution introduces a deferred freeing mechanism for kernel page
table pages, which provides a safe window to notify the IOMMU to
invalidate its caches before the page is reused.
This patch (of 8):
In the IOMMU Shared Virtual Addressing (SVA) context, the IOMMU hardware
shares and walks the CPU's page tables. The x86 architecture maps the
kernel's virtual address space into the upper portion of every process's
page table. Consequently, in an SVA context, the IOMMU hardware can walk
and cache kernel page table entries.
The Linux kernel currently lacks a notification mechanism for kernel page
table changes, specifically when page table pages are freed and reused.
The IOMMU driver is only notified of changes to user virtual address
mappings. This can cause the IOMMU's internal caches to retain stale
entries for kernel VA.
Use-After-Free (UAF) and Write-After-Free (WAF) conditions arise when
kernel page table pages are freed and later reallocated. The IOMMU could
misinterpret the new data as valid page table entries. The IOMMU might
then walk into attacker-controlled memory, leading to arbitrary physical
memory DMA access or privilege escalation. This is also a
Write-After-Free issue, as the IOMMU will potentially continue to write
Accessed and Dirty bits to the freed memory while attempting to walk the
stale page tables.
Currently, SVA contexts are unprivileged and cannot access kernel
mappings. However, the IOMMU will still walk kernel-only page tables all
the way down to the leaf entries, where it realizes the mapping is for the
kernel and errors out. This means the IOMMU still caches these
intermediate page table entries, making the described vulnerability a real
concern.
Disable SVA on x86 architecture until the IOMMU can receive notification
to flush the paging cache before freeing the CPU kernel page table pages.
Link: https://lkml.kernel.org/r/20251022082635.2462433-1-baolu.lu@linux.intel.com
Link: https://lkml.kernel.org/r/20251022082635.2462433-2-baolu.lu@linux.intel.com
Fixes: 26b25a2b98 ("iommu: Bind process address spaces to devices")
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vasant Hegde <vasant.hegde@amd.com>
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yi Lai <yi1.lai@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__memcg_memory_event() has been unnecessarily marked inline even when it
is not really performance critical. It is usually called to track extreme
conditions. Over the time, it has evolved to include more functionality
and inlining it is causing more harm.
Before the patch:
$ size mm/memcontrol.o net/ipv4/tcp_input.o net/ipv4/tcp_output.o
text data bss dec hex filename
35645 10574 4192 50411 c4eb mm/memcontrol.o
54738 1658 0 56396 dc4c net/ipv4/tcp_input.o
34644 1065 0 35709 8b7d net/ipv4/tcp_output.o
After the patch:
$ size mm/memcontrol.o net/ipv4/tcp_input.o net/ipv4/tcp_output.o
text data bss dec hex filename
35137 10446 4192 49775 c26f mm/memcontrol.o
54322 1562 0 55884 da4c net/ipv4/tcp_input.o
34492 1017 0 35509 8ab5 net/ipv4/tcp_output.o
[akpm@linux-foundation.org: use EXPORT_SYMBOL_GPL for __memcg_memory_event, per Michal and Christoph]
Link: https://lkml.kernel.org/r/20251021234425.1885471-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sometimes, vm_area_alloc_pages() will want many pages from the buddy
allocator. Rather than making requests to the buddy allocator for at most
100 pages at a time, we can eagerly request large order pages a smaller
number of times.
We still split the large order pages down to order-0 as the rest of the
vmalloc code (and some callers) depend on it. We still defer to the bulk
allocator and fallback path in case of order-0 pages or failure.
Running 1000 iterations of allocations on a small 4GB system finds:
1000 2mb allocations:
[Baseline] [This patch]
real 46.310s real 0m34.582
user 0.001s user 0.006s
sys 46.058s sys 0m34.365s
10000 200kb allocations:
[Baseline] [This patch]
real 56.104s real 0m43.696
user 0.001s user 0.003s
sys 55.375s sys 0m42.995s
Link: https://lkml.kernel.org/r/20251021194455.33351-2-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When setting regions in DAMON_RECLAIM, DAMON_MIN_REGION will be applied as
the core address alignment, and the monitoring target address ranges would
be aligned on DAMON_MIN_REGION * addr_unit. When users 1) set addr_unit
to a value larger than 1, and 2) set the monitoring target address range
as not aligned on DAMON_MIN_REGION * addr_unit, it will cause
DAMON_RECLAIM to operate on unexpectedly large physical address ranges.
For example, if the user sets the monitoring target address range to [4,
8) and addr_unit as 1024, the aimed monitoring target address range is [4
KiB, 8 KiB). Assuming DAMON_MIN_REGION is 4096, so resulting target
address range will be [0, 4096) in the DAMON core layer address system,
and [0, 4 MiB) in the physical address space, which is an unexpected
range.
To fix the issue, use min_sz_region for core address alignment when
setting regions.
Link: https://lkml.kernel.org/r/20251020130125.2875164-3-yanquanmin1@huawei.com
Fixes: 7db551fcfb ("mm/damon/reclaim: support addr_unit for DAMON_RECLAIM")
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: fixes for address alignment issues in
DAMON_LRU_SORT and DAMON_RECLAIM", v2.
In DAMON_LRU_SORT and DAMON_RECLAIM, damon_set_regions() will apply
DAMON_MIN_REGION as the core address alignment, and the monitoring target
address ranges would be aligned on DAMON_MIN_REGION * addr_unit. When
users 1) set addr_unit to a value larger than 1, and 2) set the monitoring
target address range as not aligned on DAMON_MIN_REGION * addr_unit, it
will cause DAMON_LRU_SORT and DAMON_RECLAIM to operate on unexpectedly
large physical address ranges.
For example, if the user sets the monitoring target address range to [4,
8) and addr_unit as 1024, the aimed monitoring target address range is [4
KiB, 8 KiB). Assuming DAMON_MIN_REGION is 4096, so resulting target
address range will be [0, 4096) in the DAMON core layer address system,
and [0, 4 MiB) in the physical address space, which is an unexpected
range.
To fix the issue, add a min_sz_region parameter to
damon_set_region_biggest_system_ram_default() and use it when calling
damon_set_regions(), replacing the direct use of DAMON_MIN_REGION.
This patch (of 2):
In DAMON_LRU_SORT, damon_set_regions() will apply DAMON_MIN_REGION as the
core address alignment, and the monitoring target address ranges would be
aligned on DAMON_MIN_REGION * addr_unit. When users 1) set addr_unit to a
value larger than 1, and 2) set the monitoring target address range as not
aligned on DAMON_MIN_REGION * addr_unit, it will cause DAMON_LRU_SORT to
operate on unexpectedly large physical address ranges.
For example, if the user sets the monitoring target address range to [4,
8) and addr_unit as 1024, the aimed monitoring target address range is [4
KiB, 8 KiB). Assuming DAMON_MIN_REGION is 4096, so resulting target
address range will be [0, 4096) in the DAMON core layer address system,
and [0, 4 MiB) in the physical address space, which is an unexpected
range.
To fix the issue, add a min_sz_region parameter to
damon_set_region_biggest_system_ram_default() and use it when calling
damon_set_regions(), replacing the direct use of DAMON_MIN_REGION.
Link: https://lkml.kernel.org/r/20251020130125.2875164-1-yanquanmin1@huawei.com
Link: https://lkml.kernel.org/r/20251020130125.2875164-2-yanquanmin1@huawei.com
Fixes: 2e0fe9245d ("mm/damon/lru_sort: support addr_unit for DAMON_LRU_SORT")
Signed-off-by: Quanmin Yan <yanquanmin1@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: ze zuo <zuoze1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a variant of DAMOS_QUOTA_NODE_MEMCG_USED_BP, for the free memory
portion. The value of the metric is implemented as the entire memory of
the given NUMA node subtracted by the given cgroup's usage. So from a
perspective, "unused" could be a better term than "free". But arguably it
is not very clear what is better, so use the term "free".
Link: https://lkml.kernel.org/r/20251017212706.183502-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add support of DAMOS_QUOTA_NODE_MEMCG_USED_BP. For this, extend quota
goal metric inputs for the new metric, and update DAMOS core layer request
construction logic to set the target cgroup, which is specified by the
user, via the 'path' file.
Link: https://lkml.kernel.org/r/20251017212706.183502-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement the handling of the new DAMOS quota goal metric for per-memcg
per-node memory usage, namely DAMOS_QUOTA_NODE_MEMCG_USED_BP. The metric
value is calculated as the sum of active/inactive anon/file pages of the
given cgroup for a given NUMA node.
Link: https://lkml.kernel.org/r/20251017212706.183502-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Define a new DAMOS quota auto-tuning target metric for per-cgroup per-node
memory usage. For specifying the cgroup of the interest, add a field,
namely memcg_id, to damos_quota_goal struct.
Note that this commit is only implementing the interface. The handling of
the interface (the metric value calculation) will be implemented in the
following commit.
Link: https://lkml.kernel.org/r/20251017212706.183502-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/damon: allow DAMOS auto-tuned for per-memcg per-node
memory usage".
Introduce two new DAMOS quota auto-tuning target metrics for per-cgroup
per-NUMA node memory utilization. Expected use cases are cgroup level
access-aware NUMA memory managements, such as memory tiering or proactive
reclamation on cgroup-based multi-tenant NUMA systems.
Background
==========
The aim-oriented aggressiveness auto-tuning feature of DAMOS is a highly
recommended way for modern DAMOS use cases. Using it, users can specify
what system status they want to achieve with what access-aware system
operations. For example, reclaim cold memory aiming for 0.5 percent of
memory pressure (proactive reclaim), or migrate hot and cold memory
between NUMA nodes having different speed (memory tiering). Then DAMOS
automatically adjusts the aggressiveness of the system operation (e.g.,
increase/decrease reclaim target coldness threshold) based on current
status of the system.
The use case is limited by the supported system status metrics for
specifying the target system status. Two new system metrics for per-node
memory usage ratio, namely DAMOS_QUOTA_NODE_MEM_{USED,FREE}_BP, were
recently added to extend the use cases for access-aware NUMA nodes
management, such as memory tiering. Those are expected to be useful for
not only memory tiering but also general access-aware inter-NUMA node page
migration, though.
Limitation
----------
The per-node memory usage based auto-tuning can be applied only
system-wide. For cgroups-based multi-tenant systems, it could arguably
harm the fairness. For example, a cgroup may use faster NUMA node memory
more than other cgroup, depending on their access pattern. If the user of
each cgroup are promised to get the same quality and amount of the system
resource, this can arguably be an unfair situation.
DAMOS supports cgroup level system operations via DAMOS filter. But the
quota auto-tuning system is not aware of cgroups.
New DAMOS Quota Tuning Metrics for Per-Cgroup Per-NUMA Memory Usage
===================================================================
To overcome the limitation, introduce two new DAMOS quota auto-tuning goal
metrics, namely DAMOS_QUOTA_NODE_MEMCG_{USED,FREE}_BP. Those can be
thought of as a variant of DAMOS_QUOTA_NODE_MEM_{USED,FREE}_BP that
extended for cgroups.
The two metrics specifies per-cgroup, per-node amount of used and unused
memory in ratio to the total memory of the node. For example, let's
assume a system has two NUMA nodes of size 100 GiB and 50 GiB. And two
cgroups are using 40 GiB and 60 GiB of node 0, 20 GiB and 10 GiB of node
1, respectively, as illustrated by the below table.
node-0 node-1
Total memory 100 GiB 50 GiB
Cgroup A usage 40 GiB 20 GiB
Cgroup B usage 60 GiB 10 GiB
Then, DAMOS_QUOTA_NODE_MEMCG_USED_BP for the cgroups for the first node
are, 40 GiB / 100 GiB = 4,000 bp (40 percent) and 60 GiB / 100 GiB = 6,000
bp (60 percent), respectively. Those for the second node are, 20 GiB / 50
GiB = 4000 bp (40 percent) and 10 GiB / 50 GiB = 2000 bp (20 percent),
respectively.
DAMOS_QUOTA_NODE_MEMCG_FREE_BP for the four cases are, 60 GiB /100 GiB =
6000 bp, 40 GiB / 100 GiB = 4000 bp, 30 GiB / 50 GiB = 6000 bp, and 40 GiB
/ 50 GiB = 8000 bp, respectively.
DAMOS_QUOTA_NODE_MEMCG_USED_BP for cgroup A node-0: 4000 bp
DAMOS_QUOTA_NODE_MEMCG_USED_BP for cgroup B node-0: 6000 bp
DAMOS_QUOTA_NODE_MEMCG_USED_BP for cgroup A node-1: 4000 bp
DAMOS_QUOTA_NODE_MEMCG_USED_BP for cgroup B node-1: 2000 bp
DAMOS_QUOTA_NODE_MEMCG_FREE_BP for cgroup A node-0: 6000 bp
DAMOS_QUOTA_NODE_MEMCG_FREE_BP for cgroup B node-0: 4000 bp
DAMOS_QUOTA_NODE_MEMCG_FREE_BP for cgroup A node-1: 6000 bp
DAMOS_QUOTA_NODE_MEMCG_FREE_BP for cgroup B node-1: 8000 bp
Using these, users can specify how much [un]used amount of memory for
per-cgroup and per-node DAMOS should make as a result of the auto-tuning.
Example Usecase: Cgroup Level Memory Tiering
============================================
Let's suppose a typical and simple tiered memory system. The system
equips two NUMA nodes. The first node (node 0) is CPU-attached and fast.
The second node (node 1) is CPU-unattached and slow. It runs two cgroups
that desire to use about 30 percent and 70 percent of the faster node as
much as possible for their hot data, respectively. Then, the user can
implement DAMOS-based memory tiering for the system using the DAMON
user-space tool (damo), like below.
# ./damo start \
`# kdamond for node 1 (slow)` \
--numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \
`# promotion scheme for cgroup a` \
--damos_action migrate_hot 0 --damos_access_rate 5% max \
--damos_apply_interval 1s \
--damos_filter allow memcg /workloads/a \
--damos_filter allow young \
--damos_quota_interval 1s --damos_quota_space 200MB \
--damos_quota_goal node_memcg_used_bp 29.7% 0 /workloads/a \
\
`# promotion scheme for cgroup b` \
--damos_action migrate_hot 0 --damos_access_rate 5% max \
--damos_apply_interval 1s \
--damos_filter allow memcg /workloads/b \
--damos_filter allow young \
--damos_quota_interval 1s --damos_quota_space 200MB \
--damos_quota_goal node_memcg_used_bp 69.7% 0 workloads/b \
\
`# kdamond for node 0 (fast)` \
--numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \
`# demotion scheme for cgroup a` \
--damos_action migrate_cold 1 --damos_access_rate 0% 0% \
--damos_apply_interval 1s \
--damos_filter allow memcg /workloads/a \
--damos_filter reject young \
--damos_quota_interval 1s --damos_quota_space 200MB \
--damos_quota_goal node_memcg_free_bp 70.5% 0 \
\
`# demotion scheme for cgroup b` \
--damos_action migrate_cold 1 --damos_access_rate 0% 0% \
--damos_apply_interval 1s \
--damos_filter allow memcg /workloads/a \
--damos_filter reject young \
--damos_quota_interval 1s --damos_quota_space 200MB \
--damos_quota_goal node_memcg_free_bp 30.5% 0 \
\
--damos_nr_quota_goals 1 1 1 1 --damos_nr_filters 1 1 1 1 \
--nr_targets 1 1 --nr_schemes 2 2 --nr_ctxs 1 1
With the command, the user-space tool will ask DAMON to spawn two kernel
threads, each for monitoring accesses to node 1 (slow) and node 0 (fast),
respectively. It installs two DAMOS schemes on each thread. Let's call
them "promotion scheme for cgroup a/b", and "demotion scheme for cgroup
a/b" in the order. The promotion schemes are installed on the DAMON
thread for node 1 (slow), and demotion schemes are installed on the DAMON
thread for node 0 (fast).
Cgroup Level Hot Pages Migration (Promotion)
--------------------------------------------
Promotion schemes will find memory regions on node 1 (slow), that some
access was detected. The schemes will then migrate the found memory to
node 0 (fast), hottest pages first.
For accurate and effective migration, these schemes use two page level
filters. First, the migration will be filtered for only cgroup A and
cgroup B. That is, "promotion scheme for cgroup B" will not do the
migration if the page is for cgroup A. Secondly, the schemes will ignore
pages that having their page table's Accessed bits unset. The per-page
Accessed bit check logic will also unset the bit if it was set, for the
next check.
For controlled amounts of system resource consumption and aiming on the
target memory usage, the schemes use quotas setup. The migration is
limited to be done only up to 200 MiB per second, to limit the peak system
resource usage. And DAMOS_QUOTA_NODE_MEMCG_USED_BP target is set for
29.7% and 69.7% of node 0 (fast), respectively. The target value is lower
than the high level goal (30% and 70% system memory), to give headroom on
node 0 (fast). DAMOS will adjust the speed of the pages migration based
on the target and current per-cgroup node 0 memory usage. For example, if
cgroup A is utilizing only 10% of node 0, DAMOS will try to migrate more
of cgroup A hot pages from node 1 to node 0, up to 200 MiB per second. If
cgroup A utilizes more than 29.7% of node 0 memory, the cgroup A hot pages
migration from node 1 to node 0 will be slowed and eventually stopped.
Cgroup Level Cold Pages Migration (Demotion)
--------------------------------------------
Demotion schemes are similar to promotion schemes, but differ in filtering
setup and quota tuning setup. Those filter out pages having their page
table Accessed bits set. And set 70.5% and 30.5% of node 0 memory free
rate for the cgroup A and B, respectively. Hence, if promotion schemes or
something made cgroup A and/or B uses more than 29.5% and 69.5% of node 0,
demotion schemes will start migrating cold pages of appropriate cgroups in
node 0 to node 1, under the 200 MiB per second speed cap, while adjusting
the speed based on how much more than wanted memory is being used.
The quota target values are set to overlap with promotion targets, to keep
a minimum level of page exchanges between the nodes. This is to avoid a
case that the target memory utilization is met, and then access pattern
changes (pages in node 1 become hotter than pages in node 0) while the
memory utilization is unchanged. Without the overlap, neither promotion
of hotter pages in node 1, nor demotion of colder pages in node 0 will
happen since both goals are met. As a result, the faster and slower node
will unexpectedly serve cold and hot data.
Test: Per-cgroup Memory Tiering
===============================
I ran a simplified cgroup level memory tiering using the feature, and
confirmed it works as intended.
Setup
-----
I configured a QEMU virtual machine representing a simplified version of
the system that described on the above cgroup level memory tiering example
use case. The system equips 40 CPU cores and two NUMA nodes each having
30 GiB physical memory. The first node (node 0) represents the faster
NUMA node, and the second node (node 1) represents the slower NUMA node.
In specific, below qemu command line options are used.
[...]
-object memory-backend-ram,size=30G,id=m0 \
-object memory-backend-ram,size=30G,id=m1 \
-numa node,cpus=0-39,memdev=m0 \
-numa node,memdev=m1 \
[...]
I booted the virtual machine with a kernel that this patch series is
applied. On the virtual machine, I created two cgroups, namely workload_a
and workload_b. And ran a test program in each cgroup, resulting in one
process per cgroup. The test program allocates 10 GiB memory and evenly
split it into 10 regions. After the allocation, it repeatedly access the
first region for one minute, than the second one for one minute, and so
on. After the one minute repeated access for the 10-th region is done, it
repeats the access from the first region. So the process has 10 GiB of
data in total, but only 1 GiB of it is hot at a given moment, and the hot
data is gradually changed.
While the processes are running, run DAMON for a simple access-aware
memory tiering using below script. It migrates hot and cold data of the
cgroups into node 0 and node 1, aiming the first and the second cgroups
(workload_a and workload_b, respectively) utilizing about 9.7 percent and
19.7 percent of node 0, respectively.
Note that this setup is a simplified version of the above example use
case, for ease of test. Also note that we assigned 30 GiB physical memory
to node 0, but DAMON in this setup works for only 27 GiB of the memory.
It is due to an internal implementation detail of DAMON user-space tool
that not really important for this test.
#!/bin/bash
damo start \
--numa_node 1 \
--damos_action migrate_hot 0 --damos_access_rate 5% max \
--damos_apply_interval 1s \
--damos_filter allow memcg /workload_a \
--damos_filter allow young \
--damos_quota_interval 1s \
--damos_quota_goal node_memcg_used_bp 9.7% 0 /workload_a \
--damos_action migrate_hot 0 --damos_access_rate 5% max \
--damos_apply_interval 1s \
--damos_filter allow memcg /workload_b \
--damos_filter allow young \
--damos_quota_interval 1s \
--damos_quota_goal node_memcg_used_bp 19.7% 0 /workload_b \
--numa_node 0 \
--damos_action migrate_cold 1 --damos_access_rate 0% 0% \
--damos_apply_interval 1s \
--damos_filter allow memcg /workload_a \
--damos_filter reject young \
--damos_quota_interval 1s \
--damos_quota_goal node_memcg_free_bp 90.5% 0 /workload_a \
--damos_action migrate_cold 1 --damos_access_rate 0% 0% \
--damos_apply_interval 1s \
--damos_filter allow memcg /workload_b \
--damos_filter reject young \
--damos_quota_interval 1s \
--damos_quota_goal node_memcg_free_bp 80.5% 0 /workload_b \
--damos_nr_quota_goals 1 1 1 1 --damos_nr_filters 2 2 2 2 \
--nr_targets 1 1 --nr_schemes 2 2 --nr_ctxs 1 1
After starting DAMON, the pages continuously be migrated across nodes. A
few minutes later, the memory usage of the cgroups converges into the
aimed amounts, and keeps the level, as expected. To confirm the status is
kept in the target level as expected, I collected the memory usage stat of
the cgroups using memory.numa_stat file, after the stats are converged. I
repeat the stat collection 42 times with 5 seconds delay between each of
the collections. The results are as below:
node0_memory_usage average stdev
workload_a 2.79GiB 522.06MiB
workload_b 5.15GiB 739.10MiB
The average values are quite close to the targeted values: 27 GiB * 9.7% =
2.619 GiB for workload_a, and 27 GiB * 19.7% = 5.319 GiB. A level of
variances are expected, given the overlap of the promotion/demotion
targets, and dynamic data access pattern of the workloads. Give that, the
measured variances are at a reasonable level.
Patches Sequence
================
The first patch (patch 1) updates the kernel-doc comment of
damos_quota_goal struct to clarify usage of optional fields of the struct,
since later patches will add such optional fields.
Following four patches (patches 2-5) implement a new DAMOS quota goal
metric for per-cgroup per-node memory usage. Those extends the core layer
interface for the new metric (patch 2), implement the metric value
calculation on the core layer (patch 3), add DAMON sysfs interface file
for the target cgroup specification (patch 4), and implement support of
the new metric on DAMON sysfs interface (patch 5).
Next two patches implment the second new DAMOS quota goal metric for
per-cgroup per-node free (or, unused) memory. Those implement it in the
core layer (patch 6) and DAMON sysfs interface (patch 7), extending the
existing implementation for memory usage metric.
Final three patches update the design (patch 8), the usage (patch 9), and
the ABI (patch 10) documents for the changes that are introduced by this
patch series.
This patch (of 10):
damos_quota_goal kerneldoc comment is not explaining when @metric is used.
Update the comment for that.
Link: https://lkml.kernel.org/r/20251017212706.183502-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251017212706.183502-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit 6b0dfabb35 ("fs: Remove aops->writepage"), we no longer
attempt to write back filesystem folios through reclaim.
However, in the shrink_folio_list() function, there still remains some
logic related to writeback control of dirty file folios. The original
logic was that, for direct reclaim, or when folio_test_reclaim() is false,
or the PGDAT_DIRTY flag is not set, the dirty file folios would be
directly activated to avoid being scanned again; otherwise, it will try to
writeback the dirty file folios. However, since we can no longer perform
writeback on dirty folios, the dirty file folios will still be activated.
Additionally, under the original logic, if we continue to try writeback
dirty file folios, we will also check the references flag,
sc->may_writepage, and may_enter_fs(), which may result in dirty file
folios being left in the inactive list. This is unreasonable. Even if
these dirty folios are scanned again, we still cannot clean them.
Therefore, the checks on these dirty file folios appear to be redundant
and can be removed. Dirty file folios should be directly moved to the
active list to avoid being scanned again. Since we set the PG_reclaim
flag for the dirty folios, once the writeback is completed, they will be
moved back to the tail of the inactive list to be retried for quick
reclaim.
Link: https://lkml.kernel.org/r/ba5c49955fd93c6850bcc19abf0e02e1573768aa.1760687075.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel can throttle network sockets if the memory cgroup associated
with the corresponding socket is under memory pressure. The throttling
actions include clamping the transmit window, failing to expand receive or
send buffers, aggressively prune out-of-order receive queue, FIN deferred
to a retransmitted packet and more. Let's add memcg metric to track such
throttling actions.
At the moment memcg memory pressure is defined through vmpressure and in
future it may be defined using PSI or we may add more flexible way for the
users to define memory pressure, maybe through ebpf. However the
potential throttling actions will remain the same, so this newly
introduced metric will continue to track throttling actions irrespective
of how memcg memory pressure is defined.
Link: https://lkml.kernel.org/r/20251016161035.86161-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kacinski <kuba@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The pcp locking relies on pcp_spin_trylock() which has to be used together
with pcp_trylock_prepare()/pcp_trylock_finish() to work properly on !SMP
!RT configs. This is tedious and error-prone.
We can remove pcp_spin_lock() and underlying pcpu_spin_lock() because we
don't use it. Afterwards pcp_spin_unlock() is only used together with
pcp_spin_trylock(). Therefore we can add the UP_flags parameter to them
both and handle pcp_trylock_prepare()/finish() within.
Additionally for the configs where pcp_trylock_prepare()/finish() are
no-op (SMP || RT) make them pass &UP_flags to a no-op inline function.
This ensures typechecking and makes the local variable "used" so we can
remove the __maybe_unused attributes.
In my compile testing, bloat-o-meter reported no change on SMP config, so
the compiler is capable of optimizing away the no-ops same as before, and
we have simplified the code using pcp_spin_trylock().
Link: https://lkml.kernel.org/r/20251015-b4-pcp-lock-cleanup-v2-1-740d999595d5@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Before returning, free_frozen_page_commit calls free_pcppages_bulk using
nr_pcp_free to determine how many pages can appropritately be freed, based
on the tunable parameters stored in pcp. While this number is an accurate
representation of how many pages should be freed in total, it is not an
appropriate number of pages to free at once using free_pcppages_bulk,
since we have seen the value consistently go above 2000 in the Meta fleet
on larger machines.
As such, perform batched page freeing in free_pcppages_bulk by using
pcp->batch. In order to ensure that other processes are not starved of
the zone lock, free both the zone lock and pcp lock to yield to other
threads.
Note that because free_frozen_page_commit now performs a spinlock inside
the function (and can fail), the function may now return with a freed pcp.
To handle this, return true if the pcp is locked on exit and false
otherwise.
In addition, since free_frozen_page_commit must now be aware of what UP
flags were stored at the time of the spin lock, and because we must be
able to report new UP flags to the callers, add a new unsigned long*
parameter UP_flags to keep track of this.
The following are a few synthetic benchmarks, made on three machines. The
first is a large machine with 754GiB memory and 316 processors. The
second is a relatively smaller machine with 251GiB memory and 176
processors. The third and final is the smallest of the three, which has
62GiB memory and 36 processors.
On all machines, I kick off a kernel build with -j$(nproc). Negative
delta is better (faster compilation)
Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+-----------+
| real | 0.8070 | - 1.4865 |
| user | 0.2823 | + 0.4081 |
| sys | 5.0267 | -11.8737 |
+------------+---------------+-----------+
Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.2806 | +0.0351 |
| user | 0.0994 | +0.3170 |
| sys | 0.6229 | -0.6277 |
+------------+---------------+----------+
Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.1503 | -2.6585 |
| user | 0.0431 | -2.2984 |
| sys | 0.1870 | -3.2013 |
+------------+---------------+----------+
Here, variation is the coefficient of variation, i.e. standard deviation
/ mean.
[joshua.hahnjy@gmail.com: simplify checks]
Link: https://lkml.kernel.org/r/20251014192827.851389-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251014145011.3427205-4-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Suggested-by: Chris Mason <clm@fb.com>
Co-developed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/page_alloc: Batch callers of free_pcppages_bulk", v5.
Motivation & Approach
=====================
While testing workloads with high sustained memory pressure on large
machines in the Meta fleet (1Tb memory, 316 CPUs), we saw an unexpectedly
high number of softlockups. Further investigation showed that the zone
lock in free_pcppages_bulk was being held for a long time, and was called
to free 2k+ pages over 100 times just during boot.
This causes starvation in other processes for the zone lock, which can
lead to the system stalling as multiple threads cannot make progress
without the locks. We can see these issues manifesting as warnings:
[ 4512.591979] rcu: INFO: rcu_sched self-detected stall on CPU
[ 4512.604370] rcu: 20-....: (9312 ticks this GP) idle=a654/1/0x4000000000000000 softirq=309340/309344 fqs=5426
[ 4512.626401] rcu: hardirqs softirqs csw/system
[ 4512.638793] rcu: number: 0 145 0
[ 4512.651177] rcu: cputime: 30 10410 174 ==> 10558(ms)
[ 4512.666657] rcu: (t=21077 jiffies g=783665 q=1242213 ncpus=316)
While these warnings don't indicate a crash or a kernel panic, they do
point to the underlying issue of lock contention. To prevent starvation
in both locks, batch the freeing of pages using pcp->batch.
Because free_pcppages_bulk is called with the pcp lock and acquires the
zone lock, relinquishing and reacquiring the locks are only effective when
both of them are broken together (unless the system was built with queued
spinlocks). Thus, instead of modifying free_pcppages_bulk to break both
locks, batch the freeing from its callers instead.
A similar fix has been implemented in the Meta fleet, and we have seen
significantly less softlockups.
Testing
=======
The following are a few synthetic benchmarks, made on three machines. The
first is a large machine with 754GiB memory and 316 processors.
The second is a relatively smaller machine with 251GiB memory and 176
processors. The third and final is the smallest of the three, which has 62GiB
memory and 36 processors.
On all machines, I kick off a kernel build with -j$(nproc).
Negative delta is better (faster compilation).
Large machine (754GiB memory, 316 processors)
make -j$(nproc)
+------------+---------------+-----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+-----------+
| real | 0.8070 | - 1.4865 |
| user | 0.2823 | + 0.4081 |
| sys | 5.0267 | -11.8737 |
+------------+---------------+-----------+
Medium machine (251GiB memory, 176 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.2806 | +0.0351 |
| user | 0.0994 | +0.3170 |
| sys | 0.6229 | -0.6277 |
+------------+---------------+----------+
Small machine (62GiB memory, 36 processors)
make -j$(nproc)
+------------+---------------+----------+
| Metric (s) | Variation (%) | Delta(%) |
+------------+---------------+----------+
| real | 0.1503 | -2.6585 |
| user | 0.0431 | -2.2984 |
| sys | 0.1870 | -3.2013 |
+------------+---------------+----------+
Here, variation is the coefficient of variation, i.e. standard deviation
/ mean.
Based on these results, it seems like there are varying degrees to how
much lock contention this reduces. For the largest and smallest machines
that I ran the tests on, it seems like there is quite some significant
reduction. There is also some performance increases visible from
userspace.
Interestingly, the performance gains don't scale with the size of the
machine, but rather there seems to be a dip in the gain there is for the
medium-sized machine. One possible theory is that because the high
watermark depends on both memory and the number of local CPUs, what
impacts zone contention the most is not these individual values, but
rather the ratio of mem:processors.
This patch (of 5):
Currently, refresh_cpu_vm_stats returns an int, indicating how many
changes were made during its updates. Using this information, callers
like vmstat_update can heuristically determine if more work will be done
in the future.
However, all of refresh_cpu_vm_stats's callers either (a) ignore the
result, only caring about performing the updates, or (b) only care about
whether changes were made, but not *how many* changes were made.
Simplify the code by returning a bool instead to indicate if updates
were made.
In addition, simplify fold_diff and decay_pcp_high to return a bool
for the same reason.
Link: https://lkml.kernel.org/r/20251014145011.3427205-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251014145011.3427205-2-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel-doc for __vmalloc_node_noprof() incorrectly states that
__GFP_NOFAIL reclaim modifier is not supported. In fact it has been
supported since commit 9376130c39 ("mm/vmalloc: add support for
__GFP_NOFAIL").
To avoid duplication and future drift, point this helper's doc to
__vmalloc_node_range_noprof() for details and the full description.
Link: https://lkml.kernel.org/r/20251013174222.90123-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In mm/vmalloc.c, the function vmap_pte_range() assumes that the mapping
size is aligned to PAGE_SIZE. If this assumption is violated, the loop
will become infinite because the termination condition (`addr != end`)
will never be met. This can lead to overwriting other VA ranges and/or
random pages physically follow the page table.
It's the caller's responsibility to ensure that the mapping size is
aligned to PAGE_SIZE. However, the memory corruption is hard to root
cause. To identify the programming error in the caller easier, check
whether the mapping size is PAGE_SIZE aligned with WARN_ON_ONCE().
[yadong.qi@linux.alibaba.com: fix uninitialized value issue]
Closes: https://lore.kernel.org/r/202510110050.VG9YKMRK-lkp@intel.com/
Link: https://lkml.kernel.org/r/20251010014311.1689-1-yadong.qi@linux.alibaba.com
Signed-off-by: Yadong Qi <yadong.qi@linux.alibaba.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current implementation uses nested loops: first iterating over all
online nodes, then over zones within each node. This can be simplified by
using the for_each_populated_zone() macro which directly iterates through
all populated zones.
This change:
1. Removes the intermediate init_zones_in_node() function
2. Simplifies init_early_allocated_pages() to use direct zone iteration
3. Updates init_pages_in_zone() to take only zone parameter and access
node_id via zone->zone_pgdat
The functionality remains identical, but the code is cleaner and more
maintainable.
Link: https://lkml.kernel.org/r/20250930092153.843109-2-husong@kylinos.cn
Signed-off-by: Song Hu <husong@kylinos.cn>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Ye Liu <liuye@kylinos.cn>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Deduplication of kasan_enabled() checks which are already used by callers.
* Altered functions:
check_page_allocation
Delete the check because callers have it already in __wrappers in
include/linux/kasan.h:
__kasan_kfree_large
__kasan_mempool_poison_pages
__kasan_mempool_poison_object
kasan_populate_vmalloc, kasan_release_vmalloc
Add __wrappers in include/linux/kasan.h.
They are called externally in mm/vmalloc.c.
__kasan_unpoison_vmalloc, __kasan_poison_vmalloc
Delete checks because there're already kasan_enabled() checks
in respective __wrappers in include/linux/kasan.h.
release_free_meta -- Delete the check because the higher caller path
has it already. See the stack trace:
__kasan_slab_free -- has the check already
__kasan_mempool_poison_object -- has the check already
poison_slab_object
kasan_save_free_info
release_free_meta
kasan_enabled() -- Delete here
Link: https://lkml.kernel.org/r/20251009155403.1379150-3-snovitoll@gmail.com
Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that rmap_walk() is guaranteed to be called with the folio lock held,
we can stop serializing on the src VMA anon_vma lock when moving an
exclusive folio from a src VMA to a dst VMA in UFFDIO_MOVE ioctl.
When moving a folio, we modify folio->mapping through
folio_move_anon_rmap() and adjust folio->index accordingly. Doing that
while we could have concurrent RMAP walks would be dangerous. Therefore,
to avoid that, we had to acquire anon_vma of src VMA in write-mode. That
meant that when multiple threads called UFFDIO_MOVE concurrently on
distinct pages of the same src VMA, they would serialize on it, hurting
scalability.
In addition to avoiding the scalability bottleneck, this patch also
simplifies the complicated lock dance that UFFDIO_MOVE has to go through
between RCU, folio-lock, ptl, and anon_vma.
folio_move_anon_rmap() already enforces that the folio is locked. So when
we have the folio locked we can no longer race with concurrent rmap_walk()
as used by folio_referenced() and others who call it on unlocked non-KSM
anon folios, and therefore the anon_vma lock is no longer required.
Note that this handling is now the same as for other
folio_move_anon_rmap() users that also do not hold the anon_vma lock --
namely COW reuse handling (do_wp_page()->wp_can_reuse_anon_folio(),
do_huge_pmd_wp_page(), and hugetlb_wp()). These users never required the
anon_vma lock as they are only moving the anon VMA closer to the anon_vma
leaf of the VMA, for example, from an anon_vma root to a leaf of that
root. rmap walks were always able to tolerate that scenario.
Link: https://lkml.kernel.org/r/20250923071019.775806-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Improve UFFDIO_MOVE scalability by removing anon_vma lock", v2.
Userfaultfd has a scalability issue in its UFFDIO_MOVE ioctl, which is
heavily used in Android as its java garbage collector uses it for
concurrent heap compaction.
The issue arises because UFFDIO_MOVE updates folio->mapping to an anon_vma
with a different root, in order to move the folio from a src VMA to dst
VMA. It performs the operation with the folio locked, but this is
insufficient, because rmap_walk() can be performed on non-KSM anonymous
folios without folio lock.
This means that UFFDIO_MOVE has to acquire the anon_vma write lock of the
root anon_vma belonging to the folio it wishes to move.
This causes scalability bottleneck when multiple threads perform
UFFDIO_MOVE simultanously on distinct pages of the same src VMA. In field
traces of arm64 android devices, we have observed janky user interactions
due to long (sometimes over ~50ms) uninterruptible sleeps on main UI
thread caused by anon_vma lock contention in UFFDIO_MOVE. This is
particularly severe during the beginning of GC's compaction phase when it
is likely to have multiple threads involved.
This patch resolves the issue by removing the exception in rmap_walk() for
non-KSM anon folios by ensuring that all folios are locked during rmap
walk. This is less problematic than it might seem, as the only major
caller which utilises this mode is shrink_active_list(), which is covered
in detail in the first patch of this series.
As a result of changing our approach to locking, we can remove all the
code that took steps to acquire an anon_vma write lock instead of a folio
lock. This results in a significant simplification and scalability
improvement of the code (currently only in UFFDIO_MOVE). Furthermore, as
a side-effect, folio_lock_anon_vma_read() gets simpler as we don't need to
worry that folio->mapping may have changed under us.
This patch (of 2):
Guarantee that rmap_walk() is called on locked folios so that threads
changing folio->mapping and folio->index for non-KSM anon folios can
serialize on fine-grained folio lock rather than anon_vma lock. Other
folio types are already always locked before rmap_walk(). With this, we
are going from 'not necessarily' locking the non-KSM anon folio to
'definitely' locking it during rmap walks.
This patch is in preparation for removing anon_vma write-lock from
UFFDIO_MOVE.
With this patch, three functions are now expected to be called with a
locked folio. To be careful of not missing any case, here is the
exhaustive list of all their callers.
1) rmap_walk() is called from:
a) folio_referenced()
b) damon_folio_mkold()
c) damon_folio_young()
d) page_idle_clear_pte_refs()
e) try_to_unmap()
f) try_to_migrate()
g) folio_mkclean()
h) remove_migration_ptes()
In the above list, first 4 are changed in this patch to try-lock non-KSM
anon folios, similar to other types of folios. The remaining functions in
the list already hold folio lock when calling rmap_walk().
2) folio_lock_anon_vma_read() is called from following functions:
a) collect_procs_anon()
b) page_idle_clear_pte_refs()
c) damon_folio_mkold()
d) damon_folio_young()
e) folio_referenced()
f) try_to_unmap()
g) try_to_migrate()
All the functions in above list, except collect_procs_anon(), are covered
by the rmap_walk() list above. For collect_procs_anon(), with
kill_procs_now() changed to take folio lock in this patch ensures that all
callers of folio_lock_anon_vma_read() now hold the lock.
3) folio_get_anon_vma() is called from following functions, all of which
already hold the folio lock:
a) move_pages_huge_pmd()
b) __folio_split()
c) move_pages_ptes()
d) migrate_folio_unmap()
e) unmap_and_move_huge_page()
Functionally, this patch doesn't break the logic because rmap walkers
generally do some other check to see if what is expected to mapped did
happen so it's fine, or otherwise treat things as best-effort.
Among the 4 functions changed in this patch, folio_referenced() is the
only core-mm function, and is also frequently accessed. To assess the
impact of locking non-KSM anon folios in
shrink_active_list()->folio_referenced() path, we performed an app cycle
test on an arm64 android device. During the whole duration of the test
there were over 140k invocations of shrink_active_list(), out of which
over 29k had at least one non-KSM anon folio on which folio_referenced()
was called. In none of these invocations folio_trylock() failed.
Of course, we now take a lock where we wouldn't previously have. In the
past it would have had a major impact in causing a CoW write fault to copy
a page in do_wp_page(), as commit 09854ba94c ("mm: do_wp_page()
simplification") caused a failure to obtain folio lock to result in a page
copy even if one wasn't necessary.
However, since commit 6c287605fd ("mm: remember exclusively mapped
anonymous pages with PG_anon_exclusive"), and the introduction of the
folio anon exclusive flag, this issue is significantly mitigated.
The only case remaining that we might worry about from this perspective is
that of read-only folios immediately after fork where the anon exclusive
bit will not have been set yet.
We note however in the case of read-only just-forked folios that
wp_can_reuse_anon_folio() will notice the raised reference count
established by shrink_active_list() via isolate_lru_folios() and refuse to
reuse in any case, so this will in fact have no impact - the folio lock is
ultimately immaterial here.
All-in-all it appears that there is little opportunity for meaningful
negative impact from this change.
Link: https://lkml.kernel.org/r/20250923071019.775806-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20250923071019.775806-2-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, gigantic hugepages cannot use the overcommit mechanism
(nr_overcommit_hugepages), forcing users to permanently reserve memory via
nr_hugepages even when pages might not be actively used.
The restriction was added in 2011 [1], which was before there was support
for reserving 1G hugepages at runtime. Remove this blanket restriction on
gigantic hugepage overcommit. This will bring the same benefits to
gigantic pages as hugepages:
- Memory is only taken out of regular use when actually needed
- Unused surplus pages can be returned to the system
- Better memory utilization, especially with CMA backing which can
significantly increase the changes of hugepage allocation
Without this patch:
echo 3 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_overcommit_hugepages
bash: echo: write error: Invalid argument
With this patch:
echo 3 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_overcommit_hugepages
./mmap_hugetlb_test
Successfully allocated huge pages at address: 0x7f9d40000000
cat mmap_hugetlb_test.c
...
unsigned long ALLOC_SIZE = 3 * (unsigned long) HUGE_PAGE_SIZE;
addr = mmap(NULL,
ALLOC_SIZE, // 3GB
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
-1,
0);
if (addr == MAP_FAILED) {
fprintf(stderr, "mmap failed: %s\n", strerror(errno));
return 1;
}
printf("Successfully allocated huge pages at address: %p\n", addr);
...
Link: https://lkml.kernel.org/r/20251009172433.4158118-2-usamaarif642@gmail.com
Link: https://git.zx2c4.com/linux-rng/commit/mm/hugetlb.c?id=adbe8726dc2a3805630d517270db17e3af86e526 [1]
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zone_batchsize returns the appropriate value that should be used for
pcp->batch. If it finds a zone with less than 4096 pages or PAGE_SIZE >
1M, however, it leads to some incorrect math.
In the above case, we will get an intermediary value of 1, which is then
rounded down to the nearest power of two, and 1 is subtracted from it.
Since 1 is already a power of two, we will get batch = 1-1 = 0:
batch = rounddown_pow_of_two(batch + batch/2) - 1;
A pcp->batch value of 0 is nonsensical. If this were actually set, then
functions like drain_zone_pages would become no-ops, since they could
only free 0 pages at a time.
Of the two callers of zone_batchsize, the one that is actually used to
set pcp->batch works around this by setting pcp->batch to the maximum
of 1 and zone_batchsize. However, the other caller, zone_pcp_init,
incorrectly prints out the batch size of the zone to be 0.
This is probably rare in a typical zone, but the DMA zone can often have
less than 4096 pages, which means it will print out "LIFO batch:0".
Before: [ 0.001216] DMA zone: 3998 pages, LIFO batch:0
After: [ 0.001210] DMA zone: 3998 pages, LIFO batch:1
Instead of dealing with the error handling and the mismatch between the
reported and actual zone batchsize, just return 1 if the zone_batchsize
is 1 page or less before the rounding.
Link: https://lkml.kernel.org/r/20251009192933.3756712-3-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/page_alloc: pcp->batch cleanups", v2.
Two small cleanups for mm/page_alloc.
Patch 1 cleans up a misleading comment about how pcp->batch is calculated,
and folds in the calculation to increase clarity. No functional change
intended.
Patch 2 corrects zones from reporting that their pcp->batch is 0 when it
is actually 1. Namely, corrects ZONE_DMA from reporting that its batch
size is 0.
This patch (of 2):
Recently while working on another patch about batching free_pcppages_bulk
[1], I was curious why pcp->batch was always 63 on my machine. This led
me to zone_batchsize(), where I found this set of lines to determine what
the batch size should be for the host:
batch = min(zone_managed_pages(zone) >> 10, SZ_1M / PAGE_SIZE);
batch /= 4; /* We effectively *= 4 below */
if (batch < 1)
batch = 1;
All of this is good, except the comment above which says "We effectively
*= 4 below". Nowhere else in the function zone_batchsize(), is there a
corresponding multipliation by 4. Looking into the history of this, it
seems like Dave Hansen had also noticed this back in 2013 [1]. Turns out
there *used* to be a corresponding *= 4, which was turned into a *= 6
later on to be used in pageset_setup_from_batch_size(), which no longer
exists.
Despite this mismatch not being corrected in the comments, it seems that
getting rid of the /= 4 leads to a performance regression on machines with
less than 250G memory and 176 processors. As such, let us preserve the
functionality but clean up the comments.
Fold the /= 4 into the calculation above: bitshift by 10+2=12, and instead
of dividing 1MB, divide 256KB and adjust the comments accordingly. No
functional change intended.
Link: https://lkml.kernel.org/r/20251009192933.3756712-1-joshua.hahnjy@gmail.com
Link: https://lkml.kernel.org/r/20251009192933.3756712-2-joshua.hahnjy@gmail.com
Link: https://lore.kernel.org/all/20251002204636.4016712-1-joshua.hahnjy@gmail.com/ [1]
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add the file 'show_stacks_handles' to show just stack traces and their
handles, in order to resolve stack traces and handles (i.e., to identify
the stack traces for handles in previous reads from 'show_handles').
All stacks/handles must show up, regardless of their number of pages, that
might have become zero or no longer make 'count_threshold', but made it in
previous reads from 'show_handles' -- and need to be resolved later.
P.S.: now, print the extra newline independently of the number of pages.
Link: https://lkml.kernel.org/r/20251001175611.575861-5-mfo@igalia.com
Signed-off-by: Mauricio Faria de Oliveira <mfo@igalia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm/page_owner: add debugfs files 'show_handles' and
'show_stacks_handles'", v2.
Context:
The page_owner debug feature can help understand a particular situation in
in a point in time (e.g., identify biggest memory consumers; verify memory
counters that do not add up).
Another useful usecase is to collect data repeatedly over time, and use it
for profiling, monitoring, and even comparing different kernel versions,
at the stack trace level (e.g., watch for trends, leaks, correlations, and
regressions).
For this usecase, userspace periorically collects the data from page_owner
and organizes it in data structures appropriate for access per-stack
trace.
Problem:
The usecase of tracking memory usage per stack trace (or tracking it for a
particular stack trace) requires uniquely identifying each stack trace
(i.e., keys to store their memory usage over periodic data collections).
This has to be done for every stack trace in every sample/data collection,
even if tracking only one stack trace (to identify it among all others).
Therefore, an approach like hashing the stack traces in userspace to
create unique keys/identifiers for them during post-processing can quickly
become expensive, considering the repetition and a growing number of stack
traces.
Solution:
Fortunately, the kernel can provide a unique identifier for stack traces
in page_owner, which is the handle number in stackdepot. This eliminates
the need for creating keys (hashing) in userspace during post-processing.
Additionally, with that information, the stack traces themselves are not
needed until the memory usage should be resolved from a handle to a stack
trace (say, to look at the stack traces of a few top consumers). This can
reduce the amount of text emitted/copied by the kernel to userspace, and
save userspace from matching and discarding stack traces when not needed.
Changes:
This patchset adds 2 files to provide information, like 'show_stacks':
- show_handles: print handle number and number of pages (no stack traces)
- show_stacks_handles: print handle numbers and stack traces (no pages)
Now, it's possible to periodically collect data with handle numbers (keys)
and without stack traces (lower overhead) from 'show_handles', and later
do a final collection with handles and stack traces from
'show_stacks_handles' to resolve the handles to their stack traces.
The output format follows the existing 'show_stacks' file, for simplicity,
but it can certainly be changed if a different format is more convenient.
Example:
The number of base pages collected can be stored per-handle number over
the periodic data collections, and finally resolved to stack traces
per-handle number as well with a final collection.
Later, one can, for example, identify the biggest consumers and watch
their trends or correlate increases/decreases with other events in the
system, or watch a particular stack trace(s) of interest during
development.
Testing:
Tested on next-20250929.
- show_stacks:
register_dummy_stack+0x32/0x70
init_page_owner+0x29/0x2f0
page_ext_init+0x27c/0x2b0
mm_core_init+0xdc/0x110
nr_base_pages: 47
- show_handles:
handle: 1
nr_base_pages: 47
- show_stacks_handles:
register_dummy_stack+0x32/0x70
init_page_owner+0x29/0x2f0
page_ext_init+0x27c/0x2b0
mm_core_init+0xdc/0x110
handle: 1
- count_threshold:
# echo 100 >/sys/kernel/debug/page_owner_stacks/count_threshold
# grep register_dummy_stack show_stacks # not present
# grep -B4 '^handle: 1$' show_handles # not present
# grep -B4 '^handle: 1$' show_stacks_handles # present
register_dummy_stack+0x32/0x70
init_page_owner+0x29/0x2f0
page_ext_init+0x27c/0x2b0
mm_core_init+0xdc/0x110
handle: 1
This patch (of 5):
Currently, struct seq_file.private is used as an iterator in stack_list by
stack_start|next(), for stack_print().
Create a context struct for this, in order to add another field next.
No behavior change intended.
P.S.: page_owner_stack_open() is expanded with separate statements for
variable definition and return just in preparation for the next patch.
Link: https://lkml.kernel.org/r/20251001175611.575861-1-mfo@igalia.com
Link: https://lkml.kernel.org/r/20251001175611.575861-2-mfo@igalia.com
Signed-off-by: Mauricio Faria de Oliveira <mfo@igalia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() /
arch_get_unmapped_area_topdown(), both of which search current->mm for
some free space. Neither take an mm_struct - they implicitly operate on
current->mm.
But the wrapper takes an mm_struct and uses it to decide whether to search
bottom up or top down. All callers pass in current->mm for this, so
everything is working consistently. But it feels like an accident waiting
to happen; eventually someone will call that function with a different mm,
expecting to find free space in it, but what gets returned is free space
in the current mm.
So let's simplify by removing the parameter and have the wrapper use
current->mm to decide which end to start at. Now everything is consistent
and self-documenting.
Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 4687fdbb80 ("mm/filemap: Support VM_HUGEPAGE for file mappings")
introduced a special handling for VM_HUGEPAGE mappings: even if the
readahead is disabled, 1 or 2 HPAGE_PMD_ORDER pages are allocated.
This change causes a significant regression for containers with a tight
memory.max limit, if VM_HUGEPAGE is widely used. Prior to this commit,
mmap_miss logic would eventually lead to the readahead disablement,
effectively reducing the memory pressure in the cgroup. With this change
the kernel is trying to allocate 1-2 huge pages for each fault, no matter
if these pages are used or not before being evicted, increasing the memory
pressure multi-fold.
To fix the regression, let's make the new VM_HUGEPAGE conditional to the
mmap_miss check, but keep independent from the ra->ra_pages. This way the
main intention of commit 4687fdbb80 ("mm/filemap: Support VM_HUGEPAGE
for file mappings") stays intact, but the regression is resolved.
The logic behind this changes is simple: even if a user explicitly
requests using huge pages to back the file mapping (using VM_HUGEPAGE
flag), under a very strong memory pressure it's better to fall back to
ordinary pages.
Link: https://lkml.kernel.org/r/20251006175106.377411-1-roman.gushchin@linux.dev
Fixes: 4687fdbb80 ("mm/filemap: Support VM_HUGEPAGE for file mappings")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "ksm: fix exec/fork inheritance", v2.
This series fixes exec/fork inheritance. See the detailed description of
the issue below.
This patch (of 2):
Background
==========
commit d7597f59d1 ("mm: add new api to enable ksm per process")
introduced MMF_VM_MERGE_ANY for mm->flags, and allowed user to set it by
prctl() so that the process's VMAs are forcibly scanned by ksmd.
Subsequently, the 3c6f33b727 ("mm/ksm: support fork/exec for prctl")
supported inheriting the MMF_VM_MERGE_ANY flag when a task calls execve().
Finally, commit 3a9e567ca4 ("mm/ksm: fix ksm exec support for prctl")
fixed the issue that ksmd doesn't scan the mm_struct with MMF_VM_MERGE_ANY
by adding the mm_slot to ksm_mm_head in __bprm_mm_init().
Problem
=======
In some extreme scenarios, however, this inheritance of MMF_VM_MERGE_ANY
during exec/fork can fail. For example, when the scanning frequency of
ksmd is tuned extremely high, a process carrying MMF_VM_MERGE_ANY may
still fail to pass it to the newly exec'd process. This happens because
ksm_execve() is executed too early in the do_execve flow (prematurely
adding the new mm_struct to the ksm_mm_slot list).
As a result, before do_execve completes, ksmd may have already performed a
scan and found that this new mm_struct has no VM_MERGEABLE VMAs, thus
clearing its MMF_VM_MERGE_ANY flag. Consequently, when the new program
executes, the flag MMF_VM_MERGE_ANY inheritance missed.
Root reason
===========
commit d7597f59d1 ("mm: add new api to enable ksm per process") clear
the flag MMF_VM_MERGE_ANY when ksmd found no VM_MERGEABLE VMAs.
Solution
========
Firstly, Don't clear MMF_VM_MERGE_ANY when ksmd found no VM_MERGEABLE
VMAs, because perhaps their mm_struct has just been added to ksm_mm_slot
list, and its process has not yet officially started running or has not
yet performed mmap/brk to allocate anonymous VMAS.
Secondly, recheck MMF_VM_MERGEABLE again if a process takes
MMF_VM_MERGE_ANY, and create a mm_slot and join it into ksm_scan_list
again.
Link: https://lkml.kernel.org/r/20251007182504440BJgK8VXRHh8TD7IGSUIY4@zte.com.cn
Link: https://lkml.kernel.org/r/20251007182821572h_SoFqYZXEP1mvWI4n9VL@zte.com.cn
Fixes: 3c6f33b727 ("mm/ksm: support fork/exec for prctl")
Fixes: d7597f59d1 ("mm: add new api to enable ksm per process")
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Extend __kvmalloc_node_noprof() to handle non-blocking GFP flags
(GFP_NOWAIT and GFP_ATOMIC). Previously such flags were rejected,
returning NULL. With this change:
- kvmalloc() can fall back to vmalloc() if non-blocking contexts;
- for non-blocking allocations the VM_ALLOW_HUGE_VMAP option is
disabled, since the huge mapping path still contains might_sleep();
- documentation update to reflect that GFP_NOWAIT and GFP_ATOMIC
are now supported.
Link: https://lkml.kernel.org/r/20251007122035.56347-11-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
might_alloc() catches invalid blocking allocations in contexts where
sleeping is not allowed.
However when PF_MEMALLOC is set, the page allocator already skips reclaim
and other blocking paths. In such cases, a blocking gfp_mask does not
actually lead to blocking, so triggering might_alloc() splats is
misleading.
Adjust might_alloc() to skip warnings when the current task has
PF_MEMALLOC set, matching the allocator's actual blocking behaviour.
Link: https://lkml.kernel.org/r/20251007122035.56347-9-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kmsan_vmap_pages_range_noflush() allocates its temp s_pages/o_pages arrays
with GFP_KERNEL, which may sleep. This is inconsistent with vmalloc() as
it will support non-blocking requests later.
Plumb gfp_mask through the kmsan_vmap_pages_range_noflush(), so it can use
it internally for its demand.
Please note, the subsequent __vmap_pages_range_noflush() still uses
GFP_KERNEL and can sleep. If a caller runs under reclaim constraints,
sleeping is forbidden, it must establish the appropriate memalloc scope
API.
Link: https://lkml.kernel.org/r/20251007122035.56347-8-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make __vmalloc_area_node() respect non-blocking GFP masks such as
GFP_ATOMIC and GFP_NOWAIT.
- Add memalloc_apply_gfp_scope()/memalloc_restore_scope()
helpers to apply a proper scope.
- Apply memalloc_apply_gfp_scope()/memalloc_restore_scope()
around vmap_pages_range() for page table setup.
- Set "nofail" to false if a non-blocking mask is used, as
they are mutually exclusive.
This is particularly important for page table allocations that internally
use GFP_PGTABLE_KERNEL, which may sleep unless such scope restrictions are
applied. For example:
<snip>
__pte_alloc_kernel()
pte_alloc_one_kernel(&init_mm);
pagetable_alloc_noprof(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM, 0);
<snip>
Note: in most cases, PTE entries are established only up to the level
required by current vmap space usage, meaning the page tables are
typically fully populated during the mapping process.
Link: https://lkml.kernel.org/r/20251007122035.56347-6-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__vmalloc_area_node() may call free_vmap_area() or vfree() on error paths,
both of which can sleep. This becomes problematic if the function is
invoked from an atomic context, such as when GFP_ATOMIC or GFP_NOWAIT is
passed via gfp_mask.
To fix this, unify error paths and defer the cleanup of partly initialized
vm_struct objects to a workqueue. This ensures that freeing happens in a
process context and avoids invalid sleeps in atomic regions.
Link: https://lkml.kernel.org/r/20251007122035.56347-5-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
alloc_vmap_area() currently assumes that sleeping is allowed during
allocation. This is not true for callers which pass non-blocking GFP
flags, such as GFP_ATOMIC or GFP_NOWAIT.
This patch adds logic to detect whether the given gfp_mask permits
blocking. It avoids invoking might_sleep() or falling back to reclaim
path if blocking is not allowed.
This makes alloc_vmap_area() safer for use in non-sleeping contexts, where
previously it could hit unexpected sleeps, trigger warnings.
It is a preparation and adjustment step to later allow both GFP_ATOMIC and
GFP_NOWAIT allocations in this series.
Link: https://lkml.kernel.org/r/20251007122035.56347-4-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "__vmalloc()/kvmalloc() and no-block support", v4.
This patch (of 10):
Introduce a new test case "no_block_alloc_test" that verifies non-blocking
allocations using __vmalloc() with GFP_ATOMIC and GFP_NOWAIT flags.
It is recommended to build kernel with CONFIG_DEBUG_ATOMIC_SLEEP enabled
to help catch "sleeping while atomic" issues. This test ensures that
memory allocation logic under atomic constraints does not inadvertently
sleep.
Link: https://lkml.kernel.org/r/20251007122035.56347-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The DSP needs to access peripherals on AIPSTZ5 (to communicate with
the AP using AUDIOMIX MU, for instance). To do so, the security-related
registers of the bridge have to be configured before the DSP is started.
Enforce a dependency on AIPSTZ5 by adding the 'access-controllers'
property to the 'dsp' node.
Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com>
Signed-off-by: Laurentiu Mihalcea <laurentiu.mihalcea@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Change the programming model of the "aips5" node to allow configuring
the security-related registers exposed by the AIPSTZ5 bridge. Without
this, masters such as the HIFI4 DSP will have their access to the
peripherals connected to the bridge denied after power cycling the
AUDIOMIX domain.
Co-developed-by: Daniel Baluta <daniel.baluta@nxp.com>
Signed-off-by: Daniel Baluta <daniel.baluta@nxp.com>
Signed-off-by: Laurentiu Mihalcea <laurentiu.mihalcea@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
From software perspective, Rev.C HDMI and Rev.B HDMI don't differ since
the panel is connected via HDMI and the touchscreen is connected via
USB. However, the bootloader firmware expects to find a dts with the
correct revc-hdmi compatible.
Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The LED enumerators are missing, which prevents the LEDs from being
accurately told apart by label. Fill in the enumerators the same way
they are already present on PDK3.
Signed-off-by: Marek Vasut <marek.vasut@mailbox.org>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add support for the Ethernet connection over GMAC controller connected to
the Micrel KSZ9031 Ethernet RGMII PHY located on the boards.
The mentioned GMAC controller is one of two network controllers
embedded on the NXP Automotive SoCs S32G2 and S32G3.
The supported boards:
* EVB: S32G-VNP-EVB with S32G2 SoC
* RDB2: S32G-VNP-RDB2
* RDB3: S32G-VNP-RDB3
Tested-by: Enric Balletbo i Serra <eballetb@redhat.com>
Signed-off-by: Jan Petrous (OSS) <jan.petrous@oss.nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add pwm node used by the backlight output pin BKL1_PWM and
reference it from the pwm-backlight node.
Signed-off-by: Max Krummenacher <max.krummenacher@toradex.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
LPUART1 is not disabled, but used by system manager (SM) and should not
be used by Linux.
Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
66 MHz is the maximum FlexPI clock frequency in normal/overdrive mode
when RXCLKSRC = 0 (Default)
Fixes: 91d1ff322c ("arm64: dt: imx95: Add TQMa95xxSA")
Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Make SoM .dtsi SoC-agnostic by moving SoC include to board level
imx6qdl-var-som.dtsi currently includes imx6q.dtsi, which makes this SoM
description Quad/Dual specific and prevents reuse from i.MX6DL boards.
Changes:
- Move imx6q.dtsi include from imx6qdl-var-som.dtsi to
imx6q-var-mx6customboard.dts.
- Remove /dts-v1/; from imx6qdl-var-som.dtsi (dtsi files should not declare
version)
This keeps the SoM .dtsi SoC-agnostic (it already relies on imx6qdl.dtsi for
family-common parts) and allows boards using the DualLite or Solo to include
imx6dl.dtsi instead.
Why this is needed:
To reuse imx6qdl-var-som.dtsi on i.MX6DL board.
No functional changes for imx6q-var-mx6customboard are intended.
Signed-off-by: Stefan Prisacariu <stefan.prisacariu@prevas.dk>
Link: https://lore.kernel.org/all/20250925104942.4148376-1-stefan.prisacariu@prevas.dk/
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Enable the i.MX AIPSTZ driver, which is used for i.MX8MP-based boards such
as NXP's IMX8MP-EVK.
Signed-off-by: Laurentiu Mihalcea <laurentiu.mihalcea@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
From software perspective, Rev.C HDMI and Rev.B HDMI don't differ since
the panel is connected via HDMI and the touchscreen is connected via
USB. However, the bootloader firmware expects to find a dts with the
correct revc-hdmi compatible.
Signed-off-by: Marco Felsch <m.felsch@pengutronix.de>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Describe the RGB LED indicator according to the reality - it is a single
part containing all the three R,G and B LEDs in one package.
With this description the chan-name property becomes useless, remove it.
Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Lynx, Pegasus and Pegasus+ boards have a speaker connected to the PWM3.
Enable a pwm-beeper on these boards so the system can produce simple
sounds.
Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The VAR-SOM-MX93 integrates an ADS7846 resistive touchscreen controller.
The controller is physically located on the SOM, and its signals are
routed to the SOM pins, allowing carrier boards to make use of it.
This patch adds the ADS7846 node and the appropriate SPI controller.
Signed-off-by: Stefano Radaelli <stefano.radaelli21@gmail.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The VAR-SOM-MX93 can integrate the WM8904, a high-performance
ultra-low-power stereo codec optimized for portable audio applications.
This patch adds the WM8904 device to the appropriate I2C bus, enables
the SAI peripheral, and introduces the sound node to expose the
sound card to the system.
Signed-off-by: Stefano Radaelli <stefano.radaelli21@gmail.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The VAR-SOM-MX93 features Dual Freescale/NXP PCA9541 chip as a Power
Management Integrated circuit (PMIC).
The PMIC is programmable via the I2C interface and its associated
register map, and this patch adds its support.
Signed-off-by: Stefano Radaelli <stefano.radaelli21@gmail.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add device tree nodes for the WiFi and Bluetooth module mounted on the
VAR-SOM-MX93. The module can be based on either the NXP IW612 or IW611
chipset, depending on the configuration chosen by the customer.
Regardless of the chipset used, WiFi communicates over SDIO and Bluetooth
over UART.
Signed-off-by: Stefano Radaelli <stefano.radaelli21@gmail.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add the lpuart1 dts node to support the PCIE9098 bluetooth on M.2
connector.
Signed-off-by: Sherry Sun <sherry.sun@nxp.com>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add phandle to the OCOTP mac-address nodes so the FEC can obtain a fixed
MAC address specific to each board.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
default, state_100mhz and state_200mhz use the same settings. But current
driver use these to indicate if sd3.0 support.
Add SD gpio pin group (Reset, CD, WP) for usdhc2.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Assign double SD bus frequency to support SDR104 mode, where the operating
clock runs at 208 MHz.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
default, state_100mhz and state_200mhz use the same settings. But current
driver use these to indicate if sd3.0 support.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add overlay to support PWM fan on the phyBOARD-Nash-i.MX93 board. Fan
can be connected to the FAN (X48) connector on the board and will be
controlled according to the following CPU temperature trips table:
- bellow 50 degrees - fan is off (<1% duty cycle)
- between 50 and 58 degrees - low fan speed (~35% duty cycle)
- between 58 and 65 degrees - fan medium speed (~60% duty cycle)
- above 65 degrees - fan at full speed (>99% duty cycle)
The output frequency of PWM signal is set to 25 kHz.
Signed-off-by: Primoz Fiser <primoz.fiser@norik.com>
Reviewed-by: Alberto Merciai <alb3rt0.m3rciai@gmail.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Enable internal pull up of the active low audio codec reset pin.
Otherwise the audio codec does not reset properly and is not working.
Signed-off-by: Teresa Remmet <t.remmet@phytec.de>
Signed-off-by: Jan Remmet <j.remmet@phytec.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add support for powertip,ph128800t006-zhc01 connected via peb-av-10
Signed-off-by: Jan Remmet <j.remmet@phytec.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The PEB-AV-10 board can be used with different displays or in audio-only
mode.
Split the device tree overlays to reflect these use cases. To use the
board with the EDT ETML1010G3DRA display, the overlay
imx8mm-phyboard-polis-peb-av-10-etml1010g3dra.dtbo must now be used
instead of imx8mm-phyboard-polis-peb-av-10.dtbo.
Signed-off-by: Jan Remmet <j.remmet@phytec.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
sn65dsi83 is mounted on som. Add the static configuration there.
So it can be used by other boards too.
Use mipi_dsi_out from imx8mm.dtsi directly.
Signed-off-by: Jan Remmet <j.remmet@phytec.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Tree has invariant part + two subtrees that get replaced upon each
policy load. Invariant parts stay for the lifetime of filesystem,
these two subdirs - from policy load to policy load (serialized
on lock_rename(root, ...)).
All object creations are via d_alloc_name()+d_add() inside selinuxfs,
all removals are via simple_recursive_removal().
Turn those d_add() into d_make_persistent()+dput() and that's mostly it.
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Tested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
allocating dentry after the inode has been set up reduces the amount
of boilerplate - "attach this inode under that name and this parent
or drop inode in case of failure" simplifies quite a few places.
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Tested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't bother to store the dentry of /policy_capabilities - it belongs
to invariant part of tree and we only use it to populate that directory,
so there's no reason to keep it around afterwards.
Same situation as with /avc, /ss, etc. There are two directories that
get replaced on policy load - /class and /booleans. These we need to
stash (and update the pointers on policy reload); /policy_capabilities
is not in the same boat.
Acked-by: Paul Moore <paul@paul-moore.com>
Reviewed-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Tested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
removals are done with locked_recursive_removal(); switch creations to
simple_start_creating()/d_make_persistent()/simple_done_creating() and
take them to a helper (add_entry()), while we are at it - simpler control
flow that way.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
creation/removal is via normal VFS paths; make ->mkdir() and ->symlink()
use d_make_persistent(); ->rmdir() and ->unlink() - d_make_discardable()
instead of dput() and that's it.
d_make_persistent() works for unhashed just fine...
Note that only persistent dentries are ever hashed there; unusual absense
of ->d_delete() in dentry_operations is due to that - anything that has
refcount reach 0 will be unhashed there, so it won't get to checking
->d_delete anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Objects are created either by d_alloc_name()+d_add() (in
binderfs_ctl_create()) or by simple_start_creating()+d_instantiate().
Removals are by simple_recurisive_removal().
Switch d_add()/d_instantiate() to d_make_persistent() + dput().
Voila - kill_litter_super() is not needed anymore.
Fold dput()+unlocking the parent into simple_done_creating(), while
we are at it.
NOTE: return value of binderfs_create_file() is borrowed; it may get
stored in proc->binderfs_entry. See binder_release()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It's called once, during binderfs mount, right after allocating
root dentry. Checking that it hadn't been already called is
only obfuscating things.
Looks like that bogosity had been copied from devpts...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Two kinds of objects there - ptmx and everything else (pty). The former is
created on mount and kept until the fs shutdown; the latter get created
and removed by tty layer (the references are borrowed into tty->driver_data).
The reference to ptmx dentry is also kept, but we only ever use it to
find ptmx inode on remount.
* turn d_add() into d_make_persistent() + dput() both in mknod_ptmx() and
in devpts_pty_new().
* turn dput() to d_make_discardable() in devpts_pty_kill().
* switch mknod_ptmx() to simple_{start,done}_creating().
* instead of storing in pts_fs_info a reference to ptmx dentry, store a reference
to its inode, seeing that this is what we use it for.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
static contents for each "service processor", whatever the fuck it is.
Congruent subdirectories of root, created at mount time, taken out
by kill_litter_super(). All dentries created with d_alloc_name() and are
left pinned. The odd part is that the list of service providers is
assumed to be unchanging - no locking, nothing to handle removals or
extra elements added later on.
... and it's a PCI device. If you ever tell it to remove an instance,
you are fucked - it doesn't bother with removing its directory from filesystem,
it has a strange check that presumably wanted to be a check for removed
devices, but it had never been fleshed out.
Anyway, d_add() -> d_make_persistent()+dput() in ibmasmfs_create_dir() and
ibmasmfs_create_file(), and make the latter return int - no need to even
borrow that dentry, callers completely ignore it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
have spufs_new_file() use d_make_persistent() instead of d_add() and
do an uncondition dput() in the caller; the rest is completely
straightforward.
[a braino in spufs_mkgang() fixed]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Initially filesystem is populated with d_alloc_name() + d_add().
That becomes d_alloc_name() + d_make_persistent() + dput().
Dynamic creation is switched to d_make_persistent();
removal - to simple_unlink() (no point open-coding it in
efivarfs_unlink(), better call it there)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
we'd already verified that DEBUGFS_ALLOW_API was there in
start_creating() - it would've failed otherwise
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
similar to tracefs - simulation of normal codepath for creation,
simple_recursive_removal() for removal.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A mix of persistent and non-persistent dentries in there. Strictly
speaking, no need for kill_litter_super() anyway - it pins an internal
mount whenever a persistent dentry is created, so at fs shutdown time
there won't be any to deal with.
However, let's make it explicit - replace d_instantiate() with
d_make_persistent() + dput() (the latter in tracefs_end_creating(),
where it folds with inode_unlock() into simple_done_creating())
for dentries we want persistent and have d_make_discardable() done
either by simple_recursive_removal() (used by tracefs_remove())
or explicitly in eventfs_remove_events_dir().
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
object creation by d_alloc_name()+d_add() in pstore_mkfile(), removal -
via normal VFS codepaths (with ->unlink() using simple_unlink()) or
in pstore_put_backend_records() via locked_recursive_removal()
Replace d_add() with d_make_persistent()+dput() - that's what really
happens there. The reference that goes into record->dentry is valid
only until the unlink (and explicitly cleared by pstore_unlink()).
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
objects are created in fuse_ctl_add_dentry() by d_alloc_name()+d_add(),
removed by simple_remove_by_name().
What we return is a borrowed reference - it is valid until the call of
fuse_ctl_remove_conn() and we depend upon the exclusion (on fuse_mutex)
for safety. Return value is used only within the caller
(fuse_ctl_add_conn()).
Replace d_add() with d_make_persistent() + dput(). dput() is paired
with d_alloc_name() and return value is the result of d_make_persistent().
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All modifications via normal VFS codepaths; just take care of making
persistent in ->create() and ->mkdir() and that's it (removal side
doesn't need any changes, since it uses simple_rmdir() for ->rmdir()
and calls simple_unlink() from ->unlink()).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
object creation goes through the normal VFS paths or approximation
thereof (user_path_create()/done_path_create() in case of bpf_obj_do_pin(),
open-coded simple_{start,done}_creating() in bpf_iter_link_pin_kernel()
at mount time), removals go entirely through the normal VFS paths (and
->unlink() is simple_unlink() there).
Enough to have bpf_dentry_finalize() use d_make_persistent() instead
of dget() and we are done.
Convert bpf_iter_link_pin_kernel() to simple_{start,done}_creating(),
while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All modifications via normal VFS codepaths; just take care of making
persistent in in mqueue_create_attr() and discardable in mqueue_unlink()
and it doesn't need kill_litter_super() at all.
mqueue_unlink() side is best handled by having it call simple_unlink()
rather than duplicating its guts...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Very much ramfs-like; dget()+d_instantiate() -> d_make_persistent()
(in two places) is all it takes.
NB: might make sense to turn its ->put_super() into ->kill_sb().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Entirely static tree populated by simple_fill_super(). Can use
kill_anon_super() as-is.
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
entirely static tree, populated by simple_fill_super(). Can switch
to kill_anon_super() without any other changes.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These are guaranteed to be empty by the time they are shut down;
both are single-instance and there is an internal mount maintained
for as long as there is any contents.
Both have that internal mount pinned by every object in root.
In other words, kill_litter_super() boils down to kill_anon_super()
for those.
Reviewed-by: Joel Becker <jlbec@evilplan.org>
Acked-by: Paul Moore <paul@paul-moore> (LSM)
Acked-by: Andreas Hindborg <a.hindborg@kernel.org> (configfs)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and there's no need to remember those pointers anywhere - ->kill_sb()
no longer needs to bother since kill_anon_super() will take care of
them anyway and proc_pid_readdir() only wants the inumbers, which
we had in a couple of static variables all along.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Quite a bit is already done by infrastructure changes (simple_link(),
simple_unlink()) - all that is left is replacing d_instantiate() +
pinning dget() (in ->symlink() and ->mknod()) with d_make_persistent(),
and, in case of shmem, using simple_unlink() and simple_link() in
->unlink() and ->link() resp., instead of open-coding those there.
Since d_make_persistent() accepts (and hashes) unhashed ones, shmem
situation gets simpler - we no longer care whether ->lookup() has hashed
the sucker.
With that done, we don't need kill_litter_super() for these filesystems
anymore - by the umount time all remaining dentries will be marked
persistent and kill_litter_super() will boil down to call of
kill_anon_super().
The same goes for devtmpfs and rootfs - they are handled by
ramfs or by shmem, depending upon config.
NB: strictly speaking, both devtmpfs and rootfs ought to use
ramfs_kill_sb() if they end up using ramfs; that's a separate
story and the only impact of "just use kill_{litter,anon}_super()"
is that we fail to free their sb->s_fs_info... on reboot.
That's orthogonal to the changes in this series - kill_litter_super()
is identical to kill_anon_super() for those at this point.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Note that simple_unlink() et.al. are used by many filesystems; for now
they can not assume that persistency mark will have been set back
when the object got created. Once all conversions are done we'll
have them complain if called for something that had not been marked
persistent.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* d_make_persistent(dentry, inode) - bump refcount, mark persistent and
make hashed positive. Return value is a borrowed reference to dentry;
it can be used until something removes persistency (at the very least,
until the parent gets unlocked, but some filesystems may have stronger
exclusion).
* d_make_discardable() - remove persistency mark and drop reference.
d_make_persistent() is similar to combination of d_instantiate(), dget()
and setting flag. The only difference is that unlike d_instantiate()
it accepts hashed and unhashed negatives alike. It is always called in
strong locking environment (parent held exclusive, or, in some cases,
dentry coming from d_alloc_name()); if we ever start using it with parent
held only shared and dentry coming from d_alloc_parallel(), we'll need
to copy the in-lookup logics from __d_add().
d_make_discardable() is eqiuvalent to combination of removing flag and
dput(); since flag removal requires ->d_lock, there's no point trying
to avoid taking that for refcount decrement as fast_dput() does.
The slow path of dput() has been taken into a helper and reused in
d_make_discardable() instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some filesystems use a kinda-sorta controlled dentry refcount leak to pin
dentries of created objects in dcache (and undo it when removing those).
Reference is grabbed and not released, but it's not actually _stored_
anywhere. That works, but it's hard to follow and verify; among other
things, we have no way to tell _which_ of the increments is intended
to be an unpaired one. Worse, on removal we need to decide whether
the reference had already been dropped, which can be non-trivial if
that removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT)
marking those "leaked" dentries. Having it set claims responsibility
for +1 in refcount.
The end result this series is aiming for:
* get these unbalanced dget() and dput() replaced with new primitives that
would, in addition to adjusting refcount, set and clear persistency flag.
* instead of having kill_litter_super() mess with removing the remaining
"leaked" references (e.g. for all tmpfs files that hadn't been removed
prior to umount), have the regular shrink_dcache_for_umount() strip
DCACHE_PERSISTENT of all dentries, dropping the corresponding
reference if it had been set. After that kill_litter_super() becomes
an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many places
in too many filesystems. It has to be split into a series.
Here we
* introduce the new flag
* teach shrink_dcache_for_umount() to handle it (i.e. remove
and drop refcount on anything that survives to umount with that flag
still set)
* teach kill_litter_super() that anything with that flag does
*not* need to be unpinned.
Next commits will add primitives for maintaing that flag and convert the
common helpers to those. After that - a long series of per-filesystem
patches converting to those primitives.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
simple_recursive_removal(), but instead of victim dentry it takes
parent + name.
Used to be open-coded in fs/fuse/control.c, but there's no need to expose
the guts of that thing there and there are other potential users, so
let's lift it into libfs...
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we have LOCKDOWN_TRACEFS, the function bails out - *after*
having locked the parent directory and without bothering to
undo that. Just check it before tracefs_start_creating()...
Fixes: e24709454c "tracefs/eventfs: Add missing lockdown checks"
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fuse_ctl_remove_conn() used to decrement the link count of root
manually; that got subsumed by simple_recursive_removal(), but
in case when subdirectory creation has failed the latter won't
get called.
Just move the modification of parent's link count into
fuse_ctl_add_dentry() to keep the things simple. Allows to
get rid of the nlink argument as well...
Fixes: fcaac5b427 "fuse_ctl: use simple_recursive_removal()"
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Enable i.MX95 pinctrl driver necessary for booting. Also enable the
missing drivers required for Ethernet and PCIe functionality. These
drivers are used on i.MX95 boards, including the NXP i.MX95 19x19 EVK.
The below configurations were enabled (listed with their DT nodes on
imx95.dtsi):
* CONFIG_PINCTRL_IMX_SCMI for the `scmi_iomuxc` pinctrl.
* CONFIG_CLK_IMX95_BLK_CTL for the HSIO domain clock controller
(`hsio_blk_ctl`) used by the PCIe controller.
* CONFIG_NXP_NETC_BLK_CTRL for the NETC hardware domain controller
(`netc_blk_ctrl`).
* CONFIG_NXP_ENETC4 for the Ethernet controller (`enetc_port*`).
Signed-off-by: João Paulo Gonçalves <joao.goncalves@toradex.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
We used regulator-settling-time-us for the wifi regulator which is
wrong for regulator-fixed. We have to use startup-delay-us instead.
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Signed-off-by: Max Krummenacher <max.krummenacher@toradex.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
This sets in_voltage_scale to calculate the measured voltage from the
raw digital value of the ADC.
Signed-off-by: Max Krummenacher <max.krummenacher@toradex.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Functionality has been added without removing the associated TODO
comments.
Clean that up by removing TODOs no longer applicable.
Signed-off-by: Max Krummenacher <max.krummenacher@toradex.com>
Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The HDMI TX Parallel Audio Interface (HTX_PAI) is a bridge between the
Audio Subsystem to the HDMI TX Controller.
Shrink register map size of hdmi_pvi to avoid overlapped hdmi_pai device.
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The P3509 carrier board does not connect the ID GPIO. Prior to this, the
GPIO role switch driver could not detect the mode of the OTG port.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The USB Micro-B port on p3450 is capable of OTG and doesn't need to be
hardcoded to peripheral. No other supported Tegra device is set up like
this, so align for consistency.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add interrupts for Tegra234 USB wake events to support the USB wake-up
function.
Signed-off-by: Haotien Hsu <haotienh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The Tegra210 L4T bootloader RAM training will corrupt the in-RAM kernel
DT if no reserved-memory node exists. This prevents said bootloader from
being able to boot a kernel without this node, unless a chainloaded
bootloader loads the DT. Add the node to eliminate the requirement for
extra boot stages.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The Tegra210 L4T bootloader RAM training will corrupt the in-RAM kernel
DT if no reserved-memory node exists. This prevents said bootloader from
being able to boot a kernel without this node, unless a chainloaded
bootloader loads the DT. Add the node to eliminate the requirement for
extra boot stages.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The other engines are already enabled, finish filling out the media
engine nodes and power domains.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The APB DMA controller node is currently named "dma@60020000", but
according to the DT bindings the node name should be "dma-controller".
Update the node name to match the binding and fix dtbs_check warnings.
Signed-off-by: Nino Zhang <ninozhang001@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add missing address-cells 0 to GIC interrupt node to silence W=1
warning:
tegra210.dtsi:31.3-41: Warning (interrupt_map): /pcie@1003000:interrupt-map:
Missing property '#address-cells' in node /interrupt-controller@50041000, using 0 as fallback
Value '0' is correct because:
1. GIC interrupt controller does not have children,
2. interrupt-map property (in PCI node) consists of five components and
the fourth component "parent unit address", which size is defined by
'#address-cells' of the node pointed to by the interrupt-parent
component, is not used (=0)
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add missing address-cells 0 to GIC interrupt node to silence W=1
warning:
tegra194.dtsi:2391.4-42: Warning (interrupt_map): /bus@0/pcie@14100000:interrupt-map:
Missing property '#address-cells' in node /bus@0/interrupt-controller@3881000, using 0 as fallback
Value '0' is correct because:
1. GIC interrupt controller does not have children,
2. interrupt-map property (in PCI node) consists of five components and
the fourth component "parent unit address", which size is defined by
'#address-cells' of the node pointed to by the interrupt-parent
component, is not used (=0)
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add missing address-cells 0 to GIC interrupt node to silence W=1
warning:
tegra186.dtsi:1355.3-41: Warning (interrupt_map): /pcie@10003000:interrupt-map:
Missing property '#address-cells' in node /interrupt-controller@3881000, using 0 as fallback
Value '0' is correct because:
1. GIC interrupt controller does not have children,
2. interrupt-map property (in PCI node) consists of five components and
the fourth component "parent unit address", which size is defined by
'#address-cells' of the node pointed to by the interrupt-parent
component, is not used (=0)
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add missing address-cells 0 to GIC interrupt node to silence W=1
warning:
tegra132.dtsi:32.3-41: Warning (interrupt_map): /pcie@1003000:interrupt-map:
Missing property '#address-cells' in node /interrupt-controller@50041000, using 0 as fallback
Value '0' is correct because:
1. GIC interrupt controller does not have children,
2. interrupt-map property (in PCI node) consists of five components and
the fourth component "parent unit address", which size is defined by
'#address-cells' of the node pointed to by the interrupt-parent
component, is not used (=0)
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
This adds OPP tables for ACTMON and EMC, enabling dynamic frequency
scaling for system memory.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add interconnect properties to the Memory Controller, External Memory
Controller and the Display Controller nodes in order to describe the
hardware interconnection.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
This enables the action monitor to facilitate dynamic frequency scaling.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Populate USB wake events for Tegra234 XUSB host controller.
These wake-up events are optional to maintain backward compatibility and
because the USB controller does not require them for normal operation.
Signed-off-by: Haotien Hsu <haotienh@nvidia.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add to the known supported SEV-SNP policy bits that don't require any
implementation support from KVM in order to successfully use them.
At this time, this includes:
- CXL_ALLOW
- MEM_AES_256_XTS
- RAPL_DIS
- CIPHERTEXT_HIDING_DRAM
- PAGE_SWAP_DISABLE
Arguably, RAPL_DIS and CIPHERTEXT_HIDING_DRAM require KVM and the CCP
driver to enable these features in order for the setting of the policy
bits to be successfully handled. But, a guest owner may not wish their
guest to run on a system that doesn't provide support for those features,
so allowing the specification of these bits accomplishes that. Whether
or not the bit is supported by SEV firmware, a system that doesn't support
these features will either fail during the KVM validation of supported
policy bits before issuing the LAUNCH_START or fail during the
LAUNCH_START.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://patch.msgid.link/ec040de9864099cf592a97c201dc4cc110b2b0cf.1761593632.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add USB wake events for Tegra234 so that system can be woken up from
suspend when USB devices hot-plug/unplug event is detected.
Signed-off-by: Haotien Hsu <haotienh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
After commit d6ace46c82 ("ext4: remove obsolete EXT3 config options")
was added, when using the 'tegra_defconfig' kernel configuration,
mounting an MMC device on Tegra20, Tegra30 and Tegra124 boards is
failing with "unknown filesystem type 'ext4'". Fix this by updating the
'tegra_defconfig' to use the EXT4 config options and remove the
obselete EXT2/3 options.
Fixes: d6ace46c82 ("ext4: remove obsolete EXT3 config options")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Make sure to drop the reference taken to the AHB platform device when
looking up its driver data while enabling the SMMU.
Note that holding a reference to a device does not prevent its driver
data from going away.
Fixes: 89c788bab1 ("ARM: tegra: Add SMMU enabler in AHB")
Cc: stable@vger.kernel.org # 3.5
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The Mi Pad is a tablet computer based on Nvidia Tegra K1 SoC which
originally ran the Android operating system. The Mi Pad has a 7.9" IPS
display with 1536 x 2048 (324 ppi) resolution. 2 GB of RAM and 16/64 GB
of internal memory that can be supplemented with a microSDXC card giving
up to 128 GB of additional storage.
Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The "aotog" is an optional aperture, so if that aperture is not defined
for a given device, then initialise the 'aotag' pointer to NULL instead
of returning an error. Note that the PMC driver will not use 'aotag'
pointer if initialised to NULL.
Co-developed-by: Shardar Mohammed <smohammed@nvidia.com>
Signed-off-by: Shardar Mohammed <smohammed@nvidia.com>
Signed-off-by: Prathamesh Shete <pshete@nvidia.com>
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
The Jetson Nano series of modules only have 2 EMC table entries,
different from other SoC SKUs. As the EMC driver uses the SoC speedo ID
to populate the EMC OPP tables, add a new speedo ID to uniquely identify
this.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Reviewed-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Existing code only sets CPU and GPU speedo IDs 0 and 1. The CPU DVFS
code supports 11 IDs and nouveau supports 5. This aligns with what the
downstream vendor kernel supports. Align SKUs with the downstream list.
The Tegra210 CVB tables were added in the first referenced fixes commit.
Since then, all Tegra210 SoCs have tried to scale to 1.9 GHz, when the
supported devkits are only supposed to scale to 1.5 or 1.7 GHZ.
Overclocking should not be the default state.
Fixes: 2b2dbc2f94 ("clk: tegra: dfll: add CVB tables for Tegra210")
Fixes: 579db6e5d9 ("arm64: tegra: Enable DFLL support on Jetson Nano")
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Enable NVIDIA VRS (Voltage Regulator Specification) RTC device module.
It provides functionality to get/set system time, retain system time
across boot, wake system from suspend and shutdown state.
Supported platforms:
- NVIDIA Jetson AGX Orin Developer Kit
- NVIDIA IGX Orin Development Kit
- NVIDIA Jetson Orin NX Developer Kit
- NVIDIA Jetson Orin Nano Developer Kit
Signed-off-by: Shubhi Garg <shgarg@nvidia.com>
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
On Tegra platforms using ACPI, the SMCCC driver already registers the
SoC device. This makes the registration performed by the Tegra fuse
driver redundant.
When booted via ACPI, skip registering the SoC device and suppress
printing SKU information from the Tegra fuse driver, as this information
is already provided by the SMCCC driver.
Fixes: 972167c690 ("soc/tegra: fuse: Add ACPI support for Tegra194 and Tegra234")
Cc: stable@vger.kernel.org
Signed-off-by: Kartik Rajput <kkartik@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Clock and reset names are not needed if node contains only one clock and
one reset.
Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Reviewed-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
HDA is part of the DISP_USB bus, so move it into that and drop the
address prefix accordingly.
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Document CSI HW block found in Tegra20 and Tegra30 SoC.
The #nvidia,mipi-calibrate-cells is not an introduction of property, such
property already exists in nvidia,tegra114-mipi.yaml and is used in
multiple device trees. In case of Tegra30 and Tegra20 CSI block combines
mipi calibration function and CSI function, in Tegra114+ mipi calibration
got a dedicated hardware block which is already supported. This property
here is used to align with mipi-calibration logic used by Tegra114+.
Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
For BT use cases, pins are configured with pull-up state in sleep state
to avoid noise. If IRQ type is configured as level high and the GPIO line
is also in a high state, it causes continuous interrupt assertions leading
to an IRQ storm when wakeup irq enables at system suspend/runtime suspend.
Switching to edge-triggered interrupt (IRQ_TYPE_EDGE_FALLING) resolves
this by only triggering on state transitions (high-to-low) rather than
maintaining sensitivity to the static level state, effectively preventing
the continuous interrupt condition and eliminating the wakeup IRQ storm.
Fixes: 9380e0a1d4 ("arm64: dts: qcom: qrb2210-rb1: add Bluetooth support")
Signed-off-by: Praveen Talari <praveen.talari@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251110101043.2108414-2-praveen.talari@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The author failed to document the dependencies of this commit, resulting
in a regression.
This reverts commit 03e928442d.
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The current EPP, ISP and MPE schemas are largely compatible with Tegra114+,
requiring only minor adjustments. Additionally, the TSEC schema for the
Security engine, which is available from Tegra114 onwards, is included.
Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Agilex3 SoCFPGA development kit is a small form factor board similar to
Agilex5 013b board. Agilex3 is derived from Agilex5 SoCFPGA, with the main
difference of CPU cores — Agilex3 has 2 cores compared to 4 in Agilex5.
Signed-off-by: Niravkumar L Rabara <niravkumarlaxmidas.rabara@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Several drivers can benefit from registering per-instance data along
with the syscore operations. To achieve this, move the modifiable fields
out of the syscore_ops structure and into a separate struct syscore that
can be registered with the framework. Add a void * driver data field for
drivers to store contextual data that will be passed to the syscore ops.
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
>From Power10 processors onwards, each chip has 2 hemispheres. For LPARs
running on PowerVM Hypervisor, hypervisor determines the allocation of
CPU groups to each LPAR, resulting in two LPARs with the same number of
CPUs potentially having different numbers of CPUs from each hemisphere.
Additionally, it is not feasible to ascertain the hemisphere based
solely on the CPU number.
Users wishing to assign their workload to all CPUs, or a subset of CPUs
within a specific hemisphere, encounter difficulties in identifying the
cpumask. To address this, it is proposed to expose hemisphere
information as a die in sysfs. This aligns with other architectures
and facilitates the identification of CPUs within the same hemisphere.
Tools such as lstopo can also access this information.
Please note: The hypervisor reveals the locality of the CPUs to
hemispheres only in dedicated mode. Consequently, in systems where
hemisphere information is unavailable, such as shared LPARs, the
die_cpus information in sysfs will mirror package_cpus, with
die_id set to -1.
Without this change.
$ grep . /sys/devices/system/cpu/cpu16/topology/{die*,package*} 2>/dev/null
/sys/devices/system/cpu/cpu16/topology/package_cpus:000000,000000ff,ffff0000
/sys/devices/system/cpu/cpu16/topology/package_cpus_list:16-39
With this change.
$ grep . /sys/devices/system/cpu/cpu16/topology/{die*,package*} 2>/dev/null
/sys/devices/system/cpu/cpu16/topology/die_cpus:000000,00000000,00ff0000
/sys/devices/system/cpu/cpu16/topology/die_cpus_list:16-23
/sys/devices/system/cpu/cpu16/topology/die_id:2
/sys/devices/system/cpu/cpu16/topology/package_cpus:000000,000000ff,ffff0000
/sys/devices/system/cpu/cpu16/topology/package_cpus_list:16-39
snipped lstopo-no-graphics o/p
Group0 L#0 (total=8747584KB)
Package L#0 (total=3564096KB CPUModel="POWER10 (architected), altivec supported" CPURevision="2.0 (pvr 0080 0200)")
NUMANode L#0 (P#0 local=3564096KB total=3564096KB)
Die L#0 (P#0)
Core L#0 (P#0)
<snipped>
Package L#1 (total=5183488KB CPUModel="POWER10 (architected), altivec supported" CPURevision="2.0 (pvr 0080 0200)")
NUMANode L#1 (P#1 local=5183488KB total=5183488KB)
Die L#2 (P#2)
Core L#2 (P#16)
L3Cache L#4 (size=4096KB linesize=128 ways=16)
L2Cache L#4 (size=1024KB linesize=128 ways=8)
L1dCache L#4 (size=32KB linesize=128 ways=8)
L1iCache L#4 (size=48KB linesize=128 ways=6)
PU L#16 (P#16)
PU L#17 (P#18)
PU L#18 (P#20)
PU L#19 (P#22)
L3Cache L#5 (size=4096KB linesize=128 ways=16)
L2Cache L#5 (size=1024KB linesize=128 ways=8)
L1dCache L#5 (size=32KB linesize=128 ways=8)
L1iCache L#5 (size=48KB linesize=128 ways=6)
PU L#20 (P#17)
PU L#21 (P#19)
PU L#22 (P#21)
PU L#23 (P#23)
Die L#3 (P#3)
Core L#3 (P#24)
L3Cache L#6 (size=4096KB linesize=128 ways=16)
L2Cache L#6 (size=1024KB linesize=128 ways=8)
L1dCache L#6 (size=32KB linesize=128 ways=8)
L1iCache L#6 (size=48KB linesize=128 ways=6)
PU L#24 (P#24)
PU L#25 (P#26)
PU L#26 (P#28)
PU L#27 (P#30)
L3Cache L#7 (size=4096KB linesize=128 ways=16)
L2Cache L#7 (size=1024KB linesize=128 ways=8)
L1dCache L#7 (size=32KB linesize=128 ways=8)
L1iCache L#7 (size=48KB linesize=128 ways=6)
PU L#28 (P#25)
PU L#29 (P#27)
PU L#30 (P#29)
PU L#31 (P#31)
Core L#4 (P#32)
L3Cache L#8 (size=4096KB linesize=128 ways=16)
L2Cache L#8 (size=1024KB linesize=128 ways=8)
L1dCache L#8 (size=32KB linesize=128 ways=8)
L1iCache L#8 (size=48KB linesize=128 ways=6)
PU L#32 (P#32)
PU L#33 (P#34)
PU L#34 (P#36)
PU L#35 (P#38)
L3Cache L#9 (size=4096KB linesize=128 ways=16)
L2Cache L#9 (size=1024KB linesize=128 ways=8)
L1dCache L#9 (size=32KB linesize=128 ways=8)
L1iCache L#9 (size=48KB linesize=128 ways=6)
PU L#36 (P#33)
PU L#37 (P#35)
PU L#38 (P#37)
PU L#39 (P#39)
Group0 L#1 (total=7736896KB)
Package L#2 (total=5170880KB CPUModel="POWER10 (architected), altivec supported" CPURevision="2.0 (pvr 0080 0200)")
NUMANode L#2 (P#2 local=5170880KB total=5170880KB)
Die L#4 (P#4)
<snipped>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251112074859.814087-1-srikar@linux.ibm.com
The rk3288 power-controller node contains an assigned-clocks property
that conflicts with the bindings. From the git history it shows that they
wanted to assign the rk3288 EDP_24M clock input centrally before an edp
node was available. Move the edp assigned-clocks property to the edp node
to reduce dtbs_check output.
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Link: https://patch.msgid.link/7d6fa223-ab90-4c44-9180-54df78467ea5@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
The Rock 5 ITX only needs enablement for 2 nodes in order to send audio
on HDMI1, the connector closer to the 12V barrel jack and farther from
S/PDIF. It is sufficient to declare the audio injection as okay, and
to activate I2S6.
Note that for the other HDMI output it is not that trivial, as the video
data there originates from the SoC's DisplayPort output DP1 and is only
converted to HDMI in U7 (an RA620).
Signed-off-by: Torsten Duwe <duwe@lst.de>
[fixed commit subject prefixes]
Link: https://patch.msgid.link/20251110181153.CC62B6732A@verein.lst.de
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
According to the APM Volume #2, Section 15.17, Table 15-10 (24593—Rev.
3.42—March 2024), When "GIF==0", an "Debug exception or trap, due to
breakpoint register match" should be "Ignored and discarded".
KVM lacks any handling of this. Even when vGIF is enabled and vGIF==0,
the CPU does not ignore #DBs and relies on the VMM to do so.
Handling this is possible, but the complexity is unjustified given the
rarity of using HW breakpoints when GIF==0 (e.g. near VMRUN). KVM would
need to intercept the #DB, temporarily disable the breakpoint,
singe-step over the instruction (probably reusing NMI singe-stepping),
and re-enable the breakpoint.
Instead, document this as an erratum.
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://patch.msgid.link/20251030223757.2950309-1-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
When re-injecting a soft interrupt from an INT3, INT0, or (select) INTn
instruction, discard the exception and retry the instruction if the code
stream is changed (e.g. by a different vCPU) between when the CPU
executes the instruction and when KVM decodes the instruction to get the
next RIP.
As effectively predicted by commit 6ef88d6e36 ("KVM: SVM: Re-inject
INT3/INTO instead of retrying the instruction"), failure to verify that
the correct INTn instruction was decoded can effectively clobber guest
state due to decoding the wrong instruction and thus specifying the
wrong next RIP.
The bug most often manifests as "Oops: int3" panics on static branch
checks in Linux guests. Enabling or disabling a static branch in Linux
uses the kernel's "text poke" code patching mechanism. To modify code
while other CPUs may be executing that code, Linux (temporarily)
replaces the first byte of the original instruction with an int3 (opcode
0xcc), then patches in the new code stream except for the first byte,
and finally replaces the int3 with the first byte of the new code
stream. If a CPU hits the int3, i.e. executes the code while it's being
modified, then the guest kernel must look up the RIP to determine how to
handle the #BP, e.g. by emulating the new instruction. If the RIP is
incorrect, then this lookup fails and the guest kernel panics.
The bug reproduces almost instantly by hacking the guest kernel to
repeatedly check a static branch[1] while running a drgn script[2] on
the host to constantly swap out the memory containing the guest's TSS.
[1]: https://gist.github.com/osandov/44d17c51c28c0ac998ea0334edf90b5a
[2]: https://gist.github.com/osandov/10e45e45afa29b11e0c7209247afc00b
Fixes: 6ef88d6e36 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://patch.msgid.link/1cc6dcdf36e3add7ee7c8d90ad58414eeb6c3d34.1762278762.git.osandov@fb.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Not all system controller registers are accessible from Linux. Accessing
such registers generates synchronous external abort. Populate the
readable_reg and writeable_reg members of the regmap config to inform the
regmap core which registers can be accessed. The list will need to be
updated whenever new system controller functionality is exported through
regmap.
Fixes: 2da2740fb9 ("soc: renesas: rz-sysc: Add syscon/regmap support")
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20251105070526.264445-3-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Enable Ethernet support on the RZ/T2H and RZ/N2H EVKs.
Configure the MIIC converter in mode 0x6:
Port 0 <-> ETHSW Port 0
Port 1 <-> ETHSW Port 1
Port 2 <-> GMAC2
Port 3 <-> GMAC1
Enable the ETHSS, GMAC1 and GMAC2 nodes. ETHSW support will be added
once the switch driver is available.
Configure the MIIC converters to map ports according to the selected
switching mode, with converters 0 and 1 mapped to switch ports and
converters 2 and 3 mapped to GMAC ports.
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20251110203926.692242-1-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
The Renesas RZ/V2H SoC includes a Thermal Sensor Unit (TSU) block designed
to measure the junction temperature. The device provides real-time
temperature measurements for thermal management, utilizing two dedicated
channels for temperature sensing:
- TSU0, which is located near the DRP-AI block
- TSU1, which is located near the CPU and DRP-AI block
Since TSU1 is physically closer the CPU and the highest temperature
spot, it is used for CPU throttling through a passive trip and cooling
map. TSU0 is configured only with a critical trip.
Add TSU nodes along with thermal zones and keep them enabled in the SoC
DTSI.
Signed-off-by: Ovidiu Panait <ovidiu.panait.rb@renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20251020143107.13974-4-ovidiu.panait.rb@renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Add support for Partial-IO poweroff. In Partial-IO pins of a few
hardware units can generate system wakeups while DDR memory is not
powered resulting in a fresh boot of the system. These hardware units in
the SoC are always powered so that some logic can detect pin activity.
If the system supports Partial-IO as described in the fw capabilities, a
sys_off handler is added. This sys_off handler decides if the poweroff
is executed by entering normal poweroff or Partial-IO instead. The
decision is made by checking if wakeup is enabled on all devices that
may wake up the SoC from Partial-IO.
The possible wakeup devices are found by checking which devices
reference a "Partial-IO" system state in the list of wakeup-source
system states. Only devices that are actually enabled by the user will
be considered as an active wakeup source. If none of the wakeup sources
is enabled the system will do a normal poweroff. If at least one wakeup
source is enabled it will instead send a TI_SCI_MSG_PREPARE_SLEEP
message from the sys_off handler. Sending this message will result in an
immediate shutdown of the system. No execution is expected after this
point. The code will wait for 5s and do an emergency_restart afterwards
if Partial-IO wasn't entered at that point.
A short documentation about Partial-IO can be found in section 6.2.4.5
of the TRM at
https://www.ti.com/lit/pdf/spruiv7
Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com>
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Reviewed-by: Kendall Willis <k-willis@ti.com>
Reviewed-by: Sebin Francis <sebin.francis@ti.com>
Link: https://patch.msgid.link/20251103-topic-am62-partialio-v6-12-b4-v10-2-0557e858d747@baylibre.com
Signed-off-by: Nishanth Menon <nm@ti.com>
Use struct_size() instead of manually calculating the number of bytes to
allocate for 'caps', including the nested flexible array, and copy all of
'caps' to user space with a single copy_to_user() call (thanks to the full
size being provided by struct_size()).
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20251017213914.167301-1-thorsten.blum@linux.dev
[sean: separate from swap of get_user() vs. kzalloc() ordering]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Take MMU lock around tdh_vp_init() in KVM_TDX_INIT_VCPU to prevent
meeting contention during retries in some no-fail MMU paths.
The TDX module takes various try-locks internally, which can cause
SEAMCALLs to return an error code when contention is met. Dealing with
an error in some of the MMU paths that make SEAMCALLs is not straight
forward, so KVM takes steps to ensure that these will meet no contention
during a single BUSY error retry. The whole scheme relies on KVM to take
appropriate steps to avoid making any SEAMCALLs that could contend while
the retry is happening.
Unfortunately, there is a case where contention could be met if userspace
does something unusual. Specifically, hole punching a gmem fd while
initializing the TD vCPU. The impact would be triggering a KVM_BUG_ON().
The resource being contended is called the "TDR resource" in TDX docs
parlance. The tdh_vp_init() can take this resource as exclusive if the
'version' passed is 1, which happens to be version the kernel passes. The
various MMU operations (tdh_mem_range_block(), tdh_mem_track() and
tdh_mem_page_remove()) take it as shared.
There isn't a KVM lock that maps conceptually and in a lock order friendly
way to the TDR lock. So to minimize infrastructure, just take MMU lock
around tdh_vp_init(). This makes the operations we care about mutually
exclusive. Since the other operations are under a write mmu_lock, the code
could just take the lock for read, however this is weirdly inverted from
the actual underlying resource being contended. Since this is covering an
edge case that shouldn't be hit in normal usage, be a little less weird
and take the mmu_lock for write around the call.
Fixes: 02ab57707b ("KVM: TDX: Implement hooks to propagate changes of TDP MMU mirror page table")
Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20251028002824.1470939-1-rick.p.edgecombe@intel.com
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[sean: tweak comment and capture PUNCH_HOLE interaction]
Signed-off-by: Sean Christopherson <seanjc@google.com>
This was done as condition on direct_io_allow_mmap, but I believe
this is not right, as a file might be open two times - once with
write-back enabled another time with FOPEN_DIRECT_IO.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
generic_file_direct_write() also does this and has a large
comment about.
Reproducer here is xfstest's generic/209, which is exactly to
have competing DIO write and cached IO read.
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
By "length of a string" usually the number of non-null chars is
meant (i.e. strlen(str)). So the variable 'namelen' was confusingly
named, whereas 'namesize' refers more to what's being done in
'get_security_context'.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
As pointed out in [1], strcpy() is deprecated in favor of strscpy().
Furthermore, the size of the buffer for the name to be copied is well known
at this point since we are going to move the pointer by that much on the
next line. Hence, it's safe to assume 'namelen' for the size of the string
to be copied.
[1] https://github.com/KSPP/linux/issues/88
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add the "altr,agilex5-dw-i3c-master" compatible string to the
I3C controller nodes on the Agilex5 SoCFPGA platform.
Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add support for the Milianke MLKPAI FS01 board based on the Anlogic
DR1V90 SoC. The board features 512MB of onboard memory, USB-C UART, 1GbE
RJ45 Ethernet, USB-A 2.0 port, TF card slot, and 256Mbit Quad-SPI flash.
Currently, the board can boot to a console via UART1, which is connected
to the onboard serial chip and routed to the Type-C interface.
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
DR1V90 is a FPSoC from Anlogic, which features a RISC-V core as the PS
part and 94,464 LUTs for the PL part.
The PS part integrates a Nuclei UX900 RISC-V core with 32KB L1 icache
and 32KB L1 dcache. It also provides two "snps,dw-apb-uart" compatible
UART controllers.
Some basic information of the processor can be obtained by running a
simple application from nuclei-sdk [1]:
-----Nuclei RISC-V CPU Configuration Information-----
MARCHID: 0xc900
MIMPID: 0x20300
ISA: RV64 A B C D F I M P S U
MCFG: TEE ECC ECLIC PLIC PPI ILM DLM ICACHE DCACHE IREGION No-Safety-Mechanism DLEN=VLEN/2
ILM: 256 KB has-ecc
DLM: 256 KB has-ecc
ICACHE: 32 KB(set=256,way=2,lsize=64,ecc=1)
DCACHE: 32 KB(set=256,way=2,lsize=64,ecc=1)
TLB: MainTLB(set=32,way=2,entry=1,ecc=1) ITLB(entry=8) DTLB(entry=8)
IREGION: 0x68000000 128 MB
Unit Size Address
INFO 64KB 0x68000000
DEBUG 64KB 0x68010000
ECLIC 64KB 0x68020000
TIMER 64KB 0x68030000
PLIC 64MB 0x6c000000
INFO-Detail:
mpasize : 0
PPI: 0xf8000000 128 MB
-----End of Nuclei CPU INFO-----
Link: https://github.com/Nuclei-Software/nuclei-sdk/blob/master/application/baremetal/cpuinfo/main.c [1]
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
The Anlogic DR1V90 SoC integrates a UART controller compatible with
snps,dw-apb-uart, operating at a 50 MHz clock.
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Add MTIMER support for Anlogic DR1V90 SoC, which uses Nuclei UX900 with
a TIMER unit compliant with the ACLINT specification.
Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Add Anlogic DR1V90 FPSoC, featuring a UX900 RISC-V core as the
processing system (PS) and 94,464 LUTs programmable logic (PL). It is
used by the Milianke MLKPAI-FS01 board, a SBC equipped with 512MB DDR3
memory, USB-C UART, 1GbE RJ45 Ethernet, USB-A 2.0 port, TF card slot,
and 256Mbit Quad-SPI flash.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
The UX900 is a RISC-V core from Nuclei, used in the Anlogic DR1V90 SoC.
It features a 64-bit architecture and dual-issue, 9-stage pipeline, with
lots of optional extensions including V, K, Zc, and more.
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Add vendor prefixes for "anlogic", "milianke" and "nuclei". These are
required for describing the Milianke MLKPAI-FS01 board with DR1V90 SoC
from Anlogic, which uses a processor core designed by Nuclei.
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
The SpacemiT K1 SoC QSPI IP uses the Freescale driver. Enable it as a
module in the default kernel configuration for RISC-V.
Acked-by: Paul Walmsley <pjw@kernel.org> # for arch/riscv
Signed-off-by: Alex Elder <elder@riscstar.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
SpacemiT K1 SoC is equipped with a total of nine I2C controllers,
ranging from I2C0 to I2C8.
Prior to this change, only I2C2 and I2C8 were explicitly defined
within the device tree. This patch comprehensively adds the
device tree node definitions for I2C controller 0, 1, 4 to 7.
The I2C3 node is not added because it belongs exclusively to the
secure domain which not used in the linux realm.
All newly added I2C nodes are set to "disabled" status by default,
allowing future board-specific device tree to enable and configure.
Signed-off-by: Troy Mitchell <troy.mitchell@linux.spacemit.com>
Link: https://lore.kernel.org/r/20251105-k1-add-i2c-node-v1-2-d18dae246137@linux.spacemit.com
Signed-off-by: Yixun Lan <dlan@gentoo.org>
There is no functional change with this patch. It simply refactors
function fuse_conn_put() to not use negative logic, which makes it more
easier to read.
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
With the infrastructure introduced to periodically invalidate expired
dentries, it is now possible to add an extra work queue to invalidate
dentries when an epoch is incremented. This work queue will only be
triggered when the 'inval_wq' parameter is set.
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
This patch adds the necessary infrastructure to keep track of all dentries
created for FUSE file systems. A set of rbtrees, protected by hashed
locks, will be used to keep all these dentries sorted by expiry time.
A new module parameter 'inval_wq' is also added. When set, it will start
a work queue which will periodically invalidate expired dentries. The
value of this new parameter is the period, in seconds, for this work
queue. Once this parameter is set, every new dentry will be added to one
of the rbtrees.
When the work queue is executed, it will check all the rbtrees and will
invalidate those dentries that have timed-out.
The work queue period can not be smaller than 5 seconds, but can be
disabled by setting 'inval_wq' to zero (which is the default).
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Add and export a new helper d_dispose_if_unused() which is simply a wrapper
around to_shrink_list(), to add an entry to a dispose list if it's not used
anymore.
Also export shrink_dentry_list() to kill all dentries in a dispose list.
Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Luis Henriques <luis@igalia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Document the new userspace-visible features and APIs for handling
synchronous external abort (SEA)
- KVM_CAP_ARM_SEA_TO_USER: How userspace enables the new feature.
- KVM_EXIT_ARM_SEA: exit userspace gets when it needs to handle SEA
and what userspace gets while taking the SEA.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Link: https://msgid.link/20251013185903.1372553-4-jiaqiyan@google.com
[ oliver: make documentation concise, remove implementation detail ]
Signed-off-by: Oliver Upton <oupton@kernel.org>
Test how KVM handles guest SEA when APEI is unable to claim it, and
KVM_CAP_ARM_SEA_TO_USER is enabled.
The behavior is triggered by consuming recoverable memory error (UER)
injected via EINJ. The test asserts two major things:
1. KVM returns to userspace with KVM_EXIT_ARM_SEA exit reason, and
has provided expected fault information, e.g. esr, flags, gva, gpa.
2. Userspace is able to handle KVM_EXIT_ARM_SEA by injecting SEA to
guest and KVM injects expected SEA into the VCPU.
Tested on a data center server running Siryn AmpereOne processor
that has RAS support.
Several things to notice before attempting to run this selftest:
- The test relies on EINJ support in both firmware and kernel to
inject UER. Otherwise the test will be skipped.
- The under-test platform's APEI should be unable to claim the SEA.
Otherwise the test will be skipped.
- Some platform doesn't support notrigger in EINJ, which may cause
APEI and GHES to offline the memory before guest can consume
injected UER, and making test unable to trigger SEA.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Link: https://msgid.link/20251013185903.1372553-3-jiaqiyan@google.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
When APEI fails to handle a stage-2 synchronous external abort (SEA),
today KVM injects an asynchronous SError to the VCPU then resumes it,
which usually results in unpleasant guest kernel panic.
One major situation of guest SEA is when vCPU consumes recoverable
uncorrected memory error (UER). Although SError and guest kernel panic
effectively stops the propagation of corrupted memory, guest may
re-use the corrupted memory if auto-rebooted; in worse case, guest
boot may run into poisoned memory. So there is room to recover from
an UER in a more graceful manner.
Alternatively KVM can redirect the synchronous SEA event to VMM to
- Reduce blast radius if possible. VMM can inject a SEA to VCPU via
KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
consumption or fault is not from guest kernel, blast radius can be
limited to the triggering thread in guest userspace, so VM can
keep running.
- Allow VMM to protect from future memory poison consumption by
unmapping the page from stage-2, or to interrupt guest of the
poisoned page so guest kernel can unmap it from stage-1 page table.
- Allow VMM to track SEA events that VM customers care about, to restart
VM when certain number of distinct poison events have happened,
to provide observability to customers in log management UI.
Introduce an userspace-visible feature to enable VMM handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
when host APEI fails to claim a SEA, userspace can opt in this new
capability to let KVM exit to userspace during SEA if it is not
owned by host.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
KVM fills kvm_run.arm_sea with as much as possible information about
the SEA, enabling VMM to emulate SEA to guest by itself.
- Sanitized ESR_EL2. The general rule is to keep only the bits
useful for userspace and relevant to guest memory.
- Flags indicating if faulting guest physical address is valid.
- Faulting guest physical and virtual addresses if valid.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://msgid.link/20251013185903.1372553-2-jiaqiyan@google.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
Refer to PCI Express M.2 Specification r5.1 sec3.1.1 Power Sources and
Grounds.
PCI Express M.2 Socket 1 utilizes a 3.3 V power source. The voltage
source, 3.3 V, is expected to be available during the system’s
stand-by/suspend state to support wake event processing on the
communications card.
Add vpcie3v3aux regulator to let this 3.3 V power source always on for
PCIe M.2 Key E connector(PCIe0) on i.MX95 19x19 EVK board.
PCIe1 uses one standard PCIe slot connector, but combines the +3.3v and
+3.3Vaux into a same 3.3v power source, and intends to let it always on.
Add vpcie3v3aux regulator to let this 3.3 V power source always on for
PCIe1 on i.MX95 19x19 EVK board too.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Refer to PCI Express M.2 Specification r5.1 sec3.1.1 Power Sources and
Grounds.
PCI Express M.2 Socket 1 utilizes a 3.3 V power source. The voltage
source, 3.3 V, is expected to be available during the system’s
stand-by/suspend state to support wake event processing on the
communications card.
Add vpcie3v3aux regulator to let this 3.3 V power source always on for
PCIe M.2 Key E connector on i.MX95 15x15 EVK board.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Refer to PCI Express M.2 Specification r5.1 sec3.1.1 Power Sources and
Grounds.
PCI Express M.2 Socket 1 utilizes a 3.3 V power source. The voltage
source, 3.3 V, is expected to be available during the system’s
stand-by/suspend state to support wake event processing on the
communications card.
Add vpcie3v3aux regulator to let this 3.3 V power source always on for
PCIe M.2 Key E connector on i.MX8QXP MEK board.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Refer to PCI Express M.2 Specification r5.1 sec3.1.1 Power Sources and
Grounds.
PCI Express M.2 Socket 1 utilizes a 3.3 V power source. The voltage
source, 3.3 V, is expected to be available during the system’s
stand-by/suspend state to support wake event processing on the
communications card.
Add vpcie3v3aux regulator to let this 3.3 V power source always on for
PCIe M.2 Key E connector on i.MX8QM MEK board.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Refer to PCI Express M.2 Specification r5.1 sec3.1.1 Power Sources and
Grounds.
PCI Express M.2 Socket 1 utilizes a 3.3 V power source. The voltage
source, 3.3 V, is expected to be available during the system’s
stand-by/suspend state to support wake event processing on the
communications card.
Add vpcie3v3aux regulator to let this 3.3 V power source always on for
PCIe M.2 Key E connector on i.MX8MQ EVK board.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Refer to PCI Express M.2 Specification r5.1 sec3.1.1 Power Sources and
Grounds.
PCI Express M.2 Socket 1 utilizes a 3.3 V power source. The voltage
source, 3.3 V, is expected to be available during the system’s
stand-by/suspend state to support wake event processing on the
communications card.
Add vpcie3v3aux regulator to let this 3.3 V power source always on for
PCIe M.2 Key E connector on i.MX8MP EVK board.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Refer to PCI Express M.2 Specification r5.1 sec3.1.1 Power Sources and
Grounds.
PCI Express M.2 Socket 1 utilizes a 3.3 V power source. The voltage
source, 3.3 V, is expected to be available during the system’s
stand-by/suspend state to support wake event processing on the
communications card.
Add vpcie3v3aux regulator to let this 3.3 V power source always on for
PCIe M.2 Key E connector on i.MX8DXL EVK board.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
According to PCIe r6.1, sec 5.5.1.
The following rules define how the L1.1 and L1.2 substates are entered:
Both the Upstream and Downstream Ports must monitor the logical state of
the CLKREQ# signal.
Typical implement is using open drain, which connect RC's clkreq# to
EP's clkreq# together and pull up clkreq#.
imx8qxp-mek matches this requirement, so add supports-clkreq to allow
PCIe device enter ASPM L1 Sub-State.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
According to PCIe r6.1, sec 5.5.1.
The following rules define how the L1.1 and L1.2 substates are entered:
Both the Upstream and Downstream Ports must monitor the logical state of
the CLKREQ# signal.
Typical implement is using open drain, which connect RC's clkreq# to
EP's clkreq# together and pull up clkreq#.
imx8qm-mek matches this requirement, so add supports-clkreq to allow
PCIe device enter ASPM L1 Sub-State.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
According to PCIe r6.1, sec 5.5.1.
The following rules define how the L1.1 and L1.2 substates are entered:
Both the Upstream and Downstream Ports must monitor the logical state of
the CLKREQ# signal.
Typical implement is using open drain, which connect RC's clkreq# to
EP's clkreq# together and pull up clkreq#.
imx8mq-evk matches this requirement, so add supports-clkreq to allow
PCIe device enter ASPM L1 Sub-State.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
According to PCIe r6.1, sec 5.5.1.
The following rules define how the L1.1 and L1.2 substates are entered:
Both the Upstream and Downstream Ports must monitor the logical state of
the CLKREQ# signal.
Typical implement is using open drain, which connect RC's clkreq# to
EP's clkreq# together and pull up clkreq#.
imx8mp-evk matches this requirement, so add supports-clkreq to allow
PCIe device enter ASPM L1 Sub-State.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
According to PCIe r6.1, sec 5.5.1.
The following rules define how the L1.1 and L1.2 substates are entered:
Both the Upstream and Downstream Ports must monitor the logical state of
the CLKREQ# signal.
Typical implement is using open drain, which connect RC's clkreq# to
EP's clkreq# together and pull up clkreq#.
imx8mm-evk matches this requirement, so add supports-clkreq to allow
PCIe device enter ASPM L1 Sub-State.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
According to PCIe r6.1, sec 5.5.1.
The following rules define how the L1.1 and L1.2 substates are entered:
Both the Upstream and Downstream Ports must monitor the logical state of
the CLKREQ# signal.
Typical implement is using open drain, which connect RC's clkreq# to
EP's clkreq# together and pull up clkreq#.
imx95-19x19-evk matches this requirement, so add supports-clkreq to
allow PCIe device enter ASPM L1 Sub-State.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
According to PCIe r6.1, sec 5.5.1.
The following rules define how the L1.1 and L1.2 substates are entered:
Both the Upstream and Downstream Ports must monitor the logical state of
the CLKREQ# signal.
Typical implement is using open drain, which connect RC's clkreq# to
EP's clkreq# together and pull up clkreq#.
imx95-15x15-evk matches this requirement, so add supports-clkreq to
allow PCIe device enter ASPM L1 Sub-State.
Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
While the Beaglebone capes have the Atmel AT24C256 chip (256kbit or 32kB),
the internal Beaglebone eeprom chip (i2c bus 0, addr 0x50), is an AT24C32
(32kbit or 4kB). Yet the device tree lists AT24C256 as the compatible chip
prior to this patch. You can confirm this by running
`sudo hexdump -C /sys/bus/nvmem/devices/0-00500/nvmem`. You can see the
factory data is repeated every 0x1000 addresses (every 4096 bytes or 32768
bits). This is because the read command wraps around to reading 0x0000 when
a user requests address 0x1000.
This is not a huge issue for reading, but it is for writing to the EEPROM
for two reasons:
1) If a user writes to addresses 0x1000 - 0x104e, they'll accidentally
overwrite the factory data stored at 0x0000 - 0x104e. This also is an issue
for writing to 0x2000 - 0x204e, and so on.
2) AT24C256 has 64-byte pages, but AT24C32 only has 32 byte pages. Thus, if
you attempt to write more than 32 bytes, bytes 32-64 will wrap around. This
causes your data in the actual EEPROM chip's bytes 0-32 to be overwritten by
the data in your request's bytes 32-64, while the EEPROM chip's bytes 32-64
remain 0xFF (unwritten). Lastly, the Beaglebone Black's user manual does
correctly mention that the internal EEPROM is 4kB (while capes are 32kB or
256kbit). It's just this bit of code that does not match.
Signed-off-by: George Kelly <george.kelly1097@gmail.com>
Link: https://lore.kernel.org/r/20251108102741.47628-1-george.kelly1097@gmail.com
Signed-off-by: Kevin Hilman <khilman@baylibre.com>
In the 'mdt_loader.h' header, both the prototype and the inline
version of the qcom_mdt_load() function uses 'fw_name' as name for
the firmware name parameter. Additionally, the other qcom_mdt_*
functions are using that as well.
For consistency, rename the 'firmware' parameter in the implementation
of the qcom_mdt_load() to 'fw_name' and update the function accordingly.
No functional changes.
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251111-mdt-loader-cleanup-v1-2-71afee094dce@gmail.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The qcom_mdt_load_no_init() function is just a simple wrapper around
of __qcom_mdt_load(). Since commit 0daf35da39 ("soc: qcom: mdt_loader:
Remove pas id parameter") both functions are using the same type of
parameters and providing the same functionality.
Keeping two functions for the same purpose is superfluous, so rename
the __qcom_mdt_load() function to qcom_mdt_load_no_init() and remove
the wrapper.
No functional changes.
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251111-mdt-loader-cleanup-v1-1-71afee094dce@gmail.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Commit e26ee4efbc ("fuse: allocate ff->release_args only if release is
needed") skips allocating ff->release_args if the server does not
implement open. However in doing so, fuse_prepare_release() now skips
grabbing the reference on the inode, which makes it possible for an
inode to be evicted from the dcache while there are inflight readahead
requests. This causes a deadlock if the server triggers reclaim while
servicing the readahead request and reclaim attempts to evict the inode
of the file being read ahead. Since the folio is locked during
readahead, when reclaim evicts the fuse inode and fuse_evict_inode()
attempts to remove all folios associated with the inode from the page
cache (truncate_inode_pages_range()), reclaim will block forever waiting
for the lock since readahead cannot relinquish the lock because it is
itself blocked in reclaim:
>>> stack_trace(1504735)
folio_wait_bit_common (mm/filemap.c:1308:4)
folio_lock (./include/linux/pagemap.h:1052:3)
truncate_inode_pages_range (mm/truncate.c:336:10)
fuse_evict_inode (fs/fuse/inode.c:161:2)
evict (fs/inode.c:704:3)
dentry_unlink_inode (fs/dcache.c:412:3)
__dentry_kill (fs/dcache.c:615:3)
shrink_kill (fs/dcache.c:1060:12)
shrink_dentry_list (fs/dcache.c:1087:3)
prune_dcache_sb (fs/dcache.c:1168:2)
super_cache_scan (fs/super.c:221:10)
do_shrink_slab (mm/shrinker.c:435:9)
shrink_slab (mm/shrinker.c:626:10)
shrink_node (mm/vmscan.c:5951:2)
shrink_zones (mm/vmscan.c:6195:3)
do_try_to_free_pages (mm/vmscan.c:6257:3)
do_swap_page (mm/memory.c:4136:11)
handle_pte_fault (mm/memory.c:5562:10)
handle_mm_fault (mm/memory.c:5870:9)
do_user_addr_fault (arch/x86/mm/fault.c:1338:10)
handle_page_fault (arch/x86/mm/fault.c:1481:3)
exc_page_fault (arch/x86/mm/fault.c:1539:2)
asm_exc_page_fault+0x22/0x27
Fix this deadlock by allocating ff->release_args and grabbing the
reference on the inode when preparing the file for release even if the
server does not implement open. The inode reference will be dropped when
the last reference on the fuse file is dropped (see fuse_file_put() ->
fuse_release_end()).
Fixes: e26ee4efbc ("fuse: allocate ff->release_args only if release is needed")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reported-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
The RTL8211E ethernet PHY on the Debix Model A board it located at
address 1. Replace the broadcast address with the correct unicast
address.
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Commit 13799748b9 ("powerpc/64: use interrupt restart table to speed
up return from interrupt") removed the inconditional clearing of
MSR[RI] when returning from interrupt into kernel. But powerpc/32
doesn't implement interrupt restart table hence still need MSR[RI]
to be cleared.
It could be added back in interrupt_exit_kernel_prepare() but it is
easier and better to add it back in entry_32.S for following reasons:
- Writing to MSR must be followed by a synchronising instruction
- The smaller the non recoverable section is the better it is
So add a macro called clr_ri and use it in the three places that play
up with SRR0/SRR1. Use it just before another mtspr for synchronisation
to avoid having to add an isync.
Now that's done in entry_32.S, exit_must_hard_disable() can return
false for non book3s/64, taking into account that BOOKE doesn't have
MSR_RI.
Also add back blacklisting syscall_exit_finish for kprobe. This was
initially added by commit 7cdf440138 ("powerpc/entry32: Blacklist
syscall exit points for kprobe.") then lost with
commit 6f76a01173 ("powerpc/syscall: implement system call
entry/exit logic in C for PPC32").
Fixes: 6f76a01173 ("powerpc/syscall: implement system call entry/exit logic in C for PPC32")
Fixes: 13799748b9 ("powerpc/64: use interrupt restart table to speed up return from interrupt")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/66d0ab070563ad460ed481328ab0887c27f21a2c.1757593807.git.christophe.leroy@csgroup.eu
The elfcorehdr segment in the kdump image stores information about the
memory regions (called crash memory ranges) that the kdump kernel must
capture.
When a memory hot-remove event occurs, the kernel regenerates the
elfcorehdr for the currently loaded kdump image to remove the
hot-removed memory from the crash memory ranges.
Call chain:
remove_mem_range()
update_crash_elfcorehdr()
arch_crash_handle_hotplug_event()
crash_handle_hotplug_event()
While removing the hot-removed memory from the crash memory ranges in
remove_mem_range(), if the removed memory lies within an existing crash
range, that range is split into two. During this split, the size of the
second range was being calculated incorrectly.
This leads to dump capture failure with makedumpfile with below error:
$ makedumpfile -l -d 31 /proc/vmcore /tmp/vmcore
readpage_elf: Attempt to read non-existent page at 0xbbdab0000.
readmem: type_addr: 0, addr:c000000bbdab7f00, size:16
validate_mem_section: Can't read mem_section array.
readpage_elf: Attempt to read non-existent page at 0xbbdab0000.
readmem: type_addr: 0, addr:c000000bbdab7f00, size:8
get_mm_sparsemem: Can't get the address of mem_section.
The updated crash memory range in PT_LOAD entry is holding incorrect
data (checkout FileSiz and MemSiz):
readelf -a /proc/vmcore
<snip...>
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000b013d0000 0xc000000b80000000 0x0000000b80000000
0xffffffffc0000000 0xffffffffc0000000 RWE 0x0
<snip...>
Update the size calculation for the new crash memory range to fix this
issue.
Note: This problem will not occur if the kdump kernel is loaded or
reloaded after a memory hot-remove operation.
Fixes: 849599b702 ("powerpc/crash: add crash memory hotplug support")
Reported-by: Shirisha G <shirisha@linux.ibm.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251105033941.1752287-1-sourabhjain@linux.ibm.com
Commit 35c18f2933 ("Add a new optional ",cma" suffix to the
crashkernel= command line option") and commit ab475510e0 ("kdump:
implement reserve_crashkernel_cma") added CMA support for kdump
crashkernel reservation.
Extend crashkernel CMA reservation support to powerpc.
The following changes are made to enable CMA reservation on powerpc:
- Parse and obtain the CMA reservation size along with other crashkernel
parameters
- Call reserve_crashkernel_cma() to allocate the CMA region for kdump
- Include the CMA-reserved ranges in the usable memory ranges for the
kdump kernel to use.
- Exclude the CMA-reserved ranges from the crash kernel memory to
prevent them from being exported through /proc/vmcore.
With the introduction of the CMA crashkernel regions,
crash_exclude_mem_range() needs to be called multiple times to exclude
both crashk_res and crashk_cma_ranges from the crash memory ranges. To
avoid repetitive logic for validating mem_ranges size and handling
reallocation when required, this functionality is moved to a new wrapper
function crash_exclude_mem_range_guarded().
To ensure proper CMA reservation, reserve_crashkernel_cma() is called
after pageblock_order is initialized.
Update kernel-parameters.txt to document CMA support for crashkernel on
powerpc architecture.
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251107080334.708028-1-sourabhjain@linux.ibm.com
Systems can now be partitioned into resource groups. By default all
systems will be part of default resource group. Once a resource group is
created, and resources allocated to the resource group, those resources
will be removed from the default resource group. If a LPAR moved to a
resource group, then it can only use resources in the resource group.
So maximum processors that can be allocated to a LPAR can be equal or
smaller than the resources in the resource group.
lparcfg can now exposes the resource group id to which this LPAR belongs
to. It also exposes the number of processors in the current resource
group. The default resource group id happens to be 0. These would be
documented in the upcoming PAPR update.
Example of an LPAR in a default resource group
root@ltcp11-lp3 $ grep resource_group /proc/powerpc/lparcfg
resource_group_number=0
resource_group_active_processors=50
root@ltcp11-lp3 $
Example of an LPAR in a non-default resource group
root@ltcp11-lp5 $ grep resource_group /proc/powerpc/lparcfg
resource_group_number=1
resource_group_active_processors=30
root@ltcp11-lp5 $
Signed-off-by: Srikar Dronamraju <srikar@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20250716104600.59102-1-srikar@linux.ibm.com
Add vdd-supply and vddio-supply for fsl,mpl3115 to fix below CHECK_DTBS
warnings:
arch/arm/boot/dts/nxp/imx/imx53-ppd.dtb: pressure-sensor@60 (fsl,mpl3115): 'vdd-supply' is a required property
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Delete usb3_lpcg node for imx8dxl because not exist at such hardware.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add fsl,tuning-step for usdhc1 and usdhc2 to improve card compatibility.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Default, state_100mhz and state_200mhz use the same settings. But current
kernel driver use these to indicate if sd3.0 support.
Add max-frequency for usdhc2 because board design limitation.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Introduce the Stratix10 SoC Service Layer (SVC) node for Agilex5 SoCs. This
node includes the compatible string "intel,agilex5-svc" and references a
reserved memory region used for communication with the Secure Device
Manager (SDM).
Agilex5 introduces changes in how reserved memory is mapped and accessed
compared to previous SoC generations. This commit updates the device tree
structure to support Agilex5-specific handling of the SVC interface.
Signed-off-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
CP110 based platforms rely on the bootloader for pci port
initialization.
TF-A actively prevents non-uboot re-configuration of pci lanes, and many
boards do not have software control over the pci card reset.
If a pci port had link at boot-time and the clock is stopped at a later
point, the link fails and can not be recovered.
PCI controller driver probe - and by extension ownership of a driver for
the pci clocks - may be delayed especially on large modular kernels,
causing the clock core to start disabling unused clocks.
Add the CLK_IGNORE_UNUSED flag to the three pci port's clocks to ensure
they are not stopped before the pci controller driver has taken
ownership and tested for an existing link.
This fixes failed pci link detection when controller driver probes late,
e.g. with arm64 defconfig and CONFIG_PHY_MVEBU_CP110_COMPHY=m.
Closes: https://lore.kernel.org/r/b71596c7-461b-44b6-89ab-3cfbd492639f@solid-run.com
Signed-off-by: Josua Mayer <josua@solid-run.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
This reverts commit 794a066688 because it
misunderstood interworking between arm trusted firmware and the common
phy driver, and does not consistently resolve the issue it was intended
to address.
Further diagnostics have revealed the root cause for the reported system
lock-up in a race condition between pci driver probe and clock core
disabling unused clocks.
Revert the wrong change restoring driver control over all pci lanes.
As a temporary workaround for the original issue, users can boot with
"clk_ignore_unused".
Signed-off-by: Josua Mayer <josua@solid-run.com>
Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Update the "nand-rb" pinctrl child node names to use the defined "-pins"
suffix fixing DT schema warnings.
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Gregory CLEMENT <gregory.clement@bootlin.com>
The LXA device trees are the only STM32MP1 device tree that specify
vusb_d/usb_a-supply and apparently not for good reason:
- vusb_d-supply (vdd_usb) is the same as the phy-supply for usbphyc_port1
- vusb_a-supply (reg18) is the same as vdda1v8-supply for usbphyc_port1
and usbphyc_port1 is linked to the usbotg_hs node via the phys property.
Specifying the regulators in the &usbotg_hs node is thus superfluous and
has been even found to be harmful in one instance:
Linux v6.10 was found to lock up every 50-125 or so reboots on the LXA
TAC when the DWC2 driver probe enables the regulators in bulk, unless
both properties were removed.
This issue was so far not reproducible on v6.17 (> 500 reboots), but as
these properties are unnecessary and different from other STM32MP1
boards, remove them anyway.
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Link: https://lore.kernel.org/r/20251007-lxa-usb-dt-v1-1-cacde8088bb9@pengutronix.de
Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
Move st,adc-freq, st,mod-12b, st,ref-sel, and st,sample-time properties
from the touchscreen subnode to the parent touch@44 node. These properties
are defined in the st,stmpe.yaml schema for the parent node, not the
touchscreen subnode, resolving the validation error about unevaluated
properties.
Fixes: 27538a18a4 ("ARM: dts: stm32: add STM32MP1-based Phytec SoM")
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Link: https://lore.kernel.org/r/20250915224611.169980-1-jihed.chaibi.dev@gmail.com
Signed-off-by: Alexandre Torgue <alexandre.torgue@foss.st.com>
The CSI-2 receivers in the i.MX8MP have 3 output channels. Specify this
in the device tree, to enable support for more than one channel.
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
In Agilex5, the TBU (Translation Buffer Unit) can now operate in non-secure
mode, enabling Linux to utilize it through the IOMMU framework. This allows
improved memory management capabilities in non-secure environments. With
Agilex5 lifting this restriction, we are now extending the device tree
bindings to support IOMMU for the Agilex5 SVC.
Signed-off-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Remove the code to disable IRQs when unregistering KVM's user-return
notifier now that KVM doesn't invoke kvm_on_user_return() when disabling
virtualization via IPI function call, i.e. now that there's no need to
guard against re-entrancy via IPI callback.
Note, disabling IRQs has largely been unnecessary since commit
a377ac1cd9 ("x86/entry: Move user return notifier out of loop") moved
fire_user_return_notifiers() into the section with IRQs disabled. In doing
so, the commit somewhat inadvertently fixed the underlying issue that
was papered over by commit 1650b4ebc9 ("KVM: Disable irq while
unregistering user notifier"). I.e. in practice, the code and comment
has been stale since commit a377ac1cd9.
Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
[sean: rewrite changelog after rebasing, drop lockdep assert]
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030191528.3380553-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Leave KVM's user-return notifier registered in the unlikely case that the
notifier is registered when disabling virtualization via IPI callback in
response to reboot/shutdown. On reboot/shutdown, keeping the notifier
registered is ok as far as MSR state is concerned (arguably better then
restoring MSRs at an unknown point in time), as the callback will run
cleanly and restore host MSRs if the CPU manages to return to userspace
before the system goes down.
The only wrinkle is that if kvm.ko module unload manages to race with
reboot/shutdown, then leaving the notifier registered could lead to
use-after-free due to calling into unloaded kvm.ko module code. But such
a race is only possible on --forced reboot/shutdown, because otherwise
userspace tasks would be frozen before kvm_shutdown() is called, i.e. on a
"normal" reboot/shutdown, it should be impossible for the CPU to return to
userspace after kvm_shutdown().
Furthermore, on a --forced reboot/shutdown, unregistering the user-return
hook from IRQ context doesn't fully guard against use-after-free, because
KVM could immediately re-register the hook, e.g. if the IRQ arrives before
kvm_user_return_register_notifier() is called.
Rather than trying to guard against the IPI in the "normal" user-return
code, which is difficult and noisy, simply leave the user-return notifier
registered on a reboot, and bump the kvm.ko module refcount to defend
against a use-after-free due to kvm.ko unload racing against reboot.
Alternatively, KVM could allow kvm.ko and try to drop the notifiers during
kvm_x86_exit(), but that's also a can of worms as registration is per-CPU,
and so KVM would need to blast an IPI, and doing so while a reboot/shutdown
is in-progress is far risky than preventing userspace from unloading KVM.
Link: https://patch.msgid.link/20251030191528.3380553-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When freeing the per-CPU user-return MSRs structures, WARN if any CPU has
a registered notifier to help detect and/or debug potential use-after-free
issues. The lifecycle of the notifiers is rather convoluted, and has
several non-obvious paths where notifiers are unregistered, i.e. isn't
exactly the most robust code possible.
The notifiers they are registered on-demand in KVM, on the first WRMSR to
a tracked register. _Usually_ the notifier is unregistered whenever the
CPU returns to userspace. But because any given CPU isn't guaranteed to
return to userspace, e.g. the CPU could be offlined before doing so, KVM
also "drops", a.k.a. unregisters, the notifiers when virtualization is
disabled on the CPU.
Further complicating the unregister path is the fact that the calls to
disable virtualization come from common KVM, and the per-CPU calls are
guarded by a per-CPU flag (to harden _that_ code against bugs, e.g. due to
mishandling reboot). Reboot/shutdown in particular is problematic, as KVM
disables virtualization via IPI function call, i.e. from IRQ context,
instead of using the cpuhp framework, which runs in task context. I.e. on
reboot/shutdown, drop_user_return_notifiers() is called asynchronously.
Forced reboot/shutdown is the most problematic scenario, as userspace tasks
are not frozen before kvm_shutdown() is invoked, i.e. KVM could be actively
manipulating the user-return MSR lists and/or notifiers when the IPI
arrives. To a certain extent, all bets are off when userspace forces a
reboot/shutdown, but KVM should at least avoid a use-after-free, e.g. to
avoid crashing the kernel when trying to reboot.
Link: https://patch.msgid.link/20251030191528.3380553-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Set all user-return MSRs to their post-TD-exit value when preparing to run
a TDX vCPU to ensure the value that KVM expects to be loaded after running
the vCPU is indeed the value that's loaded in hardware. If the TDX-Module
doesn't actually enter the guest, i.e. doesn't do VM-Enter, then it won't
"restore" VMM state, i.e. won't clobber user-return MSRs to their expected
post-run values, in which case simply updating KVM's "cached" value will
effectively corrupt the cache due to hardware still holding the original
value.
In theory, KVM could conditionally update the current user-return value if
and only if tdh_vp_enter() succeeds, but in practice "success" doesn't
guarantee the TDX-Module actually entered the guest, e.g. if the TDX-Module
synthesizes an EPT Violation because it suspects a zero-step attack.
Force-load the expected values instead of trying to decipher whether or
not the TDX-Module restored/clobbered MSRs, as the risk doesn't justify
the benefits. Effectively avoiding four WRMSRs once per run loop (even if
the vCPU is scheduled out, user-return MSRs only need to be reloaded if
the CPU exits to userspace or runs a non-TDX vCPU) is likely in the noise
when amortized over all entries, given the cost of running a TDX vCPU.
E.g. the cost of the WRMSRs is somewhere between ~300 and ~500 cycles,
whereas the cost of a _single_ roundtrip to/from a TDX guest is thousands
of cycles.
Fixes: e0b4f31a3c ("KVM: TDX: restore user ret MSRs")
Cc: stable@vger.kernel.org
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://patch.msgid.link/20251030191528.3380553-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fix an interaction between SMM and PV asynchronous #PFs where an #SMI can
cause KVM to drop an async #PF ready event, and thus result in guest tasks
becoming permanently stuck due to the task that encountered the #PF never
being resumed. Specifically, don't clear the completion queue when paging
is disabled, and re-check for completed async #PFs if/when paging is
enabled.
Prior to commit 2635b5c4a0 ("KVM: x86: interrupt based APF 'page ready'
event delivery"), flushing the APF queue without notifying the guest of
completed APF requests when paging is disabled was "necessary", in that
delivering a #PF to the guest when paging is disabled would likely confuse
and/or crash the guest. And presumably the original async #PF development
assumed that a guest would only disable paging when there was no intent to
ever re-enable paging.
That assumption fails in several scenarios, most visibly on an emulated
SMI, as entering SMM always disables CR0.PG (i.e. initially runs with
paging disabled). When the SMM handler eventually executes RSM, the
interrupted paging-enabled is restored, and the async #PF event is lost.
Similarly, invoking firmware, e.g. via EFI runtime calls, might require a
transition through paging modes and thus also disable paging with valid
entries in the competion queue.
To avoid dropping completion events, drop the "clear" entirely, and handle
paging-enable transitions in the same way KVM already handles APIC
enable/disable events: if a vCPU's APIC is disabled, APF completion events
are not kept pending and not injected while APIC is disabled. Once a
vCPU's APIC is re-enabled, KVM raises KVM_REQ_APF_READY so that the vCPU
recognizes any pending pending #APF ready events.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251015033258.50974-4-mlevitsk@redhat.com
[sean: rework changelog to call out #PF injection, drop "real mode"
references, expand the code comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fix a semi theoretical race condition related to a lack of memory barriers
when dealing with vcpu->arch.apf.pageready_pending. In theory, the "ready"
side could see a stale pageready_pending and neglect to kick the vCPU, and
thus allow the vCPU to enter the guest with a pending KVM_REQ_APF_READY
and no kick/IPI on the way, in which case the KVM would fail to deliver a
completed async #PF event to the guest in a timely manner as the request
would be recognized only on the next (coincidental) VM-Exit.
kvm_arch_async_page_present_queued() running in workqueue context:
kvm_make_request(KVM_REQ_APF_READY, vcpu);
/* memory barrier is missing here*/
if (!vcpu->arch.apf.pageready_pending)
kvm_vcpu_kick(vcpu);
kvm_set_msr_common() running in task context:
vcpu->arch.apf.pageready_pending = false;
/* memory barrier is missing here*/
And later, vcpu_enter_guest() running in task context:
if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
kvm_check_async_pf_completion(vcpu)
Add missing full memory barriers in both cases to avoid theoretical
case of not kicking the vCPU thread.
Note that the bug is mostly theoretical because kvm_make_request()
uses an atomic operation, which is always serializing on x86, requiring
only for documentation purposes the smp_mb__after_atomic() after it
(smp_mb__after_atomic() is a NOP on x86).
The second missing barrier, between kvm_set_msr_common() and
vcpu_enter_guest(), isn't strictly needed because KVM executes several
barriers in between calling these functions, however it still makes
sense to have an explicit barrier to be on the safe side and to document
the ordering dependencies.
Finally, also use READ_ONCE/WRITE_ONCE.
Thanks a lot to Paolo for the help with this patch.
Link: https://lore.kernel.org/all/7c7a5a75-a786-4a05-a836-4368582ca4c2@redhat.com
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://patch.msgid.link/20251015033258.50974-3-mlevitsk@redhat.com
[sean: explain the race and its impact in more detail]
Signed-off-by: Sean Christopherson <seanjc@google.com>
The Radxa ROCK 5B+/5T USB Type-C port supports Dual Role Data and
should also act as a host. However, currently, when acting as a host,
only self-powered devices work.
Since the ROCK 5B+ supports Dual Role Power, set the power-role
property to "dual" and the try-power-role property to "sink". (along
with related properties)
The ROCK 5T should only support the "source" power-role.
This allows the port to act as a host, supply power to the port, and
allow bus-powered devices to work.
Note that on the ROCK 5T, with this patch applied, it has been
observed that some bus-powered devices do not work correctly. Also,
it has been observed that after connecting a device (and the data-role
switches to host), connecting a host device does not switch the
data-role back to the device role. These issues should be addressed
separately.
Note that there is a separate known issue where USB 3.0 SuperSpeed
devices do not work when oriented in reverse. This issue should also be
addressed separately. (USB 2.0/1.1 devices work in both orientations)
Fixes: 67b2c15d8f ("arm64: dts: rockchip: add USB-C support for ROCK 5B/5B+/5T")
Signed-off-by: FUKAUMI Naoki <naoki@radxa.com>
Link: https://patch.msgid.link/20251104085227.820-1-naoki@radxa.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
When the device was first added, there was a problem with the bluetooth
controller that manifested when DMA was enabled for the underlying UART
interface. At some point in the intervening time the problem appears to
have been resolved. Add the UART rx and tx channels back to re-enable
UART.
Signed-off-by: Chris Morgan <macromorgan@hotmail.com>
Link: https://patch.msgid.link/20251105205708.732125-6-macroalpha82@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
SPEC_CTRL is an MSR, i.e. a 64-bit value, but the VMRUN assembly code
assumes bits 63:32 are always zero. The bug is _currently_ benign because
neither KVM nor the kernel support setting any of bits 63:32, but it's
still a bug that needs to be fixed.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251106191230.182393-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Some of the new field added to socinfo structure with version 21, 22
and 23 which is only used by boot firmware and it is of no use for
Linux.Add reserve field in socinfo so that the structure remain
updated and prepared if we get any new field in future which could
be used by Linux. While at it, also updates switch case for backward
compatibility if the SoC runs with boot firmware which has these
new version added.
Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251104130906.167666-2-mukesh.ojha@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Document the compatible string for the MusePi Pro [1]. It is a 1.8-inch
single board computer based on the SpacemiT K1/M1 RISC-V SoC [2].
Here's a refined list of its core features:
- SoC: SpacemiT M1/K1, 8-core 64-bit RISC-V.
- Memory: LPDDR4X @ 2400MT/s, available in 8GB & 16GB options.
- Storage: Onboard eMMC 5.1 (64GB/128GB options), M.2 M-Key for NVMe
SSD (2230 size), and a microSD slot (UHS-II) for expansion.
- Display: HDMI 1.4 (1080P@60Hz) and 2-lane MIPI DSI FPC (1080P@60Hz).
- Connectivity: Onboard Wi-Fi 6 & Bluetooth 5.2, single Gigabit Ethernet
port (RJ45).
- USB: 4x USB 3.0 Type-A (host) and 1x USB 2.0 Type-C (device/OTG).
- Expansion: Full-size miniPCIe slot and a second M.2 M-Key (2230).
- GPIO: Standard 40-pin GPIO interface.
- MIPI: 1x 4-lane MIPI CSI FPC and 2x MIPI DSI FPC interfaces.
- Clock: Onboard RTC with battery support.
Link: https://developer.spacemit.com/documentation?token=YJtdwnvvViPVcmkoPDpcvwfVnrh&type=pdf [1]
Link: https://www.spacemit.com/en/key-stone-k1 [2]
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Troy Mitchell <troy.mitchell@linux.spacemit.com>
Link: https://lore.kernel.org/r/20251023-k1-musepi-pro-dts-v4-1-01836303e10f@linux.spacemit.com
Signed-off-by: Yixun Lan <dlan@gentoo.org>
Inheriting the vDSO from the host is problematic. The values read
from the time functions will not be correct for the UML kernel.
Furthermore the start and end of the vDSO are not stable or
detectable by userspace. Specifically the vDSO datapages start
before AT_SYSINFO_EHDR and the vDSO itself is larger than a single page.
This codepath is only used on 32bit x86 UML. In my testing with both
32bit and 64bit hosts the passthrough functionality has always been
disabled anyways due to the checks against envp in scan_elf_aux().
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://patch.msgid.link/20251028-uml-remove-32bit-pseudo-vdso-v1-4-e930063eff5f@weissschuh.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The openwrt has 3 status leds at the front:
* red: Used as failsafe led by openwrt
* white: Used as boot led by openwrt
* green: Used as running/upgrade led by openwrt
On the back each RJ45 jack has the typical amber/green leds. For the WAN
jack this is hardware controlled by the phy, for LAN these are under
software control and enabled by this patch.
Signed-off-by: Sjoerd Simons <sjoerd@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
The openwrt one has a SPI NOR flash which from factory is used for:
* Recovery system
* WiFi eeprom data
* ethernet Mac addresses
Describe this following the same partitions as the openwrt configuration
uses.
Signed-off-by: Sjoerd Simons <sjoerd@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Add devicetree for Bpi-R4-Pro.
BananaPi R4 Pro is a MT7988A based board which exists in 2 different
hardware versions:
- 4E: 4 GB RAM and using internal 2.5G Phy for WAN-Combo
- 8X: 8 GB RAM and 2x Aeonsemi AS21010P 10G phys
common parts:
- MediaTek MT7988A Quad-core Arm Corex-A73,1.8GHz processor
- 8GB eMMC flash
- 256MB SPI-NAND Flash
- Micro SD card slot
- 1x 10G SFP+ WAN
- 1x 10G SFP+ LAN
- 4x 2.5G RJ45 LAN (MxL86252C)
- 1x 1G RJ45 LAN (MT7988 internal switch)
- 2x miniPCIe slots with PCIe3.0 2lane interface for Wi-Fi NIC
- 2x M.2 M-KEY slots with PCIe3.0 1lane interface for NVME SSD
- 3x M.2 B-KEY slots with USB3.2 for 5G Module (PCIe shared with key-m)
- 1x USB3.2 slot
- 1x USB2.0 slot
- 1x USB TypeC Debug Console
- 2x13 PIN Header for expanding application
https://docs.banana-pi.org/en/BPI-R4_Pro/BananaPi_BPI-R4_Pro
The PCIe is per default in key-m state and can be changed to key-b with
the pcie-overlays.
Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
The internal 2.5G phy of mt7988 is only used by some specific board
variants.
Disable it by default and enable it where needed.
Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Following the existing convention of disabling nodes in the SoC file and
enabling only the required ones in the board file, disable "mcu_cpsw" node
in the SoC file "k3-j721s2-mcu-wakeup.dtsi" and enable it in the board
files:
a) k3-am68-phyboard-izar.dts
b) k3-am68-sk-base-board.dts
c) k3-j721s2-common-proc-board.dts
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://patch.msgid.link/20251015111344.3639415-6-s-vadapalli@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Following the existing convention of disabling nodes in the SoC file and
enabling only the required ones in the board file, disable "mcu_cpsw" node
in the SoC file "k3-j721e-mcu-wakeup.dtsi" and enable it in the board
files:
a) k3-j721e-beagleboneai64.dts
b) k3-j721e-common-proc-board.dts
c) k3-j721e-sk.dts
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://patch.msgid.link/20251015111344.3639415-5-s-vadapalli@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Following the existing convention of disabling nodes in the SoC file and
enabling only the required ones in the board file, disable "mcu_cpsw" node
in the SoC file "k3-j7200-mcu-wakeup.dtsi" and enable it in the board file
"k3-j7200-common-proc-board.dts".
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://patch.msgid.link/20251015111344.3639415-4-s-vadapalli@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Following the existing convention of disabling nodes in the SoC file and
enabling only the required ones in the board file, disable "mcu_cpsw" node
in the SoC file "k3-am65-mcu.dtsi" and enable it in the board file
"k3-am654-base-board.dts". Also, now that "mcu_cpsw" is disabled in the
SoC file, disabling it in "k3-am65-iot2050-common.dtsi" is no longer
required. Hence, remove the section corresponding to this change.
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://patch.msgid.link/20251015111344.3639415-3-s-vadapalli@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Following the existing convention of disabling nodes in the SoC file and
enabling only the required ones in the board file, disable "cpsw3g" node
in the SoC file "k3-am62-main.dtsi" and enable it in the board (or board
include) files:
a) k3-am62-lp-sk.dts
b) k3-am62-phycore-som.dtsi
c) k3-am625-beagleplay.dts
d) k3-am625-sk-common.dtsi
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Link: https://patch.msgid.link/20251015111344.3639415-2-s-vadapalli@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The CANUART pins of mcu_mcan0, mcu_mcan1, mcu_uart0 and wkup_uart0 are
powered during Partial-IO and I/O Only + DDR and are capable of waking
up the system in these states. Specify the states in which these units
can do a wakeup on this board.
Note that the UARTs are not capable of wakeup in Partial-IO because of
of a UART mux on the board not being powered during Partial-IO.
Add pincontrol definitions for mcu_mcan0 and mcu_mcan1 for wakeup from
Partial-IO. Add these as wakeup pinctrl entries for both devices.
Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com>
Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-6-b8d9ff5f2742@baylibre.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The CANUART pins of mcu_mcan0, mcu_mcan1, mcu_uart0 and wkup_uart0 are
powered during Partial-IO and I/O Only + DDR and are capable of waking
up the system in these states. Specify the states in which these units
can do a wakeup on this board.
Note that the UARTs are not capable of wakeup in Partial-IO because of
of a UART mux on the board not being powered during Partial-IO.
Add pincontrol definitions for mcu_mcan0 and mcu_mcan1 for wakeup from
Partial-IO. Add these as wakeup pinctrl entries for both devices.
Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com>
Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-5-b8d9ff5f2742@baylibre.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The CANUART pins of mcu_mcan0, mcu_mcan1, mcu_uart0 and wkup_uart0 are
powered during Partial-IO and I/O Only + DDR and are capable of waking
up the system in these states. Specify the states in which these units
can do a wakeup on this board.
Note that the UARTs are not capable of wakeup in Partial-IO because of
of a UART mux on the board not being powered during Partial-IO. As I/O
Only + DDR is not supported on AM62x, the UARTs are not added in this
patch.
Add pincontrol definitions for mcu_mcan0 and mcu_mcan1 for wakeup from
Partial-IO. Add these as wakeup pinctrl entries for both devices.
Signed-off-by: Markus Schneider-Pargmann (TI.com) <msp@baylibre.com>
Link: https://patch.msgid.link/20251103-topic-am62-dt-partialio-v6-15-v5-4-b8d9ff5f2742@baylibre.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Since commit 9dee9cb2df08 ("arm64: dts: ti: k3-j722s-main: fix the audio
refclk source") the clock nodes of the am62p and j722 are the same. Move
them into the commit dtsi.
Please note, that for the j722s the nodes are renamed from clock@ to
clock-controller@.
Suggested-by: Udit Kumar <u-kumar1@ti.com>
Signed-off-by: Michael Walle <mwalle@kernel.org>
Link: https://patch.msgid.link/20251103152826.1608309-1-mwalle@kernel.org
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The MAC Ports across all of the CPSW instances (CPSW2G, CPSW3G, CPSW5G and
CPSW9G) present in various K3 SoCs only support the 'RGMII-ID' mode. This
correction has been implemented/enforced by the updates to:
a) Device-Tree binding for CPSW [0]
b) Driver for CPSW [1]
c) Driver for CPSW MAC Port's GMII [2]
To complete the transition from 'RGMII-RXID' to 'RGMII-ID', update the
'phy-mode' property for all CPSW ports by replacing 'rgmii-rxid' with
'rgmii-id'.
[0]: commit 9b357ea525 ("dt-bindings: net: ti: k3-am654-cpsw-nuss: update phy-mode in example")
[1]: commit ca13b249f2 ("net: ethernet: ti: am65-cpsw: fixup PHY mode for fixed RGMII TX delay")
[2]: commit a22d3b0d49 ("phy: ti: gmii-sel: Always write the RGMII ID setting")
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Tested-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com> # k3-am642-tqma64xxl-mbax4xxl
Tested-by: Francesco Dolcini <francesco.dolcini@toradex.com> # Toradex Verdin AM62P
Link: https://patch.msgid.link/20251025073802.1790437-1-s-vadapalli@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Similar to other AM64x-based boards, add boot phase tags to make the
Device Trees usable for firmware/bootloaders without modification.
Supported boot devices are eMMC/SD card, SPI-NOR and USB (both mass
storage and DFU). The I2C EEPROM is included to allow the firmware to
select the correct RAM configuration for different TQMa64xxL variants.
Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com>
Link: https://patch.msgid.link/20251105141726.39579-1-matthias.schiffer@ew.tq-group.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
When preprocessing arch/arm/boot/dts/ti/omap/am335x-mba335x.dts with
clang, there are a couple of warnings about '/*' within a block comment.
arch/arm/boot/dts/ti/omap/am335x-mba335x.dts:260:7: warning: '/*' within block comment [-Wcomment]
260 | /* /* gpmc_csn3.gpio2_0 - interrupt */
| ^
arch/arm/boot/dts/ti/omap/am335x-mba335x.dts:267:7: warning: '/*' within block comment [-Wcomment]
267 | /* /* gpmc_ben1.gpio1_28 - interrupt */
| ^
Remove the duplicate '/*' to clear up the warning.
Fixes: 5267fcd180 ("ARM: dts: omap: Add support for TQMa335x/MBa335x")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20251105-omap-mba335x-fix-clang-comment-warning-v2-1-f8a0003e1003@kernel.org
Signed-off-by: Kevin Hilman <khilman@baylibre.com>
When emulating L2 instructions, svm_check_intercept() checks whether a
write to CR0 should trigger a synthesized #VMEXIT with
SVM_EXIT_CR0_SEL_WRITE. However, it does not check whether L1 enabled
the intercept for SVM_EXIT_WRITE_CR0, which has higher priority
according to the APM (24593—Rev. 3.42—March 2024, Table 15-7):
When both selective and non-selective CR0-write intercepts are active at
the same time, the non-selective intercept takes priority. With respect
to exceptions, the priority of this intercept is the same as the generic
CR0-write intercept.
Make sure L1 does NOT intercept SVM_EXIT_WRITE_CR0 before checking if
SVM_EXIT_CR0_SEL_WRITE needs to be injected.
Opportunistically tweak the "not CR0" logic to explicitly bail early so
that it's more obvious that only CR0 has a selective intercept, and that
modifying icpt_info.exit_code is functionally necessary so that the call
to nested_svm_exit_handled() checks the correct exit code.
Fixes: cfec82cb7d ("KVM: SVM: Add intercept check for emulated cr accesses")
Cc: stable@vger.kernel.org
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251024192918.3191141-4-yosry.ahmed@linux.dev
[sean: isolate non-CR0 write logic, tweak comments accordingly]
Signed-off-by: Sean Christopherson <seanjc@google.com>
When emulating L2 instructions, svm_check_intercept() checks whether a
write to CR0 should trigger a synthesized #VMEXIT with
SVM_EXIT_CR0_SEL_WRITE. For MOV-to-CR0, SVM_EXIT_CR0_SEL_WRITE is only
triggered if any bit other than CR0.MP and CR0.TS is updated. However,
according to the APM (24593—Rev. 3.42—March 2024, Table 15-7):
The LMSW instruction treats the selective CR0-write
intercept as a non-selective intercept (i.e., it intercepts
regardless of the value being written).
Skip checking the changed bits for x86_intercept_lmsw and always inject
SVM_EXIT_CR0_SEL_WRITE.
Fixes: cfec82cb7d ("KVM: SVM: Add intercept check for emulated cr accesses")
Cc: stable@vger.kernel.org
Reported-by: Matteo Rizzo <matteorizzo@google.com>
Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://patch.msgid.link/20251024192918.3191141-3-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
During vCPU creation, a vCPU may be destroyed immediately after
kvm_arch_vcpu_create() (e.g., due to vCPU id confiliction). However, the
vcpu_load() inside kvm_arch_vcpu_create() may have associate the vCPU to
pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))"
before invoking tdx_vcpu_free().
Though there's no need to invoke tdh_vp_flush() on the vCPU, failing to
dissociate the vCPU from pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)")
will cause list corruption of the per-pCPU list associated_tdvcpus.
Then, a later list_add() during vcpu_load() would detect list corruption
and print calltrace as shown below.
Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for the vCPUs
destroyed immediately after creation which must be in
VCPU_TD_STATE_UNINITIALIZED state.
kernel BUG at lib/list_debug.c:29!
Oops: invalid opcode: 0000 [#2] SMP NOPTI
RIP: 0010:__list_add_valid_or_report+0x82/0xd0
Call Trace:
<TASK>
tdx_vcpu_load+0xa8/0x120
vt_vcpu_load+0x25/0x30
kvm_arch_vcpu_load+0x81/0x300
vcpu_load+0x55/0x90
kvm_arch_vcpu_create+0x24f/0x330
kvm_vm_ioctl_create_vcpu+0x1b1/0x53
kvm_vm_ioctl+0xc2/0xa60
__x64_sys_ioctl+0x9a/0xf0
x64_sys_call+0x10ee/0x20d0
do_syscall_64+0xc3/0x470
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Fixes: d789fa6efa ("KVM: TDX: Handle vCPU dissociation")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
WARN and terminate the VM if TDH_MR_EXTEND fails, as extending the
measurement should fail if and only if there is a KVM bug, or if the S-EPT
mapping is invalid. Now that KVM makes all state transitions mutually
exclusive via tdx_vm_state_guard, it should be impossible for S-EPT
mappings to be removed between kvm_tdp_mmu_map_private_pfn() and
tdh_mr_extend().
Holding slots_lock prevents zaps due to memslot updates,
filemap_invalidate_lock() prevents zaps due to guest_memfd PUNCH_HOLE,
vcpu->mutex locks prevents updates from other vCPUs, kvm->lock prevents
VM-scoped ioctls from creating havoc (e.g. by creating new vCPUs), and all
usage of kvm_zap_gfn_range() is mutually exclusive with S-EPT entries that
can be used for the initial image.
For kvm_zap_gfn_range(), the call from sev.c is obviously mutually
exclusive, TDX disallows KVM_X86_QUIRK_IGNORE_GUEST_PAT so the same goes
for kvm_noncoherent_dma_assignment_start_or_stop(), and
__kvm_set_or_clear_apicv_inhibit() is blocked by virtue of holding all
VM and vCPU mutexes (and the APIC page has its own KVM-internal memslot
that is never created for TDX VMs, and so can't possibly be used for the
initial image, which means that too is mutually exclusive irrespective of
locking).
Opportunistically return early if the region doesn't need to be measured
in order to reduce line lengths and avoid wraps. Similarly, immediately
and explicitly return if TDH_MR_EXTEND fails to make it clear that KVM
needs to bail entirely if extending the measurement fails.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Acquire kvm->lock, kvm->slots_lock, and all vcpu->mutex locks when
servicing ioctls that (a) transition the TD to a new state, i.e. when
doing INIT or FINALIZE or (b) are only valid if the TD is in a specific
state, i.e. when initializing a vCPU or memory region. Acquiring "all"
the locks fixes several KVM_BUG_ON() situations where a SEAMCALL can fail
due to racing actions, e.g. if tdh_vp_create() contends with either
tdh_mr_extend() or tdh_mr_finalize().
For all intents and purposes, the paths in question are fully serialized,
i.e. there's no reason to try and allow anything remotely interesting to
happen. Smack 'em with a big hammer instead of trying to be "nice".
Acquire kvm->lock to prevent VM-wide things from happening, slots_lock to
prevent kvm_mmu_zap_all_fast(), and _all_ vCPU mutexes to prevent vCPUs
from interefering. Use the recently-renamed kvm_arch_vcpu_unlocked_ioctl()
to service the vCPU-scoped ioctls to avoid a lock inversion problem, e.g.
due to taking vcpu->mutex outside kvm->lock.
See also commit ecf371f8b0 ("KVM: SVM: Reject SEV{-ES} intra host
migration if vCPU creation is in-flight"), which fixed a similar bug with
SEV intra-host migration where an in-flight vCPU creation could race with
a VM-wide state transition.
Define a fancy new CLASS to handle the lock+check => unlock logic with
guard()-like syntax:
CLASS(tdx_vm_state_guard, guard)(kvm);
if (IS_ERR(guard))
return PTR_ERR(guard);
to simplify juggling the many locks.
Note! Take kvm->slots_lock *after* all vcpu->mutex locks, as per KVM's
soon-to-be-documented lock ordering rules[1].
Link: https://lore.kernel.org/all/20251016235538.171962-1-seanjc@google.com [1]
Reported-by: Yan Zhao <yan.y.zhao@intel.com>
Closes: https://lore.kernel.org/all/aLFiPq1smdzN3Ary@yzhao56-desk.sh.intel.com
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Unconditionally assert that mmu_lock is held for write when removing S-EPT
entries, not just when removing S-EPT entries triggers certain conditions,
e.g. needs to do TDH_MEM_TRACK or kick vCPUs out of the guest.
Conditionally asserting implies that it's safe to hold mmu_lock for read
when those paths aren't hit, which is simply not true, as KVM doesn't
support removing S-EPT entries under read-lock.
Only two paths lead to remove_external_spte(), and both paths asserts that
mmu_lock is held for write (tdp_mmu_set_spte() via lockdep, and
handle_removed_pt() via KVM_BUG_ON()).
Deliberately leave lockdep assertions in the "no vCPUs" helpers to document
that wait_for_sept_zap is guarded by holding mmu_lock for write, and keep
the conditional assert in tdx_track() as well, but with a comment to help
explain why holding mmu_lock for write matters (above and beyond why
tdx_sept_remove_private_spte()'s requirements).
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add TDX_BUG_ON() macros (with varying numbers of arguments) to deduplicate
the myriad flows that do KVM_BUG_ON()/WARN_ON_ONCE() followed by a call to
pr_tdx_error(). In addition to reducing boilerplate copy+paste code, this
also helps ensure that KVM provides consistent handling of SEAMCALL errors.
Opportunistically convert a handful of bare WARN_ON_ONCE() paths to the
equivalent of KVM_BUG_ON(), i.e. have them terminate the VM. If a SEAMCALL
error is fatal enough to WARN on, it's fatal enough to terminate the TD.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When populating the initial memory image for a TDX guest, ADD pages to the
TD as part of establishing the mappings in the mirror EPT, as opposed to
creating the mappings and then doing ADD after the fact. Doing ADD in the
S-EPT callbacks eliminates the need to track "premapped" pages, as the
mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails,
KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT).
Eliminating the hole where the M-EPT can have a mapping that doesn't exist
in the S-EPT in turn obviates the need to handle errors that are unique to
encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).
Keeping the M-EPT and S-EPT synchronized also eliminates the need to check
for unconsumed "premap" entries during tdx_td_finalize(), as there simply
can't be any such entries. Dropping that check in particular reduces the
overall cognitive load, as the management of nr_premapped with respect
to removal of S-EPT is _very_ subtle. E.g. successful removal of an S-EPT
entry after it completed ADD doesn't adjust nr_premapped, but it's not
clear why that's "ok" but having half-baked entries is not (it's not truly
"ok" in that removing pages from the image will likely prevent the guest
from booting, but from KVM's perspective it's "ok").
Doing ADD in the S-EPT path requires passing an argument via a scratch
field, but the current approach of tracking the number of "premapped"
pages effectively does the same. And the "premapped" counter is much more
dangerous, as it doesn't have a singular lock to protect its usage, since
nr_premapped can be modified as soon as mmu_lock is dropped, at least in
theory. I.e. nr_premapped is guarded by slots_lock, but only for "happy"
paths.
Note, this approach was used/tried at various points in TDX development,
but was ultimately discarded due to a desire to avoid stashing temporary
state in kvm_tdx. But as above, KVM ended up with such state anyways,
and fully committing to using temporary state provides better access
rules (100% guarded by slots_lock), and makes several edge cases flat out
impossible.
Note #2, continue to extend the measurement outside of mmu_lock, as it's
a slow operation (typically 16 SEAMCALLs per page whose data is included
in the measurement), and doesn't *need* to be done under mmu_lock, e.g.
for consistency purposes. However, MR.EXTEND isn't _that_ slow, e.g.
~1ms latency to measure a full page, so if it needs to be done under
mmu_lock in the future, e.g. because KVM gains a flow that can remove
S-EPT entries during KVM_TDX_INIT_MEM_REGION, then extending the
measurement can also be moved into the S-EPT mapping path (again, only if
absolutely necessary). P.S. _If_ MR.EXTEND is moved into the S-EPT path,
take care not to return an error up the stack if TDH_MR_EXTEND fails, as
removing the M-EPT entry but not the S-EPT entry would result in
inconsistent state!
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fold tdx_mem_page_record_premap_cnt() into tdx_sept_set_private_spte() as
providing a one-off helper for effectively three lines of code is at best a
wash, and splitting the code makes the comment for smp_rmb() _extremely_
confusing as the comment talks about reading kvm->arch.pre_fault_allowed
before kvm_tdx->state, but the immediately visible code does the exact
opposite.
Opportunistically rewrite the comments to more explicitly explain who is
checking what, as well as _why_ the ordering matters.
No functional change intended.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use atomic64_dec_return() when decrementing the number of "pre-mapped"
S-EPT pages to ensure that the count can't go negative without KVM
noticing. In theory, checking for '0' and then decrementing in a separate
operation could miss a 0=>-1 transition. In practice, such a condition is
impossible because nr_premapped is protected by slots_lock, i.e. doesn't
actually need to be an atomic (that wart will be addressed shortly).
Don't bother trying to keep the count non-negative, as the KVM_BUG_ON()
ensures the VM is dead, i.e. there's no point in trying to limp along.
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fold tdx_sept_drop_private_spte() into tdx_sept_remove_private_spte() as a
step towards having "remove" be the one and only function that deals with
removing/zapping/dropping a SPTE, e.g. to avoid having to differentiate
between "zap", "drop", and "remove". Eliminating the "drop" helper also
gets rid of what is effectively dead code due to redundant checks, e.g. on
an HKID being assigned.
No functional change intended.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Don't explicitly pin pages when mapping pages into the S-EPT, guest_memfd
doesn't support page migration in any capacity, i.e. there are no migrate
callbacks because guest_memfd pages *can't* be migrated. See the WARN in
kvm_gmem_migrate_folio().
Eliminating TDX's explicit pinning will also enable guest_memfd to support
in-place conversion between shared and private memory[1][2]. Because KVM
cannot distinguish between speculative/transient refcounts and the
intentional refcount for TDX on private pages[3], failing to release
private page refcount in TDX could cause guest_memfd to indefinitely wait
on decreasing the refcount for the splitting.
Under normal conditions, not holding an extra page refcount in TDX is safe
because guest_memfd ensures pages are retained until its invalidation
notification to KVM MMU is completed. However, if there're bugs in KVM/TDX
module, not holding an extra refcount when a page is mapped in S-EPT could
result in a page being released from guest_memfd while still mapped in the
S-EPT. But, doing work to make a fatal error slightly less fatal is a net
negative when that extra work adds complexity and confusion.
Several approaches were considered to address the refcount issue, including
- Attempting to modify the KVM unmap operation to return a failure,
which was deemed too complex and potentially incorrect[4].
- Increasing the folio reference count only upon S-EPT zapping failure[5].
- Use page flags or page_ext to indicate a page is still used by TDX[6],
which does not work for HVO (HugeTLB Vmemmap Optimization).
- Setting HWPOISON bit or leveraging folio_set_hugetlb_hwpoison()[7].
Due to the complexity or inappropriateness of these approaches, and the
fact that S-EPT zapping failure is currently only possible when there are
bugs in the KVM or TDX module, which is very rare in a production kernel,
a straightforward approach of simply not holding the page reference count
in TDX was chosen[8].
When S-EPT zapping errors occur, KVM_BUG_ON() is invoked to kick off all
vCPUs and mark the VM as dead. Although there is a potential window that a
private page mapped in the S-EPT could be reallocated and used outside the
VM, the loud warning from KVM_BUG_ON() should provide sufficient debug
information. To be robust against bugs, the user can enable panic_on_warn
as normal.
Link: https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com [1]
Link: https://youtu.be/UnBKahkAon4 [2]
Link: https://lore.kernel.org/all/CAGtprH_ypohFy9TOJ8Emm_roT4XbQUtLKZNFcM6Fr+fhTFkE0Q@mail.gmail.com [3]
Link: https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com [4]
Link: https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com [5]
Link: https://lore.kernel.org/all/aFkeBtuNBN1RrDAJ@yzhao56-desk.sh.intel.com [6]
Link: https://lore.kernel.org/all/diqzy0tikran.fsf@ackerleytng-ctop.c.googlers.com [7]
Link: https://lore.kernel.org/all/53ea5239f8ef9d8df9af593647243c10435fd219.camel@intel.com [8]
Suggested-by: Vishal Annapurve <vannapurve@google.com>
Suggested-by: Ackerley Tng <ackerleytng@google.com>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
[sean: extract out of hugepage series, massage changelog accordingly]
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add and use a new API for mapping a private pfn from guest_memfd into the
TDP MMU from TDX's post-populate hook instead of partially open-coding the
functionality into the TDX code. Sharing code with the pre-fault path
sounded good on paper, but it's fatally flawed as simulating a fault loses
the pfn, and calling back into gmem to re-retrieve the pfn creates locking
problems, e.g. kvm_gmem_populate() already holds the gmem invalidation
lock.
Providing a dedicated API will also removing several MMU exports that
ideally would not be exposed outside of the MMU, let alone to vendor code.
On that topic, opportunistically drop the kvm_mmu_load() export. Leave
kvm_tdp_mmu_gpa_is_mapped() alone for now; the entire commit that added
kvm_tdp_mmu_gpa_is_mapped() will be removed in the near future.
Gate the API on CONFIG_KVM_GUEST_MEMFD=y as private memory _must_ be backed
by guest_memfd. Add a lockdep-only assert to that the incoming pfn is
indeed backed by guest_memfd, and that the gmem instance's invalidate lock
is held (which, combined with slots_lock being held, obviates the need to
check for a stale "fault").
Cc: Michael Roth <michael.roth@amd.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/all/20250709232103.zwmufocd3l7sqk7y@amd.com
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop TDX's sanity check that a mirror EPT mapping isn't zapped between
creating said mapping and doing TDH.MEM.PAGE.ADD, as the check is
simultaneously superfluous and incomplete. Per commit 2608f10576
("KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU"), the
justification for introducing kvm_tdp_mmu_gpa_is_mapped() was to check
that the target gfn was pre-populated, with a link that points to this
snippet:
: > One small question:
: >
: > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
: > populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
: > then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
: > return error when it finds the region hasn't been pre-populated?
:
: Return an error. I don't love the idea of bleeding so many TDX details into
: userspace, but I'm pretty sure that ship sailed a long, long time ago.
But that justification makes little sense for the final code, as the check
on nr_premapped after TDH.MEM.PAGE.ADD will detect and return an error if
KVM attempted to zap a S-EPT entry (tdx_sept_zap_private_spte() will fail
on TDH.MEM.RANGE.BLOCK due lack of a valid S-EPT entry). And as evidenced
by the "is mapped?" code being guarded with CONFIG_KVM_PROVE_MMU=y, KVM is
NOT relying on the check for general correctness.
The sanity check is also incomplete in the sense that mmu_lock is dropped
between the check and TDH.MEM.PAGE.ADD, i.e. will only detect KVM bugs that
zap SPTEs in a very specific window (note, this also applies to the check
on nr_premapped).
Removing the sanity check will allow removing kvm_tdp_mmu_gpa_is_mapped(),
which has no business being exposed to vendor code, and more importantly
will pave the way for eliminating the "pre-map" approach entirely in favor
of doing TDH.MEM.PAGE.ADD under mmu_lock.
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename the "async" ioctl API to "unlocked" so that upcoming usage in x86's
TDX code doesn't result in a massive misnomer. To avoid having to retry
SEAMCALLs, TDX needs to acquire kvm->lock *and* all vcpu->mutex locks, and
acquiring all of those locks after/inside the current vCPU's mutex is a
non-starter. However, TDX also needs to acquire the vCPU's mutex and load
the vCPU, i.e. the handling is very much not async to the vCPU.
No functional change intended.
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Implement kvm_arch_vcpu_async_ioctl() "natively" in x86 and arm64 instead
of relying on an #ifdef'd stub, and drop HAVE_KVM_VCPU_ASYNC_IOCTL in
anticipation of using the API on x86. Once x86 uses the API, providing a
stub for one architecture and having all other architectures opt-in
requires more code than simply implementing the API in the lone holdout.
Eliminating the Kconfig will also reduce churn if the API is renamed in
the future (spoiler alert).
No functional change intended.
Acked-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
When the PCIe controller is running in endpoint mode, the controller
initialization is triggered by a PERST# (PCIe reset) GPIO deassertion.
The driver has configured an IRQ to trigger when the PERST# GPIO changes
state. Without the pinctrl definition, we do not get an IRQ when PERST#
is deasserted, so the PCIe controller never gets initialized.
Add the missing definitions, so that the controller actually gets
initialized.
Fixes: ec142c44b0 ("arm64: tegra: Add P2U and PCIe controller nodes to Tegra234 DT")
Fixes: 0580286d0d ("arm64: tegra: Add Tegra234 PCIe C4 EP definition")
Signed-off-by: Niklas Cassel <cassel@kernel.org>
Reviewed-by: Manikanta Maddireddy <mmaddireddy@nvidia.com>
[treding@nvidia.com: add blank lines to separate blocks]
Signed-off-by: Thierry Reding <treding@nvidia.com>
Convert TI OMAP SDHCI Controller binding to YAML format.
Changes during Conversion:
- Define new properties like "clocks", "clock-names",
"pbias-supply" and "power-domains" to resolve dtb_check errors.
- Remove "pinctrl-names" and "pinctrl-<n>"
from required as they are not necessary for all DTS files.
- Remove "ti,hwmods" property entirely from the YAML as the
DTS doesn't contain this property for the given compatibles and the
text binding is misleading.
- Add "clocks", "clock-names" and "max-frequency" to the required
properties based on the compatible and the text binding doesn't mention
these properties as required.
- Add missing strings like "default-rev11", "sdr12-rev11", "sdr25-rev11",
"hs-rev11", "sdr25-rev11" and "sleep" to pinctrl-names string array
to resolve errors detected by dtb_check.
Signed-off-by: Charan Pedumuru <charan.pedumuru@gmail.com>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://lore.kernel.org/r/20251024-ti-sdhci-omap-v5-3-df5f6f033a38@gmail.com
Signed-off-by: Kevin Hilman <khilman@baylibre.com>
The "ti,twl4030-power-n900" compatible string is obsolete and is not
supported by any in-kernel driver. Currently, the kernel falls back to
the second entry, "ti,twl4030-power-idle-osc-off", to bind a driver to
this node.
Make this fallback explicit by removing the obsolete board-specific
compatible. This preserves the existing functionality while making the
DTS compliant with the new, stricter 'ti,twl.yaml' binding.
Fixes: daebabd578 ("mfd: twl4030-power: Fix PM idle pin configuration to not conflict with regulators")
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Link: https://lore.kernel.org/r/20250914192516.164629-4-jihed.chaibi.dev@gmail.com
Signed-off-by: Kevin Hilman <khilman@baylibre.com>
The "ti,twl4030-power-beagleboard-xm" compatible string is obsolete and
is not supported by any in-kernel driver. Currently, the kernel falls
back to the second entry, "ti,twl4030-power-idle-osc-off", to bind a
driver to this node.
Make this fallback explicit by removing the obsolete board-specific
compatible. This preserves the existing functionality while making the
DTS compliant with the new, stricter 'ti,twl.yaml' binding.
Fixes: 9188883fd6 ("ARM: dts: Enable twl4030 off-idle configuration for selected omaps")
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Link: https://lore.kernel.org/r/20250914192516.164629-3-jihed.chaibi.dev@gmail.com
Signed-off-by: Kevin Hilman <khilman@baylibre.com>
Without a serial console speed specified in chosen/stdout-path in the
DTB, the serial console uses the default speed of the serial driver,
unless explicitly overridden in a legacy console= kernel command-line
parameter.
After dropping "ti,omap3-uart" from the list of compatible values in DT,
AM33xx serial ports can no longer be used with the legacy OMAP serial
driver, but only with the OMAP-flavored 8250 serial driver (which is
mutually-exclusive with the former). However, replacing
CONFIG_SERIAL_OMAP=y by CONFIG_SERIAL_8250_OMAP=y (with/without enabling
CONFIG_SERIAL_8250_OMAP_TTYO_FIXUP) may not be sufficient to restore
serial console functionality: the legacy OMAP serial driver defaults to
115200 bps, while the 8250 serial driver defaults to 9600 bps, causing
no visible output on the serial console when no appropriate console=
kernel command-line parameter is specified.
Fix this for all AM33xx boards by adding ":115200n8" to
chosen/stdout-path. This requires replacing the "&uartN" reference by
the corresponding "serialN" DT alias.
Fixes: ca8be8fc2c ("ARM: dts: am33xx-l4: fix UART compatible")
Fixes: 077e1cde78 ("ARM: omap2plus_defconfig: Enable 8250_OMAP")
Closes: https://lore.kernel.org/CAMuHMdUb7Jb2=GqK3=Rn+Gv5G9KogcQieqDvjDCkJA4zyX4VcA@mail.gmail.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com>
Tested-by: Matti Vaittinen <mazziesaccount@gmail.com>
Reviewed-by: Bruno Thomsen <bruno.thomsen@gmail.com>
Link: https://lore.kernel.org/r/63cef5c3643d359e8ec13366ca79377f12dd73b1.1759398641.git.geert+renesas@glider.be
Signed-off-by: Kevin Hilman <khilman@baylibre.com>
Add SMMU-V3 Performance Monitoring Counter Group (PMCG) nodes for
Agilex5 to support SMMU performance event monitoring.
Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add L2 and L3 cache nodes to the device tree to resolve the
"unable to detect cache hierarchy" warning reported by cacheinfo.
Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add the required clock-names property NAND controller. This change corrects
the warning:
socfpga_agilex5_socdk_nand.dtb: nand-controller@10b80000 (cdns,hp-nfc):
'clock-names' is a required property
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Update the compatible to reflect combination of DDIC and panel.
Original compatible describing only the DDIC used, but omit describing
the panel used (Samsung AMS641RW), which we have no way to detect.
There are two additional supplies used by the panel, both are GPIO
controlled and are left enabled by the bootloader for continuous splash.
Previously these were (incorrectly) modelled as pinctrl. Describe them
properly so that the panel can control them.
Fixes: 288ef8a426 ("arm64: dts: sdm845: add oneplus6/6t devices")
Signed-off-by: Casey Connolly <casey.connolly@linaro.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Co-developed-by: David Heidelberg <david@ixit.cz>
Signed-off-by: David Heidelberg <david@ixit.cz>
Link: https://lore.kernel.org/r/20251103-s6e3fc2x01-v6-1-d4eb4abaefa4@ixit.cz
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
HWKM v1 and v2 differ slightly in wrapped key size and the bit fields for
certain status registers and operating mode (legacy or standard).
Add support to select HWKM version based on the major and minor revisions.
Use this HWKM version to select wrapped key size and to configure the bit
fields in registers for operating modes and hardware status.
Support for SCM calls for wrapped keys is being added in the TrustZone for
few SoCs with HWKM v1. Existing check of qcom_scm_has_wrapped_key_support()
API ensures that HWKM is used only if these SCM calls are supported in
TrustZone for that SoC.
Signed-off-by: Neeraj Soni <neeraj.soni@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251030161012.3391239-1-neeraj.soni@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Rename "guest_paddr" variables in vm_userspace_mem_region_add() and
vm_mem_add() to KVM's de facto standard "gpa", both for consistency and
to shorten line lengths.
Opportunistically fix the indentation of the
vm_userspace_mem_region_add() declaration.
Link: https://patch.msgid.link/20251007223625.369939-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Prevent calling ti_sci_cmd_set_io_isolation() on firmware
that does not support the IO_ISOLATION capability. Add the
MSG_FLAG_CAPS_IO_ISOLATION capability flag and check it before
attempting to set IO isolation during suspend/resume operations.
Without this check, systems with older firmware may experience
undefined behavior or errors when entering/exiting suspend states.
Fixes: ec24643bdd ("firmware: ti_sci: Add system suspend and resume call")
Signed-off-by: Thomas Richard (TI.com) <thomas.richard@bootlin.com>
Reviewed-by: Kevin Hilman <khilman@baylibre.com>
Link: https://patch.msgid.link/20251031-ti-sci-io-isolation-v2-1-60d826b65949@bootlin.com
Signed-off-by: Nishanth Menon <nm@ti.com>
J784S4 SoC has two instances of PCIe which are PCIe0 and PCIe1. J784S4
SoC uses PCIe1 instance for PCIe boot process. To configure PCIe1 at
all boot stages "pcie1_ctrl" also needs to be present at all boot
stages. Thus add the "bootph-all" boot phase tag to "pcie1_ctrl" device
tree node.
Signed-off-by: Hrushikesh Salunke <h-salunke@ti.com>
Link: https://patch.msgid.link/20251017084654.2929945-4-h-salunke@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
J784S4 SoC has two instances of PCIe which are PCIe0 and PCIe1. PCIe1
instance is used for PCIe boot process. J784S4 SoC has four instances
of 4-lane SERDES. Out of which SERDES0 is used as PHY for PCIe1. So it
needs to be functional at all stages of PCIe boot process. Thus add the
"bootph-all" boot phase tag to nodes required to enable SERDES0 at all
boot stages.
Signed-off-by: Hrushikesh Salunke <h-salunke@ti.com>
Link: https://patch.msgid.link/20251017084654.2929945-3-h-salunke@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
J784S4 SoC has two instances of PCIe which are PCIe0 and PCIe1. J784S4
SoC uses PCIe1 instance for PCIe boot process. So it needs to be in
endpoint mode and it needs to be functional at all stages of PCIe boot
process. Thus add the "bootph-all" boot phase tag to "pcie1_ep" device
tree node.
Signed-off-by: Hrushikesh Salunke <h-salunke@ti.com>
Link: https://patch.msgid.link/20251017084654.2929945-2-h-salunke@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
There is currently a problem where, in the specific case of SMEM not
initialized by SBL, any SMEM API wrongly returns PROBE_DEFER
communicating wrong info to any user of this API.
A better way to handle this would be to track the SMEM state and return
a different kind of error than PROBE_DEFER.
Rework the __smem handle to always init it to the error pointer
-EPROBE_DEFER following what is already done by the SMEM API.
If we detect that the SBL didn't initialized SMEM, set the __smem handle
to the error pointer -ENODEV.
Also rework the SMEM API to handle the __smem handle to be an error
pointer and return it appropriately.
This way user of the API can react and return a proper error or use
fallback way for the failing API.
While at it, change the return error when SMEM is not initialized by SBL
also to -ENODEV to make it consistent with the __smem handle and use
dev_err_probe() helper to return the message.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://lore.kernel.org/r/20251031130835.7953-3-ansuelsmth@gmail.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Add INIT_ERR_PTR() macro to initialize static variables with error
pointers. This might be useful for specific case where there is a static
variable initialized to an error condition and then later set to the
real handle once probe finish/completes.
This is to handle compilation problems like:
error: initializer element is not constant
where ERR_PTR() can't be used.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20251031130835.7953-2-ansuelsmth@gmail.com
[bjorn: Added () suffix on macro references]
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The binding for tps65910 has been converted to yaml and instead of the
deprecated regulator-compatible, the node-names are now used to identify
the individual regulators. Also some additional required properties
were added.
Adapt the tps65910 nodes on Rockchip boards to adhere to the updated
binding, which also allows us to drop the tps65910.dtsi include.
Signed-off-by: Johan Jonker <jbx6244@gmail.com>
Link: https://patch.msgid.link/b3d05df4-a916-48e1-8d9e-590782806bd5@gmail.com
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Agilex5 SoCFPGA 013b is a small form factor development kit.
Supports both tabletop and PCIe add-in card operation. It features
expansion headers for Raspberry Pi 4/5 HATs and Digilent Pmod modules,
enabling integration with popular ecosystems.
Signed-off-by: Niravkumar L Rabara <niravkumarlaxmidas.rabara@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Describe reset controllers for VI, MISC, AP, DSP and AO subsystems. The
one for AO subsystem is marked as reserved, since it may be used by AON
firmware.
Reviewed-by: Drew Fustini <fustini@kernel.org>
Signed-off-by: Yao Zi <ziyao@disroot.org>
Signed-off-by: Drew Fustini <fustini@kernel.org>
Add VGIC maintenance interrupt and interrupt-parent property for
interrupt controller, required to run Linux in virtualized environment.
Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@intel.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
nand-controller@ffb90000 (altr,socfpga-denali-nand): Unevaluated
properties are not allowed ('flash@0' was unexpected)
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Unevaluated properties are not allowed ('phy-addr' was unexpected)
socfpga_stratix10_swvp.dtb: sysmgr@ffd12000 (altr,sys-mgr-s10):
'interrupts' does not match any of the regexes:
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
This devkit is very similar to P3450, except it has less RAM, no display
port, and only 3 USB host ports. Derive from P3450 and disable the
hardware that is unavailable.
GPIO PA6 is used to control the HDMI power rail and needs to be on for
hotplug detect to work. This is mapped to the 3.3V USB hub on P3450.
That USB rail is not used here, so delete the regulator to avoid
conflicts.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
- Add the audio devices for the Tegra264 SoC in the tegra264.dtsi file,
which includes sound, HDA and APE(Audio Processing Engine) subsystem
nodes.
APE subsystem includes,
- I/O interfaces such as I2S, DMIC and DSPK (all the available
instances).
- HW accelerators such as ASRC, OPE, MVC, SFC, AMX, ADX and Mixer (all
the available instances).
- ADMA controller and Interrupt controllers.
- Enable the audio nodes in tegra264-p3971.dtsi platform DT file.
Signed-off-by: sheetal <sheetal@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Add the device tree nodes for the MAIN and AON pin controllers found on
the Tegra186 family of SoCs.
Signed-off-by: Aaron Kling <webgeek1234@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Since only a single flash device is connected to ospi0 retain only the
OSPI0_CSn0 pin configuration and remove the unused CSn1-CSn3 pins from
the default pinctrl. This simplifies the ospi0 pin configuration without
affecting functionality.
Signed-off-by: Paresh Bhagat <p-bhagat@ti.com>
Reviewed-by: Andrew Davis <afd@ti.com>
Link: https://patch.msgid.link/20251029032144.502603-1-p-bhagat@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The I2C pins for some of the instances on J784S4/J742S2/AM69 are
configured as PIN_INPUT_PULLUP while these pins are open-drain type and
do not support internal pull-ups [0][1][2]. The pullup configuration
bits in the corresponding padconfig registers are reserved and any
writes to them have no effect and readback checks on those bits fail.
Update the pinmux settings to use PIN_INPUT instead of PIN_INPUT_PULLUP
to reflect the correct hardware behaviour.
[0]: https://www.ti.com/lit/gpn/tda4ah-q1 (J784S4 Datasheet: Table 5-1. Pin Attributes)
[1]: https://www.ti.com/lit/gpn/tda4ape-q1 (J742S2 Datasheet: Table 5-1. Pin Attributes)
[2]: https://www.ti.com/lit/gpn/am69a (AM69 Datasheet: Table 5-1. Pin Attributes)
Fixes: e20a06aca5 ("arm64: dts: ti: Add support for J784S4 EVM board")
Fixes: 635fb18ba0 ("arch: arm64: dts: Add support for AM69 Starter Kit")
Fixes: 0ec1a48d99 ("arm64: dts: ti: k3-am69-sk: Add pinmux for RPi Header")
Signed-off-by: Aniket Limaye <a-limaye@ti.com>
Reviewed-by: Udit Kumar <u-kumar1@ti.com>
Link: https://patch.msgid.link/20251022122638.234367-1-a-limaye@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The VAR-SOM-AM62P integrates an ADS7846 resistive touchscreen controller.
The controller is physically located on the SOM, and its signals are
routed to the SOM pins, allowing carrier boards to make use of it.
This patch adds the ADS7846 node under the appropriate SPI controller.
Signed-off-by: Stefano Radaelli <stefano.radaelli21@gmail.com>
Link: https://patch.msgid.link/20251003125031.30539-4-stefano.radaelli21@gmail.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The VAR-SOM-AM62P can integrate the WM8904, a high-performance
ultra-low-power stereo codec optimized for portable audio applications.
This patch adds the WM8904 device to the appropriate I2C bus, enables
the McASP1 peripheral, and introduces the sound node to expose the
sound card to the system.
Signed-off-by: Stefano Radaelli <stefano.radaelli21@gmail.com>
Link: https://patch.msgid.link/20251003125031.30539-3-stefano.radaelli21@gmail.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Add device tree support for the Kontron SMARC-sAM67 module, which is
based on a TI AM67A SoC.
The module features:
* Quad-core AM67A94 at 1.4GHz with 8 GiB RAM
* 64 GiB eMMC, 4 MiB SPI flash for failsafe booting
* Dedicated RTC
* Multiple interfaces: 4x UART, 2x USB 2.0/USB 3.2, 2x GBE, QSPI,
7x I2C,
* Display support: 2x LVDS, 1x DSI (*), 1x DP (*)
* Camera support: 4x CSI (*)
* Onboard microcontroller for boot control, failsafe booting and
external watchdog
(*) not yet supported by the kernel
There is a base device tree and overlays which will add optional
features. At the moment there is one full featured variant of that
board whose device tree is generated during build by merging all the
device tree overlays.
Signed-off-by: Michael Walle <mwalle@kernel.org>
Reviewed-by: Udit Kumar <u-kumar1@ti.com>
Link: https://patch.msgid.link/20251017135116.548236-3-mwalle@kernel.org
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
At the moment the clock parent of the audio extclk output is PLL1_HSDIV6
of the main domain. This very clock output is also used among various IP
cores, for example for the USB1 LPM clock. The audio extclock being an
external clock output with a variable frequency, it is likely that a
user of this clock will try to set it's frequency to a different value,
i.e. an audio codec. Because that clock output is used also for other IP
cores, bad things will happen.
Instead of using PLL1_HSDIV6 use the PLL2_HSDIV8 as a sane default, as
this output is exclusively used among other audio peripherals.
Signed-off-by: Michael Walle <mwalle@kernel.org>
Link: https://patch.msgid.link/20251017102228.530517-2-mwalle@kernel.org
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
Describe the "type" struct member using '@type' and move it together
with the rest of the doc for ctl_table_header to avoid a kernel-doc
warning:
Warning: include/linux/sysctl.h:178 Incorrect use of kernel-doc format:
* enum type - Enumeration to differentiate between ctl target types
Fixes: 2f2665c13a ("sysctl: replace child with an enumeration")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Due to an implementation detail in this SoC, additional passive
electrical components are required to achieve the maximum rated speed
of the SD controller when paired with a High-Speed SD Card. Without
them, the clock frequency must be limited to 37.5 MHz for link stability.
Because the reference design does not contain these components, most
(derivative) boards do not have them either. To accommodate for that,
apply the frequency limit by default and delegate lifting it to the
odd boards that do contain the necessary onboard hardware.
Signed-off-by: Sarthak Garg <quic_sartgarg@quicinc.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20250908104122.2062653-5-quic_sartgarg@quicinc.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The hwspinlock acquired via hwspin_lock_request_specific() is not
released on several error paths. This results in resource leakage
when probe fails.
Switch to devm_hwspin_lock_request_specific() to automatically
handle cleanup on probe failure. Remove the manual hwspin_lock_free()
in qcom_smem_remove() as devm handles it automatically.
Fixes: 20bb6c9de1 ("soc: qcom: smem: map only partitions used by local HOST")
Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251029022733.255-1-vulab@iscas.ac.cn
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The control-scb and mss-top-sysreg regions on PolarFire SoC both fulfill
multiple purposes. The former is used for mailbox functions in addition
to the temperature & voltage sensor while the latter is used for clocks,
resets, interrupt muxing and pinctrl.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Add EIP76 Random Number Generator (RNG) node within crypto engine for
AM62 and AM62A SoCs. The RNG hardware is integrated in crypto
subsystem at address 0x40910000.
Mark the RNG node with status "reserved" as it is intended for use by
OP-TEE for secure random number generation. If required, this hardware
can also be used through Linux kernel by enabling this node.
Signed-off-by: Shiva Tripathi <s-tripathi1@ti.com>
Link: https://patch.msgid.link/20250926100229.923547-1-s-tripathi1@ti.com
Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com>
The SPDIF TX (called OWA OUT in the datasheet) is available on three
pins. Of those, the PH pin is unlikely to be used since it conflicts
with the first Ethernet controller.
The Radxa Cubie A5E exposes SPDIF TX through the PI pin group on the
40-pin GPIO header.
The Orange Pi 4A exposes SPDIF TX through both the PB and PI pin
groups on the 40-pin GPIO header. The PB pin alternatively would be
used for I2S0 though.
Add pinmux settings for both options so potential users can directly
reference either one.
Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Link: https://patch.msgid.link/20251027125655.793277-10-wens@kernel.org
Signed-off-by: Chen-Yu Tsai <wens@kernel.org>
There are two DMA controllers on the A523, one in the main system area
and the other for the MCU. These are the same as the one found on the
A100. The only difference is the DMA endpoint (DRQ) layout.
Since the number of channels and endpoints are described with additional
generic properties, just add new A523-specific compatible strings and
fallback to the A100 one.
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20251027125655.793277-2-wens@kernel.org
Signed-off-by: Chen-Yu Tsai <wens@kernel.org>
The H616 has a NAND controller quite similar to the A10/A23 ones, but
with some register differences, more clocks (for ECC and MBUS), more ECC
strengths, so this requires a new compatible string.
Add the NAND controller node and pins in the device tree.
Signed-off-by: Richard Genoud <richard.genoud@bootlin.com>
Link: https://patch.msgid.link/20251028073534.526992-17-richard.genoud@bootlin.com
[wens@kernel.org: Fixed alignment of clocks in nand-controller node]
Signed-off-by: Chen-Yu Tsai <wens@kernel.org>
Recent changes have added an include of as-layout.h to pgtable.h.
However this introduces a circular dependency during asm-offsets
generation as as-layout.h depends on asm-offsets and pgtable.h is an
input for asm-offsets.
Building from a clean state results in the following error:
CC arch/um/kernel/asm-offsets.s
In file included from arch/um/include/asm/pgtable.h:48,
from include/linux/pgtable.h:6,
from include/linux/mm.h:31,
from include/linux/pid_namespace.h:7,
from include/linux/ptrace.h:10,
from include/linux/audit.h:13,
from arch/um/kernel/asm-offsets.c:8:
arch/um/include/shared/as-layout.h:9:10: fatal error: generated/asm-offsets.h: No such file or directory
9 | #include <generated/asm-offsets.h>
| ^~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[4]: *** [scripts/Makefile.build:182: arch/um/kernel/asm-offsets.s] Error 1
As the inclusion of as-layout.h in pgtable.h is not yet needed while
asm-offsets are generated, break the dependency here.
Fixes: a7f7dbae94 ("um: Remove file-based iomem emulation support")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20251028-uml-offsets-circular-v1-1-601c363cfaaa@weissschuh.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add Device Tree nodes to enable a PWM controlled fan and it's associated
thermal management for the Lichee Pi 4A board.
This enables temperature-controlled active cooling for the Lichee Pi 4A
board based on SoC temperature.
Reviewed-by: Drew Fustini <fustini@kernel.org>
Tested-by: Drew Fustini <fustini@kernel.org>
Reviewed-by: Elle Rhumsaa <elle@weathered-steel.dev>
Signed-off-by: Michal Wilczynski <m.wilczynski@samsung.com>
Signed-off-by: Drew Fustini <fustini@kernel.org>
Add SOC special compatible string, remove fallback snps,dwc3 to let flatten
dwc3-layerscape driver to be probed and enable dma-coherence for usb node
since commit add layerscape dwc3 support, which set correct gsbustcfg0
value.
Add iommus property to run at old uboot, which use fixup add iommus by
check compatible string snsp,dwc3 compatible string.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Leverage newly introduced 'leds' and 'led-names' properties to pass
indicator's phandle and function to v4l2 subnode. The latter supports
privacy led since couple of years ago under 'privacy-led' designation.
Unlike initially proposed trigger-source based approach, this solution
cannot be easily bypassed from userspace, thus reducing privacy
concerns.
Signed-off-by: Aleksandrs Vinarskis <alex@vinarskis.com>
Tested-by: Steev Klimaszewski <threeway@gmail.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250910-leds-v5-4-bb90a0f897d5@vinarskis.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The touchscreen properties previously upstreamed was based on downstream
touchscreen driver. We ended up adapting upstream edt_ft5x06 driver to
support the touchscreen controller used in this device. Update the
touchscreen properties to match with the upstream edt_ft5x06
driver.
Also, the touchscreen controller used in this device is ft5452 and not
fts8719. Fix the compatible string accordingly.
The wakeup-source property was removed as it prevents the touchscreen's
regulators and irq from being disabled when the device is suspended and
could lead to unexpected battery drain. Once low power mode and
tap-to-wake functionality is properly implemented and tested to be
working, we can add it back, if needed.
Signed-off-by: Joel Selvaraj <foss@joelselvaraj.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251021-shift-axolotl-fix-touchscreen-dts-v2-1-e94727f0aa7e@joelselvaraj.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Lantronix SM8550-HDK board may be equipped with a Rear Camera Card PCB
which contains:
* Samsung S3K33D time-of-fligt image sensor connected to CSIPHY0 (TOF),
* Omnivision OV64B40 image sensor connected to CSIPHY1 (uWide),
* Sony IMX766 image sensor connected to CSIPHY2 (Wide),
* Samsung S5K3M5 image sensor connected to CSIPHY3 (Tele),
* two flash leds.
The change adds support of a Samsung S5K3M5 camera image sensor and
two flash leds on the external camera card module.
Signed-off-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://lore.kernel.org/r/20251013235500.1883847-4-vladimir.zapolskiy@linaro.org
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The existing OPP table for PCIe is shared across different link
configurations such as data rates 8GT/s x2 and 16GT/s x1. These
configurations often operate at the same frequency, allowing them
to reuse the same OPP entries. However, 8GT/s and 16 GT/s may have
different RPMh votes which cannot be represented accurately when
sharing a single OPP.
To address this, introduce an `opp-level` to indicate the PCIe data
rate and uniquely differentiate OPP entries even when the frequenc
is the same.
Although this platform does not currently suffer from this issue, the
change is introduced to support unification across platforms.
Append the opp level to name of the opp node to indicate both frequency
and level.
Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://lore.kernel.org/r/20251013-opp_pcie-v5-4-eb64db2b4bd3@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The existing OPP table for PCIe is shared across different link
configurations such as data rates 8GT/s x2 and 16GT/s x1. These
configurations often operate at the same frequency, allowing them
to reuse the same OPP entries. However, 8GT/s and 16 GT/s may have
different RPMh votes which cannot be represented accurately when
sharing a single OPP.
To address this, introduce an `opp-level` to indicate the PCIe data
rate and uniquely differentiate OPP entries even when the frequenc
is the same.
Although this platform does not currently suffer from this issue, the
change is introduced to support unification across platforms.
Append the opp level to name of the opp node to indicate both frequency
and level.
Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://lore.kernel.org/r/20251013-opp_pcie-v5-3-eb64db2b4bd3@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The existing OPP table for PCIe is shared across different link
configurations such as data rates 8GT/s x2 and 16GT/s x1. These
configurations often operate at the same frequency, allowing them
to reuse the same OPP entries. However, 8GT/s and 16 GT/s may have
different RPMh votes which cannot be represented accurately when
sharing a single OPP.
To address this, introduce an `opp-level` to indicate the PCIe data
rate and uniquely differentiate OPP entries even when the frequency
is the same.
Although this platform does not currently suffer from this issue, the
change is introduced to support unification across platforms.
Append the opp level to name of the opp node to indicate both frequency
and level.
Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://lore.kernel.org/r/20251013-opp_pcie-v5-2-eb64db2b4bd3@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The existing OPP table for PCIe is shared across different link
configurations such as data rates 8GT/s x2 and 16GT/s x1. These
configurations often operate at the same frequency, allowing them
to reuse the same OPP entries. However, 8GT/s and 16 GT/s may have
different RPMh votes which cannot be represented accurately when
sharing a single OPP.
To address this, introduce an `opp-level` to indicate the PCIe data
rate and uniquely differentiate OPP entries even when the frequency
is the same.
Although this platform does not currently suffer from this issue, the
change is introduced to support unification across platforms.
Append the opp level to name of the opp node to indicate both frequency
and level.
Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Acked-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://lore.kernel.org/r/20251013-opp_pcie-v5-1-eb64db2b4bd3@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The commit a41d23142d ("arm64: dts: qcom: x1e80100-dell-xps13-9345:
Add missing pinctrl for eDP HPD") has applied this change to a very
similar machine, so apply it here too.
This allows us not to rely on the boot firmware to set up the pinctrl
for the eDP HPD line of the internal display.
Fixes: e7733b4211 ("arm64: dts: qcom: Add support for Dell Inspiron 7441 / Latitude 7455")
Reviewed-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org>
Signed-off-by: Val Packett <val@packett.cool>
Link: https://lore.kernel.org/r/20251012224706.14311-1-val@packett.cool
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Add device tree for Huawei MateBook E 2019, which is a 2-in-1 tablet based
on Qualcomm's sdm850 platform.
Supported features:
- ADSP, CDSP and SLPI
- Volume Key
- Power Key
- Tablet Mode Switching
- Display
- Touchscreen
- Stylus
- WiFi [1]
- Bluetooth [2]
- GPU
- USB
- Keyboard
- Touchpad
- UFS
- SD Card
- Audio (right internal mic and headphone mic not working)
- Mobile Network
[1] WiFi probing log:
ath10k_snoc 18800000.wifi: Adding to iommu group 12
ath10k_snoc 18800000.wifi: qmi chip_id 0x30214 chip_family 0x4001 board_id 0xff soc_id 0x40030001
ath10k_snoc 18800000.wifi: qmi fw_version 0x2009856b fw_build_timestamp 2018-07-19 12:28 fw_build_id QC_IMAGE_VERSION_STRING=WLAN.HL.2.0-01387-QCAHLSWMTPLZ-1
ath10k_snoc 18800000.wifi: wcn3990 hw1.0 target 0x00000008 chip_id 0x00000000 sub 0000:0000
ath10k_snoc 18800000.wifi: kconfig debug 1 debugfs 1 tracing 1 dfs 0 testmode 0
ath10k_snoc 18800000.wifi: firmware ver api 5 features wowlan,mgmt-tx-by-reference,non-bmi crc32 b3d4b790
ath10k_snoc 18800000.wifi: htt-ver 3.53 wmi-op 4 htt-op 3 cal file max-sta 32 raw 0 hwcrypto 1
ath10k_snoc 18800000.wifi: invalid MAC address; choosing random
[2] Bluetooth probing log:
Bluetooth: hci0: setting up wcn399x
Bluetooth: hci0: QCA Product ID :0x0000000a
Bluetooth: hci0: QCA SOC Version :0x40010214
Bluetooth: hci0: QCA ROM Version :0x00000201
Bluetooth: hci0: QCA Patch Version:0x00000001
Bluetooth: hci0: QCA controller version 0x02140201
Bluetooth: hci0: QCA Downloading qca/crbtfw21.tlv
Bluetooth: hci0: QCA Downloading qca/crnv21.bin
Bluetooth: hci0: QCA setup on UART is completed
Features not supported yet:
- Panel Backlight
- Lid Detection
- Battery
- EFI Variable Access
- Cameras
1. Panel backlight, lid detection and battery will be supported with the
EC driver upstreamed.
2. EFI variables can only be read with the QSEECOM driver, and will be
enabled when the driver is fixed.
3. Cameras are tested to work with modified downstream driver, and once
drivers for these camera modules are included in the tree, cameras can
be enabled.
Features won't be supported:
- External Display
- Fingerprint
1. To make external display work, more reverse engineering may be required,
but it's beyond my ability.
2. Fingerprint is controlled by TrustZone, meaning direct access to it
isn't possible.
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Signed-off-by: Jingzhou Zhu <newwheatzjz@zohomail.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251008130052.11427-3-newwheatzjz@zohomail.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The UFS device is ovbiously dma coherent like the other IOMMU devices
like usb, mmc, ... let's fix this by adding the flag.
To be sure an extensive test has been performed to be sure it's
safe, as downstream uses this flag for UFS as well.
As an experiment, I checked how the dma-coherent could impact
the UFS bandwidth, and it happens the max bandwidth on cached
write is slighly highter (up to 10%) while using less cpu time
since cache sync/flush is skipped.
Fixes: 10e0246712 ("arm64: dts: qcom: sm8650: add interconnect dependent device nodes")
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20251007-topic-sm8650-upstream-ufs-dma-coherent-v1-1-f3cfeaee04ce@linaro.org
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Radxa Dragon Q6A is a single board computer, based on the Qualcomm
QCS6490 platform.
Features enabled and working:
- Configurable I2C/SPI/UART from 40-Pin GPIO
- Three USB-A 2.0 ports
- RTL8111K Ethernet connected to PCIe0
- eMMC module
- SD card
- M.2 M-Key 2230 PCIe 3.0 x2
- Headphone jack
- Onboard thermal sensors
- QSPI controller for updating boot firmware
- ADSP remoteproc (Type-C and charging features disabled in firmware)
- CDSP remoteproc (for AI applications using QNN)
- Venus video encode and decode accelerator
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Signed-off-by: Xilin Wu <sophon@radxa.com>
Link: https://lore.kernel.org/r/20250929-radxa-dragon-q6a-v5-2-aa96ffc352f8@radxa.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The laptop comes in two variants:
* UX3407RA, higher end, FHD+ OLED or WOXGA+ OLED panels
* UX3407QA, lower end, FHD+ OLED or FHD+ LCD panels
Even though all three panels work with "edp-panel", unfortunately the
brightness adjustmenet of LCD panel is PWM based, requiring a dedicated
device-tree. Convert "x1p42100-asus-zenbook-a14.dts" into ".dtsi" to
allow for this split, introduce new LCD variant. Leave current variant
without postfix and with the unchanged model name, as some distros
(eg. Ubuntu) rely on this for automatic device-tree detection during
kernel installation/upgrade.
As dedicated device-tree is required, update compatibles of OLED
variants to correct ones. Keep "edp-panel" as fallback, since it is
enough to make the panels work.
While at it moving .dts, .dtsi around, drop 'model' from the top level
x1-asus-zenbook-a14.dtsi as well.
Co-developed-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
Signed-off-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Aleksandrs Vinarskis <alex@vinarskis.com>
Link: https://lore.kernel.org/r/20250927-zenbook-improvements-v3-2-d46c7368dc70@vinarskis.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Traditionally, firmware loading for Serial Engines (SE) in the QUP hardware
of Qualcomm SoCs has been managed by TrustZone (TZ). While this approach
ensures secure SE assignment and access control, it limits flexibility for
developers who need to enable various protocols on different SEs.
Add the firmware-name property to QUPv3 nodes in the device tree to enable
firmware loading from the Linux environment. Handle SE assignments and
access control permissions directly within Linux, removing the dependency
on TrustZone.
Signed-off-by: Viken Dadhaniya <viken.dadhaniya@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250925042605.1388951-1-viken.dadhaniya@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Traditionally, firmware loading for Serial Engines (SE) in the QUP hardware
of Qualcomm SoCs has been managed by TrustZone (TZ). While this approach
ensures secure SE assignment and access control, it limits flexibility for
developers who need to enable various protocols on different SEs.
Add the firmware-name property to QUPv3 nodes in the device tree to enable
firmware loading from the Linux environment. Handle SE assignments and
access control permissions directly within Linux, removing the dependency
on TrustZone.
Signed-off-by: Viken Dadhaniya <viken.dadhaniya@oss.qualcomm.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Acked-by: Mukesh Kumar Savaliya <mukesh.savaliya@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250924035409.3976652-1-viken.dadhaniya@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Traditionally, firmware loading for Serial Engines (SE) in the QUP hardware
of Qualcomm SoCs has been managed by TrustZone (TZ). While this approach
ensures secure SE assignment and access control, it limits flexibility for
developers who need to enable various protocols on different SEs.
Add the firmware-name property to QUPv3 nodes in the device tree to enable
firmware loading from the Linux environment. Handle SE assignments and
access control permissions directly within Linux, removing the dependency
on TrustZone.
Signed-off-by: Viken Dadhaniya <viken.dadhaniya@oss.qualcomm.com>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250923161107.3541698-1-viken.dadhaniya@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Update min/max voltage settings for regulators below to align
with the HW specification
vreg_l3b_0p504
vreg_l6b_1p2
vreg_l11b_1p504
vreg_l14b_1p08
vreg_l16b_1p1
vreg_l17b_1p7
vreg_s1c_2p19
vreg_l8c_1p62
vreg_l9c_2p96
vreg_l12c_1p65.
While at it, remove RPMH regulator rails (listed below) as
these are not to be used on APPS, and any client accidently
voting on it can potentially cause issues.
vreg_s2b_0p876
vreg_s2c_0p752
vreg_s5c_0p752
vreg_s7c_0p752
vreg_s10c_0p752
vreg_l4b_0p752
vreg_l5b_0p752.
Signed-off-by: Rakesh Kota <rakesh.kota@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250919-b4-rb3gen2-update-regulator-v1-1-1ea9e70d01cb@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Currently, asm/percpu.h is directly or indirectly included by
some assembly files on x86. Some of them (e.g., checksum_32.S)
are also used on um. But x86 and um provide different versions
of asm/percpu.h -- um uses asm-generic/percpu.h directly.
When SMP is enabled, asm-generic/percpu.h will introduce C code
that cannot be assembled. Since asm-generic/percpu.h currently
is not designed for use in assembly, and these assembly files
do not actually need asm/percpu.h on um, let's add the assembly
guard in asm-generic/percpu.h to fix this issue.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20251027001815.1666872-8-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Add initial symmetric multi-processing (SMP) support to UML. With
this support enabled, users can tell UML to start multiple virtual
processors, each represented as a separate host thread.
In UML, kthreads and normal threads (when running in kernel mode)
can be scheduled and executed simultaneously on different virtual
processors. However, the userspace code of normal threads still
runs within their respective single-threaded stubs.
That is, SMP support is currently available both within the kernel
and across different processes, but still remains limited within
threads of the same process in userspace.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20251027001815.1666872-6-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With SMP and NO_HZ enabled, the CPU may still need to sleep even
if the timer is disarmed. Switch to deciding whether to sleep based
on pending resched. Additionally, because disabling IRQs does not
block SIGALRM, it is also necessary to check for any pending timer
alarms. This is a preparation for adding SMP support.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20251027001815.1666872-4-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, initial_thread_cb() temporarily disables kmalloc when
it invokes the callback, allowing the callback to bypass kmalloc.
This is unnecessary for the current users of initial_thread_cb(),
and we should avoid memory allocations that are not under the
control of the UML kernel. Therefore, let's stop temporarily
disabling kmalloc in initial_thread_cb().
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20251027001815.1666872-2-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The file-based iomem emulation was introduced to support writing
paravirtualized drivers based on emulated iomem regions. However,
the only driver that makes use of it is an example driver called
mmapper, which was written over two decades ago.
We now have several modern device emulation mechanisms, such as
vhost-user-based virtio-uml. Remove the file-based iomem emulation
support to reduce the maintenance burden.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20251027054519.1996090-5-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Although UML_ROUND_UP() is defined in a shared header file, it
depends on the PAGE_SIZE and PAGE_MASK macros, so it can only be
used in kernel code. Considering its name is not very clear and
its functionality is the same as PAGE_ALIGN(), replace its usages
with a direct call to PAGE_ALIGN() and remove it.
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
Link: https://patch.msgid.link/20251027054519.1996090-4-tiwei.bie@linux.dev
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
PCIe ECAM(Enhanced Configuration Access Mechanism) feature requires
maximum of 256MB configuration space.
To enable this feature increase configuration space size to 256MB. If
the config space is increased, the BAR space needs to be truncated as
it resides in the same location. To avoid the bar space truncation move
config space, DBI, ELBI, iATU to upper PCIe region and use lower PCIe
iregion entirely for BAR region.
This depends on the commit: '10ba0854c5e6 ("PCI: qcom: Disable mirroring
of DBI and iATU register space in BAR region")'
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Signed-off-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250828-ecam_v4-v8-1-92a30e0fa02d@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The HOSTFS_ATTR_* values were meant to be standalone for
communication between hostfs's kernel and user code parts.
However, it's easy to forget that HOSTFS_ATTR_* should be
used even on the kernel side, and that wasn't consistently
done. As a result, the values need to match ATTR_* values,
which is not useful to maintain by hand. Instead, generate
them via asm-offsets like other constants that UML needs
in user-side code that aren't otherwise available in any
header files that can be included there.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://patch.msgid.link/20251007071452.367989-3-johannes@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This is currently done in uml_finishsetup(), but e.g. with
KCOV enabled we'll crash because some init code can call
into e.g. memparse(), which has coverage annotations, and
then the checks in check_kcov_mode() crash because current
is NULL.
Simply initialize the cpu_tasks[] array statically, which
fixes the crash. For the later SMP work, it seems to have
not really caused any problems yet, but initialize all of
the entries anyway.
Link: https://patch.msgid.link/20250924113214.c76cd74d0583.I974f691ebb1a2b47915bd2b04cc38e5263b9447f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The Genio 1200 EVK board has the big and little CPU clusters fed by the
same regulators as MT8195-Cherry boards, so describe them in the same
way as commit 17b33dd9e4 ("arm64: dts: mediatek: cherry: Describe CPU
supplies").
This prevents the system from hanging during boot in the case that the
cpufreq-mediatek-hw driver tries to probe before the drivers for the
regulators have probed (which happens when using the current defconfig).
Fixes: f2b543a191 ("arm64: dts: mediatek: add device-tree for Genio 1200 EVK board")
Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
i.MX7ULP pinctrl don't support bias-pull-up property. So remove it to fix
below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx7ulp-evk.dtb: pinctrl@40ac0000 (fsl,imx7ulp-iomuxc1): lpuart4grp: 'bias-pull-up' does not match any of the regexes: '^pinctrl-[0-9]+$'
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Remove undocumented clock-names for ov5642 to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6q-sabresd.dtb: camera@3c (ovti,ov5642): 'clock-names' does not match any of the regexes: '^pinctrl-[0-9]+$'
from schema $id: http://devicetree.org/schemas/media/i2c/ovti,ov5642.yaml#
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add device_type for memory node to fix below CHECK_DTB warnings:
arch/arm/boot/dts/nxp/imx/imx6dl-b105pv2.dtb: / (ge,imx6dl-b105pv2): memory@10000000: 'device_type' is a required property
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add bus type for parallel ov5640 to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6q-sabrelite.dtb: camera@42 (ovti,ov5642): port:endpoint:hsync-active: False schema does not allow 1
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Fix typo interrupt, which should be 'interrupts'. Fix below CHECK_DTBS
warnings.
arch/arm/boot/dts/nxp/imx/imx6dl-skov-revc-lt2.dtb: switch@0 (microchip,ksz8873): Unevaluated properties are not allowed ('interrupt', 'pinctrl-names' were unexpected)
from schema $id: http://devicetree.org/schemas/net/dsa/microchip,ksz.yaml
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Remove redundant linux,phandle to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6dl-gw560x.dtb: pmic@3c (lltc,ltc3676): regulators:sw3: Unevaluated properties are not allowed ('linux,phandle' was unexpected)
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Rename power-supply to vcc-supply for touchscreen to fix below CHECK_DTB
warnings:
arch/arm/boot/dts/nxp/imx/imx6ull-dhcom-pdk2.dtb: touchscreen@38 (edt,edt-ft5406): Unevaluated properties are not allowed ('power-supply' was unexpected)
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add power-supply for lcd panel to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6q-evi.dtb: panel (sharp,lq101k1ly04): 'power-supply' is a required property
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The same displays that can be connected directly to the
imx8mp-phyboard-pollux can also be connected to the expansion board
PEB-AV-10. For displays connected to the expansion board, a second LVDS
channel of the i.MX 8M Plus SoC is used and only a single display
connected to the SoC LVDS display bridge at a given time is supported.
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Yannic Moog <y.moog@phytec.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
An expansion board (PEB-AV-10) may be connected to the
imx8mp-phyboard-pollux. Its main purpose is to provide multimedia
interfaces, featuring a 3.5mm headphone jack, a USB-A port and LVDS as
well as backlight connectors.
Introduce the expansion board as dtsi, as it may be used standalone as
an expansion board, as well as in combination with display panels. These
display panels will include the dtsi.
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Yannic Moog <y.moog@phytec.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
imx8mp-phyboard-pollux had a display baked into its board dts file.
However this approach does not truly discribe the hardware and is not
suitable when using different displays.
Move display specific description into an overlay and add the successor
display for the phyboard-pollux as an additional overlay.
Reviewed-by: Teresa Remmet <t.remmet@phytec.de>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Yannic Moog <y.moog@phytec.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Change license from GPL-2.0 to GPL-2.0-or-later OR MIT.
Use syntax as defined in the SPDX standard. Also remove individual
authorship.
Acked-by: Teresa Remmet <t.remmet@phytec.de>
Signed-off-by: Yannic Moog <y.moog@phytec.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
In normal case, there is no need to invoke mutex_destroy in error path,
but it is useful when CONFIG_DEBUG_MUTEXES, so use devm_mutex_init().
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The SCU driver is critical for system working properly, it should
never be removed and binded again. So suppress the bind attrs
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
IMX_SC_ERR_NOTFOUND should map with -ENOENT, not -EEXIST. -ENODEV makes
more sense for IMX_SC_ERR_NOPOWER, and -ECOMM makes more sense for
IMX_SC_ERR_IPC.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Since its introduction, this symbol has not been used by any loadable
modules. It remains only referenced within imx-scu.c, which is always built
together with imx-scu-irq.c
As such, exporting imx_scu_enable_general_irq_channel is unnecessary, so
remove the export.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
mu_resource_id is referenced in imx_scu_irq_get_status() and
imx_scu_irq_group_enable() which could be used by other modules, so
need to set correct value before using imx_sc_irq_ipc_handle in
SCU API call.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
With mailbox channel requested, there is possibility that interrupts may
come in, so need to make sure the workqueue is initialized before
the queue is scheduled by mailbox rx callback.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The IRQ mailbox is an optional channel and does not need to be kept until
driver removal when an error occurs. Free the allocated memory in the
error path.
Add 'goto free_cl' when mbox_request_channel_byname() fails, to keep free
at one place.
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
imx_scu_enable_general_irq_channel() calls of_parse_phandle_with_args(),
but does not release the OF node reference. Add a of_node_put() call
to release the reference.
Fixes: 851826c756 ("firmware: imx: enable imx scu general irq function")
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Rename i2c<n>mux to i2c to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6q-nitrogen6_max.dtb: i2c2mux (i2c-mux-gpio): $nodename:0: 'i2c2mux' does not match '^(i2c-?)?mux'
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Remove extra space in " jedec,spi-nor" to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6ull-phytec-tauri-emmc.dtb: /soc/bus@2000000/spba-bus@2000000/spi@2008000/flash@2: failed to match any schema with compatible: [' jedec,spi-nor']
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add device_type, bus-range, ranges for pci nodes. Rename intel,i211 to
ethernet to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6q-utilite-pro.dtb: pcie@0,0: 'device_type' is a required property
from schema $id: http://devicetree.org/schemas/pci/pci-bus-common.yaml#
arch/arm/boot/dts/nxp/imx/imx6q-utilite-pro.dtb: pcie@0,0: 'ranges' is a required property
from schema $id: http://devicetree.org/schemas/pci/pci-bus-common.yaml
arm/boot/dts/nxp/imx/imx6q-utilite-pro.dtb: pcie@0,0: 'intel,i211@pcie0,0' does not match any of the regexes: '.*-names$', '.*-supply$', '^#.*-cells$', '^#[a-zA-Z0-9,+\\-._]{0,63}$', '^[a-zA-Z0-9][a-zA-Z0-9#,+\\-._]{0,63}$', '^[a-zA-Z0-9][a-zA-Z0-9,+\\-._]{0,63}@[0-9a-fA-F]+(,[0-9a-fA-F]+)*$', '^__.*__$', 'pinctrl-[0-9]+'
from schema $id: http://devicetree.org/schemas/dt-core.yaml
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Remove redundant pinctrl-name since pinctrl-0 doesn't exist to fix below
CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6q-pistachio.dtb: pinctrl@20e0000 (fsl,imx6q-iomuxc): 'pinctrl-0' is a dependency of 'pinctrl-names'
from schema $id: http://devicetree.org/schemas/pinctrl/pinctrl-consumer.yaml
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Rename ngpio to ngpios to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6dl-ts4900.dtb: gpio@28 (technologic,ts4900-gpio): 'ngpio' does not match any of the regexes: '^(hog-[0-9]+|.+-hog(-[0-9]+)?)$', '^pinctrl-[0-9]+$'
from schema $id: http://devicetree.org/schemas/trivial-gpio.yaml
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
rename m95m02 to eeprom to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6q-evi.dtb: m95m02@1 (st,m95m02): $nodename: 'anyOf' conditional failed, one must be fixed:
'm95m02@1' does not match '^eeprom@[0-9a-f]{1,2}$'
'm95m02@1' does not match '^fram@[0-9a-f]{1,2}$'
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Rename touch-thermal0 to touch-0-thermal to fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6dl-plym2m.dtb: thermal-zones: 'touch-thermal0', 'touch-thermal1' do not match any of the regexes: '^[a-zA-Z][a-zA-Z0-9\\-]{1,10}-thermal$', 'pinctrl-[0-9]+'
from schema $id: http://devicetree.org/schemas/thermal/thermal-zones.yaml
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Rename stmpgpio to gpio and add gpio-controller and interrupt-controller.
Rename stmpe_adc to adc. Move interrupt-controller and gpio-controller
under gpio node.
to fix below CHECK_DTBS warnings:
/home/lizhi/source/linux-upstream-pci/arch/arm/boot/dts/nxp/imx/imx6q-dmo-edmqmx6.dtb: stmpe1601@40 (st,stmpe1601): gpio: 'interrupt-controller' is a required property
from schema $id: http://devicetree.org/schemas/mfd/st,stmpe.yaml#
/home/lizhi/source/linux-upstream-pci/arch/arm/boot/dts/nxp/imx/imx6q-dmo-edmqmx6.dtb: gpio (st,stmpe-gpio): 'interrupt-controller' is a required property
from schema $id: http://devicetree.org/schemas/gpio/st,stmpe-gpio.yaml#
/home/lizhi/source/linux-upstream-pci/arch/arm/boot/dts/nxp/imx/imx6q-dmo-edmqmx6.dtb: stmpe1601@44 (st,stmpe1601): gpio: 'interrupt-controller' is a required property
from schema $id: http://devicetree.org/schemas/mfd/st,stmpe.yaml#
/home/lizhi/source/linux-upstream-pci/arch/arm/boot/dts/nxp/imx/imx6q-dmo-edmqmx6.dtb: gpio (st,stmpe-gpio): 'interrupt-controller' is a required property
from schema $id: http://devicetree.org/schemas/gpio/st,stmpe-gpio.yaml#
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Describe the two SFP+ cages present on the LS1046AQDS board and their
associated I2C buses and GPIO lines.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Describe the two SFP+ cages found on the LX2160ARDB board with their
respective I2C buses and GPIO lines.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The QIXIS FPGA node is extended so that it describes the GPIO controller
responsible for all the status presence lines on both SFP+ cages as well
as the IO SLOTs present on the board.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Describe the FPGA present on the LX2160ARDB board as a simple-mfd I2C
device. The FPGA presents registers that deal with power-on-reset
timing, muxing, SFP cage monitoring and control etc.
Also add the two GPIO controllers responsible for monitoring and
controlling the SFP+ cages used for MAC5 and MAC6.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Extend the list of accepted child nodes with the QIXIS FPGA based GPIO
controller and explicitly list its compatible string
fsl,ls1046aqds-fpga-gpio-stat-pres2 as the only one accepted.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Extend the list of supported compatible strings with fsl,lx2160ardb-fpga.
Since the register map exposed by the LX2160ARDB's FPGA also contains
two GPIO controllers, accept the necessary GPIO pattern property.
At the same time, add the #address-cells and #size-cells properties as
valid ones so that the child nodes of the fsl,lx2160ardb-fpga node are
addressable.
This is necessary because when defining child devices such as the GPIO
controller described in the added example, the child device needs a the
reg property to properly identify its register location in the parent
I2C device address space.
Impose this restriction for the new compatible through an if-statement.
The feature set exposed by these QIXIS FPGA devices is highly dependent
on the board type, meaning that even though the FPGA found on the
LX2160AQDS board (fsl,lx2160aqds-fpga) works in the same way in terms of
access over I2C as the one found on the LX2160ARDB (fsl,lx2160ardb-fpga
added here), the register map inside the device space is different since
there are different on-board devices to be controlled.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add devicetree for the Protonic PRT8ML.
The board is similar to the Protonic PRT8MM but i.MX8MP based.
Some features have been removed as the drivers haven't been mainlined
yet or other issues where encountered:
- Stepper motors to be controlled using motion control subsystem
- MIPI/DSI to eDP USB alt-mode
- Onboard T1 ethernet (10BASE-T1L+PoDL, 100BASE-T1+PoDL, 1000BASE-T1)
Signed-off-by: David Jander <david@protonic.nl>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Jonas Rebmann <jre@pengutronix.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Idle-inject up to 50% of all cpu's time in order to help cpufreq
to keep the temperature below the trip points.
Signed-off-by: Martin Kepplinger-Novaković <martink@posteo.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The thermal framework can use the cpu-idle-states as
described for imx8mp as an alternative or in parallel to
cpufreq.
Add the DT node to the cpu so the cooling devices will be present
and the thermal zone descriptions can use them.
Signed-off-by: Martin Kepplinger-Novaković <martink@posteo.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The RTC inside mc34708 is supported by RTC_DRV_MC13XXX since v3.6-rc1.
Enable the PMIC RTC on the imx53-qsrb. Without a battery the RTC may be
powered via the micro-USB connector when main 5V power is not available.
Signed-off-by: Alexander Kurz <akurz@blala.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Disable Hysteresis Enable Field in pad ctl register for USB OC pin
as this is more appropriate for the signal form in our case.
Signed-off-by: Teresa Remmet <t.remmet@phytec.de>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Move nmi_watchdog into the watchdog_sysctls array to prevent it from
unnecessary modification. This move effectively moves it inside the
.rodata section.
Initially moved out into its own non-const array in commit 9ec272c586
("watchdog/hardlockup: keep kernel.nmi_watchdog sysctl as 0444 if probe
fails"), which made it writable only when watchdog_hardlockup_available
was true. Moving it back to watchdog_sysctl keeps this behavior as
writing to nmi_watchdog still fails when watchdog_hardlockup_available
is false.
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
In the commit referenced by the Fixes tag, devm_clk_get_enabled() was
introduced to replace devm_clk_get() and clk_prepare_enable(). While
the clk_disable_unprepare() call in the error path was correctly
removed, the one in the remove function was overlooked, leading to a
double disable issue.
Remove the redundant clk_disable_unprepare() call from gsbi_remove()
to fix this issue. Since all resources are now managed by devres
and will be automatically released, the remove function serves no purpose
and can be deleted entirely.
Fixes: 489d7a8cc2 ("soc: qcom: use devm_clk_get_enabled() in gsbi_probe()")
Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://lore.kernel.org/stable/20251020160215.523-1-vulab%40iscas.ac.cn
Link: https://lore.kernel.org/r/20251020160215.523-1-vulab@iscas.ac.cn
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
System On Chip Control Processor (SOCCP) is a subsystem that can have
battery management firmware running on it to support Type-C/PD and
battery charging. SOCCP does not have multiple PDs and hence PDR is not
supported. So, if the subsystem comes up/down, rpmsg driver would be
probed or removed. Use that for notifying clients of pmic_glink for
PDR events.
Add support for battery management FW running on SOCCP by adding the
"PMIC_RTR_SOCCP_APPS" channel name to the rpmsg_match list and
updating notify_clients logic.
Signed-off-by: Anjelique Melendez <anjelique.melendez@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250919175025.2988948-1-anjelique.melendez@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
NSS clock controller provides the clocks and resets to the networking
blocks such as PPE (Packet Process Engine) and UNIPHY (PCS) on IPQ5424
devices.
Add support for the compatible string "qcom,ipq5424-nsscc" based on the
existing IPQ9574 NSS clock controller Device Tree binding. Additionally,
update the clock names for PPE and NSS for newer SoC additions like
IPQ5424 to use generic and reusable identifiers "nss" and "ppe" without
the clock rate suffix.
Also add master/slave ids for IPQ5424 networking interfaces, which is
used by nss-ipq5424 driver for providing interconnect services using
icc-clk framework.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Link: https://lore.kernel.org/r/20251014-qcom_ipq5424_nsscc-v7-7-081f4956be02@quicinc.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The Networking Subsystem (NSS) clock controller acts as both a clock
provider and an interconnect provider. The #interconnect-cells property
is needed in the Device Tree Source (DTS) to ensure that client drivers
such as the PPE driver can correctly acquire ICC clocks from the NSS ICC
provider.
Add the #interconnect-cells property to the IPQ9574 Device Tree binding
example to complete it.
Fixes: 28300ecedc ("dt-bindings: clock: Add ipq9574 NSSCC clock and reset definitions")
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Luo Jie <quic_luoj@quicinc.com>
Link: https://lore.kernel.org/r/20251014-qcom_ipq5424_nsscc-v7-2-081f4956be02@quicinc.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The memory map lists each clock module unit as having a size of
0x10000. Additionally there are some undocumented registers in this region
that need to be used for automatic clock gating mode. Some of those
registers also need to be saved/restored on suspend & resume.
Fixes: 86124c7668 ("arm64: dts: exynos: gs101: enable cmu-hsi2 clock controller")
Fixes: 4982a4a209 ("arm64: dts: exynos: gs101: enable cmu-hsi0 clock controller")
Fixes: 7d66d98b5b ("arm64: dts: exynos: gs101: enable cmu-peric1 clock controller")
Fixes: e62c706f3a ("arm64: dts: exynos: gs101: enable cmu-peric0 clock controller")
Fixes: ea89fdf24f ("arm64: dts: exynos: google: Add initial Google gs101 SoC support")
Signed-off-by: Peter Griffin <peter.griffin@linaro.org>
Reviewed-by: André Draszik <andre.draszik@linaro.org>
Link: https://patch.msgid.link/20251013-automatic-clocks-v1-4-72851ee00300@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
"mss-top-sysreg" contains clocks, pinctrl, resets, an interrupt controller
and more. At this point, only the reset controller child is described as
that's all that is described by the existing bindings.
The clock controller already has a dedicated node, and will retain it as
there are other clock regions, so like the mailbox, a compatible-based
lookup of the syscon is sufficient to keep the clock driver working as
before, so no child is needed. There's also an interrupt multiplexing
service provided by this syscon, for which there is work in progress at
[1].
Link: https://lore.kernel.org/linux-gpio/20240723-uncouple-enforcer-7c48e4a4fefe@wendy/ [1]
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
According to Documentation/devicetree/bindings/spi/spi-nxp-fspi.yaml,
imx93/imx95 should use it's own compatible string and fallback
compatible with imx8mm.
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Haibo Chen <haibo.chen@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add fan-supply regulator to pwm-fan node to specify power source.
Fixes: e3e8b199af ("arm64: dts: imx95: Add imx95-15x15-evk support")
Signed-off-by: Joy Zou <joy.zou@nxp.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
SDHC1 on the GW702x SOM routes to a connector for use on a baseboard
and as such are defined in the baseboard device-trees.
Remove it from the gw702x SOM device-tree.
Fixes: 0d5b288c21 ("arm64: dts: freescale: Add imx8mp-venice-gw7905-2x")
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
UART1 and UART3 go to a connector for use on a baseboard and as such are
defined in the baseboard device-trees. Remove them from the gw702x SOM
device-tree.
Fixes: 0d5b288c21 ("arm64: dts: freescale: Add imx8mp-venice-gw7905-2x")
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The SDHC1 interface is not used on the imx8mm-venice-gw72xx. Remove the
unused pinctrl_usdhc1 iomux node.
Fixes: 6f30b27c5e ("arm64: dts: imx8mm: Add Gateworks i.MX 8M Mini Development Kits")
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The i.MX8M Mini FEC RGMII tracelength is less than 1in and does not
require a x6 drive strength. Reduce the CLK drive strength to x1 for
lower emissions. Additionally since TXC is not a high frequency clock,
use slow slew rate (FSEL=0) for lower emmissions and improved signal
quality.
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The i.MX8M Plus EQOS RGMII tracelength is less than 1in and does not
require a x6 drive strength. Reduce the CLK drive strength to x1 for
lower emissions. Additionally since TXC is not a high frequency clock,
use slow slew rate (FSEL=0) for lower emmissions and improved signal
quality.
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Disable the unused refclk output for the TI DP83867 PHY used on
Gateworks Venice boards.
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The board (micro)controller[1] is responsible for functions such as
power supply sequencing, SoC reset as well as serial/MAC address storage,
bootcount and scratch registers.
There is currently no Linux kernel driver for this controller, however,
there is a driver in U-Boot which utilises this binding.
[1] https://ten64doc.traverse.com.au/hardware/microcontroller/
Signed-off-by: Mathew McBride <matt@traverse.com.au>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add device tree binding for the board (micro)controller on Ten64 family
boards[1].
The schema is simple and is (presently) only consumed by U-Boot, but it
is possible nvmem, watchdog and other features could be described in
the future, as well as extension to future Traverse boards.
[1] https://ten64doc.traverse.com.au/hardware/microcontroller/
Signed-off-by: Mathew McBride <matt@traverse.com.au>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Prepare for Orange Pi RV using jh7110-common.dtsi having GPIO62 assignment
different than mmc0 reset by splitting this out into each board dts.
Signed-off-by: E Shattow <e@freeshell.de>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Add the approach to set up a combination of Enclustra's SoM on a carrier
board and corresponding boot-mode as single device-tree target.
Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Remove the binding for the generic Mercury+ AA1 on PE1 carrier board.
The removed Mercury+ AA1 on PE1 carrier board is just a particular
setup case, which is actually replaced by the set of generic Mercury+
AA1 combinations patch.
In other words a combination of a Mercury+ AA1 on a PE1 base board,
with boot mode SD card is already covered by the generic AA1
combinations. There is no further reason to keep this particular case
now in a redundantly. Thus the redundant DT setup is removed.
Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Remove the older socfpga_arria10_mercury_pe1.dts, since it is duplicate,
the hardware is covered by the combination of Enclustra's .dtsi files.
The older .dts was limited to only the case of having an Enclustra
Mercury+ AA1 on a Mercury+ PE1 base board, booting from sdmmc. This
functionality is provided also by the generic Enclustra dtsi and dts
files, in particular socfpga_arria10_mercury_aa1_pe1_sdmmc.dts. Since
both .dts files cover the same, the older one is to e replaced in
favor of the more modularized approach.
Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Acked-by: Steffen Trumtrar <s.trumtrar@pengutronix.de>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Update binding with combined .dts for the Mercury+ PE1, PE3 and ST1
carrier boards with the Mercury+ AA1 SoM.
Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Introduce support for Enclustra's Mercury+ AA1 SoM, based on Intel
Arria10. This is a flexible approach to allow for combining SoM
with carrier board .dtsi and boot-mode .dtsi in a device-tree file.
Signed-off-by: Andreas Buerkler <andreas.buerkler@enclustra.com>
Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add generic Enclustra base-board support for the Mercury+ PE1, the
Mercury+ PE3 and the Mercury+ ST1 board. The carrier boards can be
freely combined with the SoMs Mercury+ AA1, Mercury SA1 and
Mercury+ SA2.
Signed-off-by: Andreas Buerkler <andreas.buerkler@enclustra.com>
Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add generic boot-mode support to Enclustra Arria10 and Cyclone5 boards.
Some Enclustra carrier boards need hardware adjustments specific to the
selected boot-mode.
Enclustra's Arria10 SoMs allow for booting from different media. By
muxing certain IO pins, the media can be selected. This muxing can be
done by gpios at runtime e.g. when flashing QSPI from off the
bootloader. But also to have statically certain boot media available,
certain adjustments to the DT are needed:
- SD: QSPI must be disabled
- eMMC: QSPI must be disabled, bus width can be doubled to 8 byte
- QSPI: any mmc is disabled, QSPI then defaults to be enabled
The boot media must be accessible to the bootloader, e.g. to load a
bitstream file, but also to the system to mount the rootfs and to use
the specific performance.
Signed-off-by: Andreas Buerkler <andreas.buerkler@enclustra.com>
Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Rename guest_test_{phys,virt}_mem to g{p,v}a in the pre-fault memory test
to shorten line lengths and to use standard terminology.
Opportunsitically use "base_gva" in the guest code instead of "base_gpa"
to match the host side code, which now passes in "gva" (and because
referencing the virtual address avoids having to know that the data is
identity mapped).
No functional change intended.
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20251007224515.374516-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Running mmu_stress_test on a system with only one CPU is not a recipe for
success. However, there's no clear-cut reason why it absolutely
shouldn't work, so the test shouldn't completely reject such a platform.
At present, the *3/4 calculation will return zero on these platforms and
the test fails. So, instead just skip that calculation.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/r/20251007-b4-kvm-mmu-stresstest-1proc-v1-1-8c95aa0e30b6@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Agilex5 includes an ARM SMMU v3 (System Memory Management Unit) to provide
address translation and memory protection for DMA-capable devices such as
PCIe, USB, and other peripherals.
This commit adds the SMMU node to the Agilex5 device tree with compatible
string "arm,smmu-v3", along with its register space and interrupts.
The SMMU is required to:
- Enable DMA address translation for devices that cannot directly access
the full physical memory space.
- Provide isolation and memory protection by restricting device access
to specific regions of memory, improving system security.
- Support virtualization use cases by enabling safe and isolated device
passthrough to guest VMs.
- Align with ARM platform architecture requirements for IOMMU support.
By describing the SMMU in the device tree, the Linux IOMMU framework
can probe and initialize it during boot. Devices in the system can then
bind to the SMMU via the `iommus` property, enabling memory translation
and protection features as expected.
The following devices are updated to reference the SMMU:
- NAND controller
- DMA controller
- SPI controller
This change is a necessary step toward full enablement high-speed
peripherals on Agilex5.
Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
Signed-off-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Agilex5 integrates an ARM SMMU v3 (System Memory Management Unit) with
dedicated Translation Buffer Units (TBUs) assigned to various peripherals,
including the Synopsys DesignWare AXI DMA controller.
Each TBU handles address translation for its associated device by mapping
stream IDs to memory access permissions and virtual-to-physical address
mappings via the SMMU core.
The DesignWare AXI DMAC instances on Agilex5 are connected to their
respective TBUs. These TBUs forward DMA transactions from the controller
through the SMMU, enabling IOMMU-based features such as:
- Address translation for DMA operations
- Isolation and protection of memory regions accessed by the DMA controller
- Support for secure and virtualized environments through enforced access
control
To support this configuration, the `iommus` property must be added to the
binding schema for `snps,dw-axi-dmac`. This allows the device tree to
associate each DMA controller with the correct SMMU stream ID, enabling
the Linux IOMMU framework to configure translation contexts at runtime.
This change documents the IOMMU support for the DMA controller on Agilex5
and allows proper integration with the SMMUv3 hardware.
Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
Signed-off-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Agilex5 integrates an ARM SMMU (System Memory Management Unit) with
Translation Buffer Units (TBUs) assigned to various peripherals,
including the NAND controller.
The Cadence HP NAND controller ("cdns,hp-nfc") on Agilex5 is behind a
TBU connected to the system's SMMUv3. To support this, the controller
requires an `iommus` property in the device tree to properly configure
address translation through the IOMMU framework.
Adding the `iommus` property to the binding schema allows the OS
to associate the NAND controller with its corresponding SMMU stream ID.
This enables:
- DMA address translation between the controller and system memory
- Memory protection for NAND operations
- Proper functioning of the IOMMU framework in secure or virtualized
environments
This change documents the IOMMU integration for the NAND controller
on platforms like Agilex5 where such hardware is present.
Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
Signed-off-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add a CLASS to handle getting and putting a guest_memfd file given a
memslot to reduce the amount of related boilerplate, and more importantly
to minimize the chances of forgetting to put the file (thankfully the bug
that prompted this didn't escape initial testing).
Define a CLASS instead of using __free(fput) as _free() comes with subtle
caveats related to FILO ordering (objects are freed in the order in which
they are declared), and the recommended solution/workaround (declare file
pointers exactly when they are initialized) is visually jarring relative
to KVM's (and the kernel's) overall strict adherence to not mixing
declarations and code. E.g. the use in kvm_gmem_populate() would be:
slot = gfn_to_memslot(kvm, start_gfn);
if (!kvm_slot_has_gmem(slot))
return -EINVAL;
struct file *file __free(fput) = kvm_gmem_get_file(slot;
if (!file)
return -EFAULT;
filemap_invalidate_lock(file->f_mapping);
Note, using CLASS() still declares variables in the middle of code, but
the syntactic sugar obfuscates the declaration, i.e. hides the anomaly to
a large extent.
No functional change intended.
Link: https://lore.kernel.org/r/20251007222356.348349-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add tests for NUMA memory policy binding and NUMA aware allocation in
guest_memfd. This extends the existing selftests by adding proper
validation for:
- KVM GMEM set_policy and get_policy() vm_ops functionality using
mbind() and get_mempolicy()
- NUMA policy application before and after memory allocation
Run the NUMA mbind() test with and without INIT_SHARED, as KVM should allow
doing mbind(), madvise(), etc. on guest-private memory, e.g. so that
userspace can set NUMA policy for CoCo VMs.
Run the NUMA allocation test only for INIT_SHARED, i.e. if the host can't
fault-in memory (via direct access, madvise(), etc.) as move_pages()
returns -ENOENT if the page hasn't been faulted in (walks the host page
tables to find the associated folio)
[sean: don't skip entire test when running on non-NUMA system, test mbind()
with private memory, provide more info in assert messages]
Signed-off-by: Shivank Garg <shivankg@amd.com>
Tested-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/20251016172853.52451-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add NUMA helpers to probe for support/availability and to check if the
test is running on a multi-node system. The APIs will be used to verify
guest_memfd NUMA support.
Signed-off-by: Shivank Garg <shivankg@amd.com>
[sean: land helpers in numaif.h, add comments, tweak names]
Link: https://lore.kernel.org/r/20251016172853.52451-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the KVM's re-definitions of MPOL_xxx flags in numaif.h as they are
defined by the already-included, kernel-provided mempolicy.h. The only
reason the duplicate definitions don't cause compiler warnings is because
they are identical, but only on x86-64! The syscall numbers in particular
are subtly x86_64-specific, i.e. will cause problems if/when numaif.h is
used outsize of x86.
Opportunistically clean up the file comment as the license information is
covered by the SPDX header, the path is superfluous, and as above the
comment about the contents is flat out wrong.
Fixes: 346b59f220 ("KVM: selftests: Add missing header file needed by xAPIC IPI tests")
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lore.kernel.org/r/20251016172853.52451-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add APIs for all syscalls defined in the kernel's mm/mempolicy.c to match
those that would be provided by linking to libnuma. Opportunistically use
the recently inroduced KVM_SYSCALL_DEFINE() builders to take care of the
boilerplate, and to fix a flaw where the two existing wrappers would
generate multiple symbols if numaif.h were to be included multiple times.
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lore.kernel.org/r/20251016172853.52451-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Register handlers for signals for all selftests that are likely happen due
to test (or kernel) bugs, and explicitly fail tests on unexpected signals
so that users get a stack trace, i.e. don't have to go spelunking to do
basic triage.
Register the handlers as early as possible, to catch as many unexpected
signals as possible, and also so that the common code doesn't clobber a
handler that's installed by test (or arch) code.
Tested-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lore.kernel.org/r/20251016172853.52451-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add kvm_<sycall> wrappers for munmap(), close(), fallocate(), and
ftruncate() to cut down on boilerplate code when a sycall is expected
to succeed, and to make it easier for developers to remember to assert
success.
Implement and use a macro framework similar to the kernel's SYSCALL_DEFINE
infrastructure to further cut down on boilerplate code, and to drastically
reduce the probability of typos as the kernel's syscall definitions can be
copy+paste almost verbatim.
Provide macros to build the raw <sycall>() wrappers as well, e.g. to
replace hand-coded wrappers (NUMA) or pure open-coded calls.
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lore.kernel.org/r/20251016172853.52451-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Previously, guest-memfd allocations followed local NUMA node id in absence
of process mempolicy, resulting in arbitrary memory allocation.
Moreover, mbind() couldn't be used by the VMM as guest memory wasn't
mapped into userspace when allocation occurred.
Enable NUMA policy support by implementing vm_ops for guest-memfd mmap
operation. This allows the VMM to use mmap()+mbind() to set the desired
NUMA policy for a range of memory, and provides fine-grained control over
guest memory allocation across NUMA nodes.
Note, using mmap()+mbind() works even for PRIVATE memory, as mbind()
doesn't require the memory to be faulted in. However, get_mempolicy()
and other paths that require the userspace page tables to be populated
may return incorrect information for PRIVATE memory (though under the hood,
KVM+guest_memfd will still behave correctly).
Store the policy in the inode structure, gmem_inode, as a shared memory
policy, so that the policy is a property of the physical memory itself,
i.e. not bound to the VMA. In guest_memfd, KVM is the primary MMU and any
VMAs are secondary, i.e. using mbind() on a VMA to set policy is a means
to an end, e.g. to avoid having to add a file-based equivalent to mbind().
Similarly, retrieve the policy via mpol_shared_policy_lookup(), not
get_vma_policy(), even when allocating to fault in memory for userspace
mappings, so that the policy stored in gmem_inode is always the source of
true.
Apply policy changes only to future allocations, i.e. do not migrate
existing memory in the guest_memfd instance. This matches mbind(2)'s
default behavior, which affects only new allocations unless overridden
with MPOL_MF_MOVE/MPOL_MF_MOVE_ALL flags (which are not supported by
guest_memfd as guest_memfd memory is unmovable).
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Shivank Garg <shivankg@amd.com>
Tested-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/all/e9d43abc-bcdb-4f9f-9ad7-5644f714de19@amd.com
[sean: fold in fixup (see Link above), massage changelog]
Link: https://lore.kernel.org/r/20251016172853.52451-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a dedicated gmem_inode structure and a slab-allocated inode cache for
guest memory backing, similar to how shmem handles inodes.
This adds the necessary allocation/destruction functions and prepares
for upcoming guest_memfd NUMA policy support changes. Using a dedicated
structure will also allow for additional cleanups, e.g. to track flags in
gmem_inode instead of i_private.
Signed-off-by: Shivank Garg <shivankg@amd.com>
Tested-by: Ashish Kalra <ashish.kalra@amd.com>
[sean: s/kvm_gmem_inode_info/gmem_inode, name init_once()]
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Link: https://lore.kernel.org/r/20251016172853.52451-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
guest_memfd's inode represents memory the guest_memfd is
providing. guest_memfd's file represents a struct kvm's view of that
memory.
Using a custom inode allows customization of the inode teardown
process via callbacks. For example, ->evict_inode() allows
customization of the truncation process on file close, and
->destroy_inode() and ->free_inode() allow customization of the inode
freeing process.
Customizing the truncation process allows flexibility in management of
guest_memfd memory and customization of the inode freeing process
allows proper cleanup of memory metadata stored on the inode.
Memory metadata is more appropriately stored on the inode (as opposed
to the file), since the metadata is for the memory and is not unique
to a specific binding and struct kvm.
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
Tested-by: Ashish Kalra <ashish.kalra@amd.com>
[sean: drop helpers, open code logic in __kvm_gmem_create()]
Link: https://lore.kernel.org/r/20251016172853.52451-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rename the "kvm_gmem" structure to "gmem_file" in anticipation of using
dedicated guest_memfd inodes instead of anonyomous inodes, at which point
the "kvm_gmem" nomenclature becomes quite misleading. In guest_memfd,
inodes are effectively the raw underlying physical storage, and will be
used to track properties of the physical memory, while each gmem file is
effectively a single VM's view of that storage, and is used to track assets
specific to its associated VM, e.g. memslots=>gmem bindings.
Using "kvm_gmem" suggests that the per-VM/per-file structures are _the_
guest_memfd instance, which almost the exact opposite of reality.
Opportunistically rename local variables from "gmem" to "f", again to
avoid confusion once guest_memfd specific inodes come along.
No functional change intended.
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Tested-by: Shivank Garg <shivankg@amd.com>
Link: https://lore.kernel.org/r/20251016172853.52451-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop the local "int err" that's buried in the middle guest_memfd's user
fault handler to avoid the potential for variable shadowing, e.g. if an
"err" variable were also declared at function scope.
No functional change intended.
Link: https://lore.kernel.org/r/20251007222733.349460-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM guest_memfd wants to implement support for NUMA policies just like
shmem already does using the shared policy infrastructure. As
guest_memfd currently resides in KVM module code, we have to export the
relevant symbols.
In the future, guest_memfd might be moved to core-mm, at which point the
symbols no longer would have to be exported. When/if that happens is
still unclear.
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Shivank Garg <shivankg@amd.com>
Tested-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/20250827175247.83322-6-shivankg@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Extend __filemap_get_folio() to support NUMA memory policies by
renaming the implementation to __filemap_get_folio_mpol() and adding
a mempolicy parameter. The original function becomes a static inline
wrapper that passes NULL for the mempolicy.
This infrastructure will enable future support for NUMA-aware page cache
allocations in guest_memfd memory backend KVM guests.
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
Tested-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/20250827175247.83322-5-shivankg@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
7.43 has been assigned twice, make
KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 7.44.
Fixes: f55ce5a6cd ("KVM: arm64: Expose new KVM cap for cacheable PFNMAP")
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Replace sprintf() with snprintf() when formatting debug names to prevent
potential buffer overflow. The debug_name buffer is 16 bytes, and while
unlikely to overflow with current PIDs, using snprintf() provides proper
bounds checking.
Signed-off-by: Josephine Pfeiffer <hi@josie.lol>
[frankja@linux.ibm.com: Fixed subject prefix]
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Enable the HDMI output on the Debix SOM A board, using the HDMI encoder
present in the i.MX8MP SoC.
Enable and configure all nodes required for the HDMI port usage.
Signed-off-by: Kieran Bingham <kieran.bingham@ideasonboard.com>
Reviewed-by: Marco Felsch <m.felsch@pengutronix.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add USB vbus regulators to silence the following kernel warnings:
usb_phy_generic usbphynop1: dummy supplies not allowed for exclusive requests (id=vbus)
usb_phy_generic usbphynop2: dummy supplies not allowed for exclusive requests (id=vbus)
Because generic USB PHY driver requires exclusive vbus regulators since
commit 75fd6485cc ("usb: phy: generic: Get the vbus supply").
Signed-off-by: Primoz Fiser <primoz.fiser@norik.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Add USB vbus regulators to silence the following kernel warnings:
usb_phy_generic usbphynop1: dummy supplies not allowed for exclusive requests (id=vbus)
usb_phy_generic usbphynop2: dummy supplies not allowed for exclusive requests (id=vbus)
Because generic USB PHY driver requires exclusive vbus regulators since
commit 75fd6485cc ("usb: phy: generic: Get the vbus supply").
Signed-off-by: Primoz Fiser <primoz.fiser@norik.com>
Reviewed-by: Peng Fan <peng.fan@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
On i.MX6Q/6DL the following subnodes exist to describe the CSI port muxing:
- ipu1_csi0_mux
- ipu1_csi1_mux
- ipu2_csi0_mux
- ipu2_csi1_mux
As they were not documented, dt-schema emits warnings like:
'ipu1_csi0_mux', 'ipu1_csi1_mux' do not match any of the regexes:
'^pinctrl-[0-9]+$'
Add a top-level patternProperties entry for these CSI mux subnodes
and restrict it to i.MX6Q.
Signed-off-by: Fabio Estevam <festevam@gmail.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
This board is similar to the already upstream
imx8mp-skov-recv-tian-g07017.dts but uses a different 10" panel with a
different touch controller.
Signed-off-by: Steffen Trumtrar <s.trumtrar@pengutronix.de>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
In preparation for adding a new device tree variant with a different
panel, describe the DT compatible in the binding.
Signed-off-by: Steffen Trumtrar <s.trumtrar@pengutronix.de>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
Replace verbatim license text with a `SPDX-License-Identifier`.
The comment header mis-attributes this license to be "X11", but the
license text does not include the last line "Except as contained in this
notice, the name of the X Consortium shall not be used in advertising or
otherwise to promote the sale, use or other dealings in this Software
without prior written authorization from the X Consortium.". Therefore,
this license is actually equivalent to the SPDX "MIT" license (confirmed
by text diffing).
Cc: Andrej Rosano <andrej@inversepath.com>
Signed-off-by: Bence Csókás <csokas.bence@prolan.hu>
Acked-by: Andrej Rosano <andrej.rosano@reversec.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
The mass production lx2160 rev2 use designware PCIe Controller. Old Rev1
which use mobivel PCIe controller was not supported. Although uboot
fixup can change compatible string fsl,lx2160a-pcie to fsl,ls2088a-pcie
since 2019, it is quite confused and should correctly reflect hardware
status in dtb. Change freescale's board to use rev2's dtsi firstly.
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>
kvm_arch_vcpu_ioctl_set_fpu() always returns 0 and the local return
variable 'ret' is not used anymore. Remove it.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Since we are no longer switching from a BSCA to a ESCA we can completely
get rid of the sca_lock. The write lock was only taken for that
conversion.
After removal of the lock some local code cleanups are possible.
Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Suggested-by: Janosch Frank <frankja@linux.ibm.com>
[frankja@linux.ibm.com: Added suggested-by tag as discussed on list]
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
All modern IBM Z and Linux One machines do offer support for the
Extended System Control Area (ESCA). The ESCA is available since the
z114/z196 released in 2010.
KVM needs to allocate and manage the SCA for guest VMs. Prior to this
change the SCA was setup as Basic SCA only supporting a maximum of 64
vCPUs when initializing the VM. With addition of the 65th vCPU the SCA
was needed to be converted to a ESCA.
Instead of allocating a BSCA and upgrading it for PV or when adding the
65th cpu we can always allocate the ESCA directly upon VM creation
simplifying the code in multiple places as well as completely removing
the need to convert an existing SCA.
In cases where the ESCA is not supported (z10 and earlier) the use of
the SCA entries and with that SIGP interpretation are disabled for VMs.
This increases the number of exits from the VM in multiprocessor
scenarios and thus decreases performance.
The same is true for VSIE where SIGP is currently disabled and thus no
SCA entries are used.
The only downside of the change is that we will always allocate 4 pages
for a 248 cpu ESCA instead of a single page for the BSCA per VM.
In return we can delete a bunch of checks and special handling depending
on the SCA type as well as the whole BSCA to ESCA conversion.
With that behavior change we are no longer referencing a bsca_block in
kvm->arch.sca. This will always be esca_block instead.
By specifying the type of the sca as esca_block we can simplify access
to the sca and get rid of some helpers while making the code clearer.
KVM_MAX_VCPUS is also moved to kvm_host_types to allow using this in
future type definitions.
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Refresh the defconfig for Renesas ARM systems:
- Drop CONFIG_SCHED_MC=y (auto-enabled since commit 7bd291abe2
("sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig")),
- Disable CONFIG_SCHED_SMT (auto-enabled since commit 7bd291abe2
("sched: Unify the SCHED_{SMT,CLUSTER,MC} Kconfig")),
- Restore CONFIG_ARM_GT_INITIAL_PRESCALER_VAL=1 (default changed to
zero (auto-detect) in commit 1c4b87c921
("clocksource/drivers/arm_global_timer: Add auto-detection for
initial prescaler values")),
- Disable CONFIG_RPCSEC_GSS_KRB5 (auto-enabled since commit
d8e97cc476 ("SUNRPC: Make RPCSEC_GSS_KRB5 select CRYPTO
instead of depending on it")).
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/d0fcc82fb294021bf96f8a490234165e15aadb43.1760530468.git.geert+renesas@glider.be
Fix kernel-doc warnings by using correct kernel-doc syntax and
formatting to prevent warnings:
Warning: include/linux/firmware/qcom/qcom_tzmem.h:25 Enum value
'QCOM_TZMEM_POLICY_STATIC' not described in enum 'qcom_tzmem_policy'
Warning: ../include/linux/firmware/qcom/qcom_tzmem.h:25 Enum value
'QCOM_TZMEM_POLICY_MULTIPLIER' not described in enum 'qcom_tzmem_policy'
Warning: ../include/linux/firmware/qcom/qcom_tzmem.h:25 Enum value
'QCOM_TZMEM_POLICY_ON_DEMAND' not described in enum 'qcom_tzmem_policy'
Fixes: 84f5a7b67b ("firmware: qcom: add a dedicated TrustZone buffer allocator")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20251017191323.1820167-1-rdunlap@infradead.org
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Add device tree source describing the Tenstorrent Blackhole SoC and the
Blackhole P100 and P150 PCIe cards. There are no differences between
the P100 and P150 cards from the perspective of an OS kernel like Linux
running on the X280 cores.
There is a virtual UART implemented in OpenSBI firmware that allows a
console program on the PCIe host to communicate through shared memory
with Linux running on the Blackhole card. CONFIG_HVC_RISCV_SBI needs to
be enabled. The boot script on the host adds 'console=hvc0' so that the
full boot output appears in the console program on the host.
Link: https://github.com/tenstorrent/opensbi/
Link: https://github.com/tenstorrent/tt-bh-linux
Reviewed-by: Joel Stanley <jms@oss.tenstorrent.com>
Signed-off-by: Drew Fustini <dfustini@oss.tenstorrent.com>
Accessing non-existent PMU registers causes an SError, halting the
system.
regmap can help us with that by allowing to pass the list of valid
registers as part of the config during creation. When this driver
creates a new regmap itself rather than relying on
syscon_node_to_regmap(), it's therefore easily possible to hook in
custom access tables for valid read and write registers.
Specifying access tables avoids SErrors for invalid registers and
instead the regmap core can just return an error. Outside drivers, this
is also helpful when using debugfs to access the regmap.
Make it possible for drivers to specify read and write tables to be
used on creation of the secure regmap by adding respective fields to
struct exynos_pmu_data. Also add kerneldoc to same struct while
updating it.
Reviewed-by: Sam Protsenko <semen.protsenko@linaro.org>
Signed-off-by: André Draszik <andre.draszik@linaro.org>
Link: https://patch.msgid.link/20251009-gs101-pmu-regmap-tables-v2-1-2d64f5261952@linaro.org
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Add sysreg compatible strings for the Exynos7870 SoC. Two sysregs are
added, used for the SoC MIPI PHY's CSIS and DSIM blocks.
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Kaustabh Chakraborty <kauschluss@disroot.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Document the samsung,exynos8890-chipid compatible. The registers are
entirely compatible with "samsung,exynos4210-chipid".
Signed-off-by: Ivaylo Ivanov <ivo.ivanov.ivanov1@gmail.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Add exynos8890-pmu compatible to the bindings documentation. Since
Samsung, as usual, reuses devices from older designs, use the
samsung,exynos7-pmu compatible.
Signed-off-by: Ivaylo Ivanov <ivo.ivanov.ivanov1@gmail.com>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
With AVIC support for 4k vCPUs, the maximum supported physical ID in
x2AVIC mode is 4095. Since this is no longer fixed, introduce a variable
(x2avic_max_physical_id) to capture the maximum supported physical ID on
the current platform and use that in place of the existing macro
(X2AVIC_MAX_PHYSICAL_ID).
With AVIC support for 4k vCPUs, the AVIC Physical ID table is no
longer a single page and can occupy up to 8 contiguous 4k pages. Since
AVIC hardware accesses of the physical ID table are limited by the
physical max index programmed in the VMCB, it is sufficient to allocate
only as many pages as are required to have a physical table entry for
the max guest APIC ID. Since the guest APIC mode is not available at
this point, provision for the maximum possible x2AVIC ID. For this
purpose, add a variant of avic_get_max_physical_id() that works with a
NULL vCPU pointer and returns the max x2AVIC ID. Wrap this in a new
helper for obtaining the allocation order.
To make it easy to identify support for 4k vCPUs in x2AVIC mode, update
the message printed to the kernel log to print the maximum number of
vCPUs supported. Do this on all platforms supporting x2AVIC since it is
useful to know what is supported on a specific platform.
Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/7fc5962f6da028f7dd3c79dbbd5c574fa02c99dd.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add CPUID feature bit for x2AVIC extension that enables AMD SVM to
support up to 4096 vCPUs in x2AVIC mode. The primary change is in the
size of the AVIC Physical ID table, which can now go up to 8 contiguous
4k pages. The number of pages allocated is controlled by the maximum
APIC ID for a guest, and that controls the number of pages to allocate
for the AVIC Physical ID table. AVIC hardware is enhanced to look up
Physical ID table entries for vCPUs > 512 for locating the target APIC
backing page and the host APIC ID of the physical core on which the
guest vCPU is running.
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/e5c9c471ab99a130bf9b728b77050ab308cf8624.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
With support for 4k vCPUs in x2AVIC, the size of the AVIC Physical ID
table is expanded from a single 4k page to a maximum of 8 contiguous 4k
pages. The actual number of pages allocated depends on the maximum
possible APIC ID in the guest, which is only known by the time the first
vCPU is created. In preparation for supporting a dynamic AVIC Physical
ID table size, move its allocation to vcpu_precreate().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/7dc764e0af7f01440bbac3d9215ed174027c2384.1757009416.git.naveen@kernel.org
[sean: drop enable_apicv check from svm_vcpu_precreate()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
In the latest APM describing AVIC support for 4k vCPUs, VMCB
AVIC_PHYSICAL_MAX_INDEX (Offset 0xF8) and EXITINFO2.Index are both
updated from 9-bit wide to 12-bit wide fields unconditionally (i.e.,
regardless of AVIC support for 4k vCPUs). Expand
AVIC_PHYSICAL_MAX_INDEX_MASK accordingly.
While AVIC_PHYSICAL_MAX_INDEX_MASK is updated to a 12-bit field, KVM
will limit the max vCPU/APIC ID based on the maximum supported on a
specific processor and enforce that limit during vCPU creation. I.e.,
KVM doesn't need to rely on the mask to ensure that the max APIC ID being
programmed in the VMCB is in range. The additional bits (11:9) were
previously marked reserved and were never set/read by older processors.
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/a24ae953cea716bf9c56c136f7ca4bf5e97b1080.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
The lower 9-bit field in EXITINFO2 represents an index into the AVIC
Physical/Logical APIC ID table for a AVIC_INCOMPLETE_IPI #VMEXIT. Since
the index into the Logical APIC ID table is just 8 bits, this field is
actually bound by the bit-width of the index into the AVIC Physical ID
table which is represented by AVIC_PHYSICAL_MAX_INDEX_MASK. So, use that
macro to mask EXITINFO2.Index instead of hard coding 0x1FF in
avic_incomplete_ipi_interception().
Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/95795f449c68bffcb3e1789ee2b0b7393711d37d.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM allows VMMs to specify the maximum possible APIC ID for a virtual
machine through KVM_CAP_MAX_VCPU_ID capability so as to limit data
structures related to APIC/x2APIC. Utilize the same to set the AVIC
physical max index in the VMCB, similar to VMX. This helps hardware
limit the number of entries to be scanned in the physical APIC ID table
speeding up IPI broadcasts for virtual machines with smaller number of
vCPUs.
Unlike VMX, SVM AVIC requires a single page to be allocated for the
Physical APIC ID table and the Logical APIC ID table, so retain the
existing approach of allocating those during VM init.
Signed-off-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/adb07ccdb3394cd79cb372ba6bcc69a4e4d4ef54.1757009416.git.naveen@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add an off-by-default param, "warn_on_missed_cc", to have KVM WARN on a
missed VMX Consistency Check on nested VM-Enter, specifically so that KVM
developers and maintainers can more easily detect missing checks. KVM's
goal/intent is that KVM detect *all* VM-Fail conditions in software, as
relying on hardware leads to false passes when KVM's nested support is a
subset of hardware support, e.g. see commit 095686e6fc ("KVM: nVMX:
Check vmcs12->guest_ia32_debugctl on nested VM-Enter").
With one notable exception, KVM now detects all VM-Fail scenarios for
which there is known test coverage, i.e. KVM developers can enable the
param and expect a clean run, and thus can use the param to detect missed
checks, e.g. when enabling new features, when writing new tests, etc.
The one exception is an unfortunate consistency check on vTPR. Because
the vTPR for L2 comes from the virtual APIC page provided by L1, L2's vTPR
is fully writable at all times, i.e. is inherently subject to TOCTOU
issues with respect to checks in software versus consumption in hardware.
Further complicating matters is KVM's deferred handling of vmcs12 pages
when loading nested state; KVM flat out cannot check vTPR during
KVM_SET_NESTED_STATE without breaking setups that do on-demand paging,
e.g. for live migration and/or live update.
To fudge around the vTPR issue, add a "late" controls check for vTPR and
also treat an invalid virtual APIC as VM-Fail, but gate the check on
warn_on_missed_cc being enabled to avoid unwanted false positives, i.e. to
avoid breaking KVM in production.
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250919005955.1366256-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Remove nested_early_check and all associated code, as it's quite obviously
not being used or tested (it's been broken for 4+ years without a single
bug report). More importantly, KVM's software-based consistency checks
have matured since the option to do hardware-based checks was added; KVM
appears to be missing only _one_ consistency check, on vTPR. And even
*more* importantly, that consistency check can't be prevented by an early
hardware check due to L1 being able to modify the virtual APIC at any
time, i.e. there's an inherent TOCTOU flaw that could cause KVM to "miss"
a consistency check VM-Fail, regardless of whether the check is performed
by software or by hardware.
In other words, KVM _must_ be able to unwind from a late VM-Fail (which
was a big motivation for doing early checks). I.e. now that KVM provides
(almost) all necessary consistency checks, what's really needed is a way
to detect missing checks in KVM, not a way to avoid having to unwind from
a late VM-Fail. And that can be done much more simply, e.g. by an simple
module param to guard a WARN (which, sadly, must be off-by-default to
avoid splats due to the aforementioned TOCTOU issue).
For all intents and purposes, this reverts commit 52017608da ("KVM:
nVMX: add option to perform early consistency checks via H/W").
Link: https://lore.kernel.org/r/20250919005955.1366256-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
If KVM is doing "early" nested VM-Enter consistency checks and TSC scaling
is supported, stuff vmcs02's TSC Multiplier early on to avoid getting a
false positive VM-Fail due to trying to do VM-Enter with TSC_MULTIPLIER=0.
To minimize complexity around L1 vs. L2 TSC, KVM sets the actual TSC
Multiplier rather late during VM-Entry, i.e. may have '0' at the time of
early consistency checks.
If vmcs12 has TSC Scaling enabled, use the multiplier from vmcs12 so that
nested early checks actually check vmcs12 state, otherwise throw in an
arbitrary value of '1' (anything non-zero is legal).
Fixes: d041b5ea93 ("KVM: nVMX: Enable nested TSC scaling")
Link: https://lore.kernel.org/r/20250919005955.1366256-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a missing consistency check on the TPR Threshold. Per the SDM
If the "use TPR shadow" VM-execution control is 1 and the "virtual-
interrupt delivery" VM-execution control is 0, bits 31:4 of the TPR
threshold VM-execution control field must be 0.
Note, nested_vmx_check_tpr_shadow_controls() bails early if "use TPR
shadow" is 0.
Link: https://lore.kernel.org/r/20250919005955.1366256-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use the role for the to-be-loaded/invalidated EPT root to compute the
root's level and A/D enablement instead of pulling the information from
the vCPU (e.g. by passing in the root level and querying vmcs12). Not
making unnecessary assumptions about the root will allow invalidating
arbitrary EPT roots (which sadly requires a full EPTP) at any given time.
No functional change intended (the end result should be the same).
Link: https://lore.kernel.org/r/20250919005955.1366256-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move the helpers to get/query a dummy root from mmu_internal.h to spte.h
so that VMX can detect and handle dummy roots when constructing EPTPs.
This will allow using the root's role to build the EPTP instead of pulling
equivalent information out of the vCPU structure.
No functional change intended.
Link: https://lore.kernel.org/r/20250919005955.1366256-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Hardcode the dummy EPTP used for "early" consistency checks as there's no
need to use 5-level EPT based on the guest.MAXPHYADDR (the EPTP just needs
to be valid, it's never truly consumed).
This will allow breaking construct_eptp()'s dependency on having access to
the vCPU, which in turn will (much further in the future) allow for eliding
per-root TLB flushes when a vCPU is migrated between pCPUs (a flush is
need if and only if that particular pCPU hasn't already flushed the vCPU's
roots).
Link: https://lore.kernel.org/r/20250919005955.1366256-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Move construct_eptp() further up in vmx.c so that it's above
vmx_flush_tlb_current(), its "first" user in vmx.c. This will allow a
future patch to opportunistically make construct_eptp() local to vmx.c.
No functional change intended.
Link: https://lore.kernel.org/r/20250919005955.1366256-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
th1520 support Zfh ISA extension.
It supports the same RISC-V extensions as SG2042.
commit cb074bed11 ("riscv: dts: sophgo: add zfh for sg2042")
Signed-off-by: Han Gao <rabenda.cn@gmail.com>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Signed-off-by: Drew Fustini <fustini@kernel.org>
Existing rv64 hardware conforms to the rva20 profile.
Ziccrse is an additional extension required by the rva20 profile, so
th1520 has this extension.
Signed-off-by: Han Gao <rabenda.cn@gmail.com>
Reviewed-by: Drew Fustini <fustini@kernel.org>
Signed-off-by: Drew Fustini <fustini@kernel.org>
arm64: Xilinx SOC changes for 6.18
firmware:
- Add debugfs interface
- Wire versal-net compatible string
- Change SOC family detection
* tag 'zynqmp-soc-for-6.18' of https://github.com/Xilinx/linux-xlnx:
drivers: firmware: xilinx: Switch to new family code in zynqmp_pm_get_family_info()
drivers: firmware: xilinx: Add unique family code for all platforms
firmware: xilinx: Add Versal NET platform compatible string
firmware: xilinx: Add debugfs support for PM_GET_NODE_STATUS
Enable AMD APML related features
- add amd sbrmi node for SoC power reading
- add amd sbtsi node for SoC temperature reading
- rename the P0_I3C_APML_ALERT_L GPIO to align with the naming
convention expected by the AMD APML tool
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Add GPIO line name for userspace control or monitoring
- Add leak-related line names to report chassis leak event
- Add debug-card-mux to control debug card access
- Add FM_MAIN_PWREN_RMC_EN_ISO_R to receive RMC power control signal
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Add a 'bmc_ready_noled' LED on GPIOB3 with GPIO_TRANSITORY to ensure its
state resets on BMC reboot.
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Add the mctp-controller property and MCTP node to enable frontend NIC
management via PLDM over MCTP.
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Add missing blank lines between DT nodes to follow the devicetree coding
style and improve readability.
No functional changes.
Signed-off-by: Fred Chen <fredchen.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Add device tree for the Meta (Facebook) Yosemite5 compute node,
based on the AST2600 BMC.
The Yosemite5 platform provides monitoring of voltages, power,
temperatures, and other critical parameters across the motherboard,
CXL board, E1.S expansion board, and NIC components. The BMC also
logs relevant events and performs appropriate system actions in
response to abnormal conditions.
Signed-off-by: Kevin Tung <kevin.tung.openbmc@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Add the set of power domain IDs available on the Tegra264 SoC so that
they can be used in device tree files.
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Enable support for the CEC interface of the Synopsys DesignWare HDMI QP
IP block.
This is used by all boards based on RK3588 & RK3576 SoCs.
Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Modern AMD CPUs do not support segment limit checks in 64-bit mode
(i.e. EFER.LMSLE must be zero). Do not allow a guest to set EFER.LMSLE
on a CPU that requires the bit to be zero.
For backwards compatibility, allow EFER.LMSLE to be set on CPUs that
support segment limit checks in 64-bit mode, even though KVM's
implementation of the feature is incomplete (e.g. KVM's emulator does
not enforce segment limits in 64-bit mode).
Fixes: eec4b140c9 ("KVM: SVM: Allow EFER.LMSLE to be set with nested svm")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20251001001529.1119031-3-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Enable the GMAC0 node for the Agilex5 device when using the NAND
daughter card.
Signed-off-by: Boon Khai Ng <boon.khai.ng@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add spi-tx-bus-width and spi-rx-bus-width properties with value 4 to the
agilex5 device tree. This update configures the SPI controller to use a
4-bit bus width for both transmission and reception, potentially improving
SPI throughput and matching the hardware capabilities more closely.
Signed-off-by: Fong, Yan Kei <yan.kei.fong@altera.com>
Reviewed-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add spi-tx-bus-width and spi-rx-bus-width properties with value 4 to the
agilex device tree. This update configures the SPI controller to use a
4-bit bus width for both transmission and reception, potentially improving
SPI throughput and matching the hardware capabilities more closely.
Signed-off-by: Fong, Yan Kei <yan.kei.fong@altera.com>
Reviewed-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add spi-tx-bus-width and spi-rx-bus-width properties with value 4 to the
stratix10 device tree. This update configures the SPI controller to use a
4-bit bus width for both transmission and reception, potentially improving
SPI throughput and matching the hardware capabilities more closely.
Signed-off-by: Fong, Yan Kei <yan.kei.fong@altera.com>
Reviewed-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Add spi-tx-bus-width and spi-rx-bus-width properties with value 4 to the
n5x device tree. This update configures the SPI controller to use a 4-bit
bus width for both transmission and reception, potentially improving SPI
throughput and matching the hardware capabilities more closely.
Signed-off-by: Fong, Yan Kei <yan.kei.fong@altera.com>
Reviewed-by: Khairul Anuar Romli <khairul.anuar.romli@altera.com>
Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
CPUID.80000008H:EBX.EferLmsleUnsupported[bit 20] is a defeature
bit. When this bit is clear, EFER.LMSLE is supported. When this bit is
set, EFER.LMLSE is unsupported. KVM has never _emulated_ EFER.LMSLE, so
KVM cannot truly support a 0-setting of this bit.
However, KVM has allowed the guest to enable EFER.LMSLE in hardware
since commit eec4b140c9 ("KVM: SVM: Allow EFER.LMSLE to be set with
nested svm"), i.e. KVM partially virtualizes long-mode segment limits _if_
they are supported by the underlying hardware.
Pass through the bit in KVM_GET_SUPPORTED_CPUID to advertise the
unavailability of EFER.LMSLE to userspace based on the raw underlying
hardware. Attempting to enable EFER.LSMLE on such CPUs simply doesn't
work, e.g. immediately crashes on VMRUN.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Link: https://lore.kernel.org/r/20251001001529.1119031-2-jmattson@google.com
[sean: add context about partial virtualization, use PASSTHROUGH_F]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Mark the VMCB_NPT bit as dirty in nested_vmcb02_prepare_save()
on every nested VMRUN.
If L1 changes the PAT MSR between two VMRUN instructions on the same
L1 vCPU, the g_pat field in the associated vmcb02 will change, and the
VMCB_NPT clean bit should be cleared.
Fixes: 4bb170a543 ("KVM: nSVM: do not mark all VMCB02 fields dirty on nested vmexit")
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250922162935.621409-3-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Mark the VMCB_PERM_MAP bit as dirty in nested_vmcb02_prepare_control()
on every nested VMRUN.
If L1 changes MSR interception (INTERCEPT_MSR_PROT) between two VMRUN
instructions on the same L1 vCPU, the msrpm_base_pa in the associated
vmcb02 will change, and the VMCB_PERM_MAP clean bit should be cleared.
Fixes: 4bb170a543 ("KVM: nSVM: do not mark all VMCB02 fields dirty on nested vmexit")
Reported-by: Matteo Rizzo <matteorizzo@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250922162935.621409-2-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The rk3588 evb2 board has a full size DisplayPort connector, enable
for it.
Signed-off-by: Chaoyi Chen <chaoyi.chen@rock-chips.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
The NanoPi R76S (as "R76S") is an open-sourced mini IoT gateway
device with two 2.5G, designed and developed by FriendlyElec.
Specification:
- Rockchip RK3576
- 2/4GB LPDDR4X RAM
- 2x 2500Base-T (PCIe, rtl8125b)
- 3x LEDs (Power, LAN, WAN)
- 32GB eMMC
- MicroSD Slot
- MDMI 1.4/2.0 OUT
- M.2 E-Key SDIO slot
- USB 3.0 Port
- USB Type-C 5V Power
Signed-off-by: Tianling Shen <cnsztl@gmail.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
The NanoPi R76S (as "R76S") is an open-sourced mini IoT gateway
device with two 2.5G, designed and developed by FriendlyElec.
Add devicetree binding documentation for the FriendlyElec NanoPi R76S
board.
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Tianling Shen <cnsztl@gmail.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Enable Rockchip specific extensions for Synopsys DesignWare DisplayPort
driver. This is used to provide DisplayPort output support for many boards
based on RK3588 SoC.
Signed-off-by: Andy Yan <andyshrk@163.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Add the Designware MIPI DSI controller and it's port nodes.
Signed-off-by: WeiHao Li <cn.liweihao@gmail.com>
[removed endpoint address, as there is only one vop leading to DSI]
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
RK3368 has a InnoSilicon D-PHY which supports DSI/LVDS/TTL with maximum
trasnfer rate of 1 Gbps per lane.
Signed-off-by: WeiHao Li <cn.liweihao@gmail.com>
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
The syscon compatible for the hifsys node for Mediatek MT2701/MT7623 SoC
was wrongly added following the pattern of other clock node but it's
actually not needed as the register are not used by other device on the
SoC.
On top of this it's against the schema for hifsys YAML and causes a
dtbs_check warning.
Drop the "syscon" compatible to mute the warning and reflect the
compatible property described in the mediatek,mt2701-hifsys.yaml schema.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Use get_user() to retrieve the number of entries instead of allocating
memory for 'init_vm' with the maximum size, copying 'cmd->data' to it,
only to then read the actual entry count 'cpuid.nent' from the copy.
Use memdup_user() to allocate just enough memory to fit all entries and
to copy 'cmd->data' from userspace. Use struct_size() instead of
manually calculating the number of bytes to allocate and copy.
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20250916213129.2535597-2-thorsten.blum@linux.dev
[sean: s/user_init_vm/user_data]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Explicitly request the use of per-CPU queues for the irqfd cleanup
workqueue in preparation for changing the default behavior of
alloc_workqueue() from per-CPU to unbound, which will in turn allow for
the removal of WQ_UNBOUND. See commit 930c2ea566 ("workqueue: Add new
WQ_PERCPU flag") for details.
No functional change intended.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Link: https://lore.kernel.org/r/20250905091139.110677-2-marco.crivellari@suse.com
[sean: rewrite changelog to tailor it to the KVM]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Holding a reference to a device does not prevent its driver data from
going away so there is no point in keeping the reference after looking
up the sart device.
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sven Peter <sven@kernel.org>
Make sure to drop the reference taken to the mbox platform device when
looking up its driver data.
Note that holding a reference to a device does not prevent its driver
data from going away so there is no point in keeping the reference.
Fixes: 6e1457fcad ("soc: apple: mailbox: Add ASC/M3 mailbox driver")
Cc: stable@vger.kernel.org # 6.8
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sven Peter <sven@kernel.org>
Update the APMU clock controller's compatible to allow the new power
domain driver to probe. Also add the first two power domain consumers:
IOMMU (fixes probing) and framebuffer.
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
The board is known to have 1 GiB of DRAM with the first 16 MiB unusable.
Instead of relying on the bootloader to fill in the memory node, do it
ourselves.
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
Most of the memory marked as reserved is actually usable. Delete its
reserved-memory nodes so that the memory can be used.
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
The ramoops memory region is the same for all boards based on the SoC.
Move its node to the appropriate dtsi.
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
The board has a vibrator hooked up to PWM3. Add a node for it and its
associated pinctrl configuration.
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
Commit a41fcca4b3 ("mmc: sdhci-pxav3: set NEED_RSP_BUSY capability")
fixed eMMC probing on this board. Enable the eMMC and add its pinctrl.
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
Right now, the CD GPIO is defined as active high with a cd-inverted
property. Just define the GPIO as active low instead.
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
Set some basic properties of the SDIO card of the samsung,coreprimevelte
smartphone.
The SDIO is used as an interface for WiFi, Bluetooth and FM radio
serviced by the Marvell 88W8777 (SD8777) chipset. Support for this
chipset is currently not in-tree because the firmware is not available
in linux-firmware, however it is possible to trivially run it
out-of-tree using the mwifiex and Marvell Bluetooth drivers with some
caveats.
Link: https://lore.kernel.org/r/20231029111807.19261-1-balejk@matfyz.cz/
Signed-off-by: Karel Balej <balejk@matfyz.cz>
Reviewed-by: Duje Mihanović <duje@dujemihanovic.xyz>
[Duje: fix formatting of pins_0 and fast_pins_1 pin arrays]
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
Bind touchscreen for the samsung,coreprimevelte smartphone. The
downstream code sets the VDD voltage to the exact value of 3.1 V,
however it's been empirically verified that the lower bound used here
sufficies for the proper operation of the chip and is thus used for
power-saving purposes.
Signed-off-by: Karel Balej <balejk@matfyz.cz>
Reviewed-by: Duje Mihanović <duje@dujemihanovic.xyz>
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
Bind power management chip to the samsung,coreprimevelte smartphone.
This enables support for onkey and RTC as well as for regulators two of
which are explicitly bound here to the SD card.
Signed-off-by: Karel Balej <balejk@matfyz.cz>
Reviewed-by: Duje Mihanović <duje@dujemihanovic.xyz>
Signed-off-by: Duje Mihanović <duje@dujemihanovic.xyz>
This smartphone uses a MediaTek MT6582 system-on-chip with 512MB of RAM.
It can currently boot into initramfs with a working UART and
Simple Framebuffer using already initialized panel by the bootloader.
Signed-off-by: Cristian Cozzolino <cristian_ci@protonmail.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Popular cheap PWM fans for this machine, like the ones coming in
heatsink+fan combos will not work properly at the currently defined
medium speed. Trying different pwm setting using a command
echo $value > /sys/devices/platform/pwm-fan/hwmon/hwmon1/pwm1
I found:
pwm1 value fan rotation speed cpu temperature notes
-----------------------------------------------------------------
0 maximal 31.5 Celsius too noisy
40 optimal 35.2 Celsius no noise hearable
95 minimal
above 95 does not rotate 55.5 Celsius
-----------------------------------------------------------------
Thus only cpu-active-high and cpu-active-low modes are usable.
I think this is wrong.
This patch fixes cpu-active-medium settings for bpi-r3 board.
I know, the patch is not ideal as it can break pwm fan for some users.
Likely this is the only official mt7986-bpi-r3 heatsink+fan solution
available on the market.
This patch may not be enough. Users may wants to tweak their thermal_zone0
trip points, thus tuning fan rotation speed depending on cpu temperature.
That can be done on the base of the following example:
=== example =========
# cpu temperature below 25 Celsius degrees, no rotation
echo 25000 > /sys/class/thermal/thermal_zone0/trip_point_4_temp
# cpu temperature in [25..32] Celsius degrees, normal rotation speed
echo 32000 > /sys/class/thermal/thermal_zone0/trip_point_3_temp
# cpu temperature above 50 Celsius degrees, max rotation speed
echo 50000 > /sys/class/thermal/thermal_zone0/trip_point_2_temp
=====================
Signed-off-by: Mikhail Kshevetskiy <mikhail.kshevetskiy@iopsys.eu>
Acked-by: Frank Wunderlich <frank-w@public-files.de>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Enable the UFS related settings to support Genio 1200 EVK UFS board.
This board uses UFS as the boot device and also the main storage.
This includes support for:
- CONFIG_SCSI_UFS_MEDIATEK
Signed-off-by: Macpaul Lin <macpaul.lin@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
The efuse of the MediaTek MT7988 contains a 16-byte unique identifier.
Add a 'soc-uuid' cell covering those 16 bytes to the nvmem defininition
of the efuse to allow easy access from userspace, eg. to generate a
persistent random MAC address on boards like the BananaPi R4 which
doesn't have any factory-assigned addresses.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
The efuse of the MediaTek MT7981 contains a 16-byte unique identifier.
Add a 'soc-uuid' cell covering those 16 bytes to the nvmem defininition
of the efuse to allow easy access from userspace, eg. to generate a
persistent random MAC address on boards which don't have any
factory-assigned addresses.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
The efuse of the MediaTek MT7986 contains an 8-byte unique identifier.
Add a 'soc-uuid' cell covering those 8 bytes to the nvmem defininition
of the efuse to allow easy access from userspace, eg. to generate a
persistent random MAC address on boards like the BananaPi R3 which
doesn't have any factory-assigned addresses.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
The efuse of the MediaTek MT7622 contains an 8-byte unique identifier.
Add a 'soc-uuid' cell covering those 8 bytes to the nvmem defininition
of the efuse to allow easy access from userspace, eg. to generate a
persistent random MAC address on boards like the BananaPi R64 which
doesn't have any factory-assigned addresses.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Daniel Golle <daniel@makrotopia.org>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Use the new uart0 label for the console and make the speed explicit by
setting stdout-path = "serial0:115200n8" under /chosen. This keeps the
DTS OS-agnostic: no bootargs or distribution-specific properties are
added.
Drop the 'current-speed' property from uart0 as it is not allowed by the
mediatek UART binding. The baud is already provided via stdout-path.
Verification: Boot-tested with mainline Image+DTB via U-Boot on OpenWrt
One (MT7981B). Serial console active at 115200, DTB decompile
confirms serial0 alias and stdout-path set correctly.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202509211032.0rJjPoYE-lkp@intel.com/
Signed-off-by: Bryan Hinton <bryan@bryanhinton.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Add stable labels (uart0, uart1, uart2) to the MT7981B SoC UART nodes so
board DTS files can reference them directly. This change is purely
cosmetic and introduces no functional differences.
Verification: Built dtbs and boot-tested mainline Image+DTB via U-Boot on
MT7981B hardware; decompiled DT shows the uart0 label present and the
serial0 alias (or absolute path) resolves to serial@11002000.
Signed-off-by: Bryan Hinton <bryan@bryanhinton.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Add a basic device-tree (mt8395-genio-1200-evk-ufs.dts) in order to be able
to use UFS storage as the main storage on Genio 1200 EVK board.
This board is the origin Genio 1200 EVK already mounted two main storages,
one is eMMC, and the other is UFS. The system automatically prioritizes
between eMMC and UFS via BROM detection, so user could not use both storage
types simultaneously. As a result, mt8395-evk-ufs must be treated as a
separate board.
It use mt8395-genio-common.dtsi file to use common definitions.
Signed-off-by: Macpaul Lin <macpaul.lin@mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Add a compatible string for the MediaTek mt8395-evk-ufs board.
This board is the origin Genio 1200 EVK already mounted two main storages,
one is eMMC, and the other is UFS. The system automatically prioritizes
between eMMC and UFS via BROM detection, so user could not use both storage
types simultaneously. As a result, mt8395-evk-ufs must be treated as a
separate board.
Signed-off-by: Macpaul Lin <macpaul.lin@mediatek.com>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
The DT schema for "mediatek,mt8183-audiosys" expects an
audio-controller node inside the audiosys block. Rename
the nested AFE node from "mt8183-afe-pcm" to
"audio-controller" accordingly.
Also rename the audiosys node itself from "audio-controller" to
"clock-controller" to better reflect its function.
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Julien Massot <julien.massot@collabora.com>
Signed-off-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
On the Orangepi 4A board, the second Ethernet controller, aka the GMAC200,
is connected to an external Motorcomm YT8531 PHY. The PHY uses an external
25MHz crystal, has the SoC's PI15 pin connected to its reset pin, and
the PI16 pin for its interrupt pin.
Enable it.
Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250923140247.2622602-7-wens@kernel.org
Signed-off-by: Chen-Yu Tsai <wens@csie.org>
On the Avaota A1 board, the second Ethernet controller, aka the GMAC200,
is connected to a second external RTL8211F-CG PHY. The PHY uses an
external 25MHz crystal, and has the SoC's PJ16 pin connected to its
reset pin.
Enable the second Ethernet port. Also fix up the label for the existing
external PHY connected to the first Ethernet port.
Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250923140247.2622602-6-wens@kernel.org
Signed-off-by: Chen-Yu Tsai <wens@csie.org>
On the Radxa Cubie A5E board, the second Ethernet controller, aka the
GMAC200, is connected to a second external Maxio MAE0621A PHY. The PHY
uses an external 25MHz crystal, and has the SoC's PJ16 pin connected to
its reset pin.
Enable the second Ethernet port. Also fix up the label for the existing
external PHY connected to the first Ethernet port. An enable delay for the
PHY supply regulator is added to make sure the PHY's internal regulators
are fully powered and the PHY is operational.
Acked-by: Jernej Skrabec <jernej.skrabec@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250923140247.2622602-5-wens@kernel.org
Signed-off-by: Chen-Yu Tsai <wens@csie.org>
On Pixel 6 (and Pro), a Samsung S2MPG10 is used as main PMIC, which
contains the following functional blocks:
* common / speedy interface
* regulators
* 3 clock outputs
* RTC
* power meters
This change adds a node for the clock outputs which are used as inputs
as follows:
* RTC clock for AP
* GNSS receiver, WLAN, Bluetooth
* vibrator, modem
The names have been chosen to match the schematic.
Signed-off-by: André Draszik <andre.draszik@linaro.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Annotate functions writing to PMU registers to online and offline CPUs
as __must_hold() the necessary spinlock for code correctness. These are
static functions so possibility of mistakes is low here, but
__must_hold() serves as self-documenting code.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Exynos9610's product ID is "0xE9610000". Add this ID to the IDs
table along with the name of the SoC.
Signed-off-by: Alexandru Chimac <alex@chimac.ro>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Add a compatible for the "samsung,exynos9610-chipid" node, used by
Exynos9610 platforms.
Signed-off-by: Alexandru Chimac <alex@chimac.ro>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Add syscon nodes for PERIC0 and PERIC1 blocks.
These are required for configuring the USI, SPI and I2C controllers.
Signed-off-by: Denzeel Oliva <wachiturroxd150@gmail.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Define a GPIO expander pin for the HDD LED and expose it via the
LED subsystem. This allows the BMC to control the front panel
HDD activity LED.
Signed-off-by: Leo Wang <leo.jt.wang@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
The Balcones system is similar to Bonnell but with a POWER11 processor.
Like POWER10, the POWER11 is a dual-chip module, so a dual chip FSI
tree is needed. Therefore, split up the quad chip FSI tree.
Signed-off-by: Eddie James <eajames@linux.ibm.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
The Facebook Harma BMC uses I2C1 as an MCTP (Management Component
Transport Protocol) bus. This patch enables the controller by
adding the `mctp-i2c-controller` node under I2C1, with multi-master
support.
Signed-off-by: Daniel Hsu <Daniel-Hsu@quantatw.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Reserve a ramoops memory region in the Yosemite4 device tree so that
kernel panic logs can be preserved across reboots. This helps with
post-mortem debugging and crash analysis.
Signed-off-by: Zane Li <zane_li@wiwynn.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Add the 'shunt-resistor-micro-ohms' property to the LM5066i power
monitors on I2C1 for the Meta Clemente BMC board. This accurately
describes the hardware and is required for proper power monitoring.
Signed-off-by: Leo Wang <leo.jt.wang@gmail.com>
Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au>
Add the bus-width property in &mmc0 node. The Exynos DWMMC driver
assumes bus width to be 8 if not present in devicetree, so at least with
respect to the Linux kernel, this doesn't introduce any functional
changes. But other drivers referring to it may not. Either way, without
the bus-width property the hardware description remains incomplete.
Signed-off-by: Kaustabh Chakraborty <kauschluss@disroot.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Add the bus-width property in &mmc0 node. The Exynos DWMMC driver
assumes bus width to be 8 if not present in devicetree, so at least with
respect to the Linux kernel, this doesn't introduce any functional
changes. But other drivers referring to it may not. Either way, without
the bus-width property the hardware description remains incomplete.
Signed-off-by: Kaustabh Chakraborty <kauschluss@disroot.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Add the bus-width property in &mmc0 node. The Exynos DWMMC driver
assumes bus width to be 8 if not present in devicetree, so at least with
respect to the Linux kernel, this doesn't introduce any functional
changes. But other drivers referring to it may not. Either way, without
the bus-width property the hardware description remains incomplete.
Signed-off-by: Kaustabh Chakraborty <kauschluss@disroot.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Replace "tegra_emc" with "tegra30_emc" in all functions to:
1. Avoid name clashing with other Tegra EMC drivers which makes it
easier to jump to function definitions,
2. Decode the calltraces a bit easier,
3. Unify with other Tegra MC and EMC drivers, which use the SoC model
prefixes.
No functional impact.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Replace "tegra_emc" with "tegra20_emc" in all functions to:
1. Avoid name clashing with other Tegra EMC drivers which makes it
easier to jump to function definitions,
2. Decode the calltraces a bit easier,
3. Unify with other Tegra MC and EMC drivers, which use the SoC model
prefixes.
No functional impact.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Replace "tegra_emc" with "tegra186_emc" in all functions to:
1. Avoid name clashing with other Tegra EMC drivers which makes it
easier to jump to function definitions,
2. Decode the calltraces a bit easier,
3. Unify with other Tegra MC and EMC drivers, which use the SoC model
prefixes.
No functional impact.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Replace "tegra_emc" with "tegra124_emc" in all functions to:
1. Avoid name clashing with other Tegra EMC drivers which makes it
easier to jump to function definitions,
2. Decode the calltraces a bit easier,
3. Unify with other Tegra MC and EMC drivers, which use the SoC model
prefixes.
No functional impact.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Certain calls, like clk_get, can cause probe deferral and driver should
handle it. Use dev_err_probe() to fix that and also change other
non-deferred errors cases to make the code simpler.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Certain calls, like clk_get, can cause probe deferral and driver should
handle it. Use dev_err_probe() to fix that and also change other
non-deferred errors cases to make the code simpler.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Certain calls, like clk_get, can cause probe deferral and driver should
handle it. Use dev_err_probe() to fix that and also change other
non-deferred errors cases to make the code simpler.
Also fix missing new line in error message of devm_devfreq_add_device().
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Certain calls, like clk_get, can cause probe deferral and driver should
handle it. Use dev_err_probe() to fix that and also change other
non-deferred errors cases to make the code simpler.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
icc_node_create() is alloc-like function, so no need to print error
messages on its failure. Dropping one label makes the code a bit
simpler.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
icc_node_create() is alloc-like function, so no need to print error
messages on its failure. Dropping one label makes the code a bit
simpler.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
icc_node_create() is alloc-like function, so no need to print error
messages on its failure. Dropping one label makes the code a bit
simpler.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
icc_node_create() is alloc-like function, so no need to print error
messages on its failure. Dropping one label makes the code a bit
simpler.
Reviewed-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Currently, the family code and subfamily code are derived from the
PMC_TAP_IDCODE register. Versal, Versal NET share the same family
code. Also some platforms share the same subfamily code, making it
difficult to distinguish between platforms. Update
zynqmp_pm_get_family_info() to use IDs derived from the compatible
string instead of silicon ID codes derived from PMC_TAP_IDCODE register.
Signed-off-by: Jay Buddhabhatti <jay.buddhabhatti@amd.com>
Link: https://lore.kernel.org/r/20250701123851.1314531-4-jay.buddhabhatti@amd.com
Signed-off-by: Michal Simek <michal.simek@amd.com>
The family code is currently derived from the PMC_TAP_IDCODE register
value, but there are issues where Versal, Versal NET, and future
platforms share the same family code. Additionally for some platforms
have identical subfamily code, making it challenging to differentiate
between platforms based on the family and subfamily codes. To resolve
this, a new family code member is added to the platform data, initialized
with unique values. This change enables better platform distinction via
the compatible string.
Signed-off-by: Jay Buddhabhatti <jay.buddhabhatti@amd.com>
Link: https://lore.kernel.org/r/20250701123851.1314531-3-jay.buddhabhatti@amd.com
Signed-off-by: Michal Simek <michal.simek@amd.com>
For UHS devices which require tuning, the device tree should have a "cpu_thermal" node which maps to the appropriate thermal zone. This is used to get the temperature of the zone during tuning.
Required properties:
- compatible: Should be "ti,omap2430-sdhci" for omap2430 controllers
Should be "ti,omap3-sdhci" for omap3 controllers
Should be "ti,omap4-sdhci" for omap4 and ti81 controllers
Should be "ti,omap5-sdhci" for omap5 controllers
Should be "ti,dra7-sdhci" for DRA7 and DRA72 controllers
Should be "ti,k2g-sdhci" for K2G
Should be "ti,am335-sdhci" for am335x controllers
Should be "ti,am437-sdhci" for am437x controllers
- ti,hwmods: Must be "mmc<n>", <n> is controller instance starting 1
(Not required for K2G).
- pinctrl-names: Should be subset of "default", "hs", "sdr12", "sdr25", "sdr50",
"ddr50-rev11", "sdr104-rev11", "ddr50", "sdr104",
"ddr_1_8v-rev11", "ddr_1_8v" or "ddr_3_3v", "hs200_1_8v-rev11",
"hs200_1_8v",
- pinctrl-<n> : Pinctrl states as described in bindings/pinctrl/pinctrl-bindings.txt
Optional properties:
- dmas: List of DMA specifiers with the controller specific format as described
in the generic DMA client binding. A tx and rx specifier is required.
- dma-names: List of DMA request names. These strings correspond 1:1 with the
DMA specifiers listed in dmas. The string naming is to be "tx"
and "rx" for TX and RX DMA requests, respectively.
Deprecated properties:
- ti,non-removable: Compatible with the generic non-removable property
Example:
mmc1: mmc@4809c000 {
compatible = "ti,dra7-sdhci";
reg = <0x4809c000 0x400>;
ti,hwmods = "mmc1";
bus-width = <4>;
vmmc-supply = <&vmmc>; /* phandle to regulator node */
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.