Compare commits

...

490 Commits

Author SHA1 Message Date
Linus Torvalds
44fc84337b Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
 "These are the arm64 updates for 6.19.

  The biggest part is the Arm MPAM driver under drivers/resctrl/.
  There's a patch touching mm/ to handle spurious faults for huge pmd
  (similar to the pte version). The corresponding arm64 part allows us
  to avoid the TLB maintenance if a (huge) page is reused after a write
  fault. There's EFI refactoring to allow runtime services with
  preemption enabled and the rest is the usual perf/PMU updates and
  several cleanups/typos.

  Summary:

  Core features:

   - Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
     driver under drivers/resctrl/ which makes use of the fs/resctrl/ API

  Perf and PMU:

   - Avoid cycle counter on multi-threaded CPUs

   - Extend CSPMU device probing and add additional filtering support
     for NVIDIA implementations

   - Add support for the PMUs on the NoC S3 interconnect

   - Add additional compatible strings for new Cortex and C1 CPUs

   - Add support for data source filtering to the SPE driver

   - Add support for i.MX8QM and "DB" PMU in the imx PMU driver

  Memory management:

   - Avoid broadcast TLBI if page reused in write fault

   - Elide TLB invalidation if the old PTE was not valid

   - Drop redundant cpu_set_*_tcr_t0sz() macros

   - Propagate pgtable_alloc() errors outside of __create_pgd_mapping()

   - Propagate return value from __change_memory_common()

  ACPI and EFI:

   - Call EFI runtime services without disabling preemption

   - Remove unused ACPI function

  Miscellaneous:

   - ptrace support to disable streaming on SME-only systems

   - Improve sysreg generation to include a 'Prefix' descriptor

   - Replace __ASSEMBLY__ with __ASSEMBLER__

   - Align register dumps in the kselftest zt-test

   - Remove some no longer used macros/functions

   - Various spelling corrections"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
  arm64/pageattr: Propagate return value from __change_memory_common
  arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
  KVM: arm64: selftests: Consider all 7 possible levels of cache
  KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
  arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
  Documentation/arm64: Fix the typo of register names
  ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
  perf: arm_spe: Add support for filtering on data source
  perf: Add perf_event_attr::config4
  perf/imx_ddr: Add support for PMU in DB (system interconnects)
  perf/imx_ddr: Get and enable optional clks
  perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
  dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
  arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
  arm64: mm: use untagged address to calculate page index
  MAINTAINERS: new entry for MPAM Driver
  arm_mpam: Add kunit tests for props_mismatch()
  arm_mpam: Add kunit test for bitmap reset
  arm_mpam: Add helper to reset saved mbwu state
  ...
2025-12-02 17:03:55 -08:00
Linus Torvalds
2547f79b0b Merge tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:

 - Provide a new interface for dynamic configuration and deconfiguration
   of hotplug memory, allowing configurations with and without memmap_on_memory
   support. This makes the way memory hotplug is handled on s390 much
   more similar to other architectures

 - Remove compat support. There shouldn't be any compat user space
   around anymore, therefore get rid of a lot of code which also doesn't
   need to be tested anymore

 - Add stackprotector support. GCC 16 will get new compiler options,
   which allow generating the code required for kernel stackprotector
   support

 - Merge pai_crypto and pai_ext PMU drivers into a new driver. This
   removes a lot of duplicated code. The new driver is also extendable
   and makes it possible to support new PMUs

 - Add driver override support for AP queues

 - Rework and extend zcrypt and AP trace events to allow for tracing of
   crypto requests

 - Support block sizes larger than 65535 bytes for CCW tape devices

 - Since the rework of the virtual kernel address space the module area
   and the kernel image are within the same 4GB area. This eliminates
   the need for weak per cpu variables. Get rid of
   ARCH_MODULE_NEEDS_WEAK_PER_CPU

 - Various other small improvements and fixes

* tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (92 commits)
  watchdog: diag288_wdt: Remove KMSG_COMPONENT macro
  s390/entry: Use lay instead of aghik
  s390/vdso: Get rid of -m64 flag handling
  s390/vdso: Rename vdso64 to vdso
  s390: Rename head64.S to head.S
  s390/vdso: Use common STABS_DEBUG and DWARF_DEBUG macros
  s390: Add stackprotector support
  s390/modules: Simplify module_finalize() slightly
  s390: Remove KMSG_COMPONENT macro
  s390/percpu: Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU
  s390/ap: Restrict driver_override versus apmask and aqmask use
  s390/ap: Rename mutex ap_perms_mutex to ap_attr_mutex
  s390/ap: Support driver_override for AP queue devices
  s390/ap: Use all-bits-one apmask/aqmask for vfio in_use() checks
  s390/debug: Update description of resize operation
  s390/syscalls: Switch to generic system call table generation
  s390/syscalls: Remove system call table pointer from thread_struct
  s390/uapi: Remove 31 bit support from uapi header files
  s390: Remove compat support
  tools: Remove s390 compat support
  ...
2025-12-02 16:37:00 -08:00
Linus Torvalds
4a21d1b33f Merge tag 'm68k-for-v6.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k update from Geert Uytterhoeven:

 - defconfig update

* tag 'm68k-for-v6.19-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: defconfig: Update defconfigs for v6.18-rc1
2025-12-02 16:32:02 -08:00
Linus Torvalds
d61f1cc5db Merge tag 'x86_cpu_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU feature updates from Dave Hansen:
 "The biggest thing of note here is Linear Address Space Separation
  (LASS). It represents the first time I can think of that the
  upper=>kernel/lower=>user address space convention is actually
  recognized by the hardware on x86. It ensures that userspace can not
  even get the hardware to _start_ page walks for the kernel address
  space. This, of course, is a really nice generic side channel defense.

  This is really only a down payment on LASS support. There are still
  some details to work out in its interaction with EFI calls and
  vsyscall emulation. For now, LASS is disabled if either of those
  features is compiled in (which is almost always the case).

  There's also one straggler commit in here which converts an
  under-utilized AMD CPU feature leaf into a generic Linux-defined leaf
   so more features can be packed in there.

  Summary:

   - Enable Linear Address Space Separation (LASS)

   - Change X86_FEATURE leaf 17 from an AMD leaf to Linux-defined"

* tag 'x86_cpu_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Enable LASS during CPU initialization
  selftests/x86: Update the negative vsyscall tests to expect a #GP
  x86/traps: Communicate a LASS violation in #GP message
  x86/kexec: Disable LASS during relocate kernel
  x86/alternatives: Disable LASS when patching kernel code
  x86/asm: Introduce inline memcpy and memset
  x86/cpu: Add an LASS dependency on SMAP
  x86/cpufeatures: Enumerate the LASS feature bits
  x86/cpufeatures: Make X86_FEATURE leaf 17 Linux-specific
2025-12-02 14:48:08 -08:00
Linus Torvalds
a7610b8465 Merge tag 'x86_entry_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry update from Dave Hansen:
 "This one is pretty trivial: fix a badly-named FRED data structure
  member"

* tag 'x86_entry_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fred: Fix 64bit identifier in fred_ss
2025-12-02 14:24:21 -08:00
Linus Torvalds
e2aa39b368 Merge tag 'x86_misc_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Dave Hansen:
 "The most significant are some changes to ensure that symbols exported
  for KVM are used only by KVM modules themselves, along with some
  related cleanups.

  In true x86/misc fashion, the other patch is completely unrelated and
  just enhances an existing pr_warn() to make it clear to users how they
  have tainted their kernel when something is mucking with MSRs.

  Summary:

   - Make MSR-induced taint easier for users to track down

   - Restrict KVM-specific exports to KVM itself"

* tag 'x86_misc_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possible
  x86/mm: Drop unnecessary export of "ptdump_walk_pgd_level_debugfs"
  x86/mtrr: Drop unnecessary export of "mtrr_state"
  x86/bugs: Drop unnecessary export of "x86_spec_ctrl_base"
  x86/msr: Add CPU_OUT_OF_SPEC taint name to "unrecognized" pr_warn(msg)
2025-12-02 14:16:42 -08:00
Linus Torvalds
54de197c9a Merge tag 'x86_sgx_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SGX updates from Dave Hansen:
 "The main content here is adding support for the new EUPDATESVN SGX
  ISA. Before this, folks who updated microcode had to reboot before
  enclaves could attest to the new microcode. The new functionality lets
  them do this without a reboot.

  The rest are some nice, but relatively mundane comment and kernel-doc
  fixups.

  Summary:

   - Allow security version (SVN) updates so enclaves can attest to new
     microcode

   - Fix kernel docs typos"

* tag 'x86_sgx_for_6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/sgx: Fix a typo in the kernel-doc comment for enum sgx_attribute
  x86/sgx: Remove superfluous asterisk from copyright comment in asm/sgx.h
  x86/sgx: Document structs and enums with '@', not '%'
  x86/sgx: Add kernel-doc descriptions for params passed to vDSO user handler
  x86/sgx: Add a missing colon in kernel-doc markup for "struct sgx_enclave_run"
  x86/sgx: Enable automatic SVN updates for SGX enclaves
  x86/sgx: Implement ENCLS[EUPDATESVN]
  x86/sgx: Define error codes for use by ENCLS[EUPDATESVN]
  x86/cpufeatures: Add X86_FEATURE_SGX_EUPDATESVN feature flag
  x86/sgx: Introduce functions to count the sgx_(vepc_)open()
2025-12-02 14:03:05 -08:00
Linus Torvalds
c76431e3b5 Merge tag 'x86_mm_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Borislav Petkov:

 - Use the proper accessors when reading CR3 as part of the page level
   transitions (5-level to 4-level, the use case being kexec) so that
   only the physical address in CR3 is picked up and not flags which are
   above the physical mask shift

 - Clean up and unify __phys_addr_symbol() definitions

* tag 'x86_mm_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/libstub: Fix page table access in 5-level to 4-level paging transition
  x86/boot: Fix page table access in 5-level to 4-level paging transition
  x86/mm: Unify __phys_addr_symbol()
2025-12-02 13:32:52 -08:00
Linus Torvalds
a9a10e920e Merge tag 'x86_bugs_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU mitigation updates from Borislav Petkov:

 - Convert the tsx= cmdline parsing to use early_param()

 - Cleanup forward declarations gunk in bugs.c

* tag 'x86_bugs_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Get rid of the forward declarations
  x86/tsx: Get the tsx= command line parameter with early_param()
  x86/tsx: Make tsx_ctrl_state static
2025-12-02 13:27:09 -08:00
Linus Torvalds
cb502f0e5e Merge tag 'x86_sev_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:

 - Largely cleanups along with a change to save XSS to the GHCB
   (Guest-Host Communication Block) in SEV-ES guests so that the
   hypervisor can determine the guest's XSAVES buffer size properly
   and thus support shadow stacks in AMD confidential guests

* tag 'x86_sev_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cc: Fix enum spelling to fix kernel-doc warnings
  x86/boot: Drop unused sev_enable() fallback
  x86/coco/sev: Convert has_cpuflag() to use cpu_feature_enabled()
  x86/sev: Include XSS value in GHCB CPUID request
  x86/boot: Move boot_*msr helpers to asm/shared/msr.h
2025-12-02 13:07:53 -08:00
Linus Torvalds
d748981834 Merge tag 'x86_cleanups_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:

 - The mandatory pile of cleanups the cat drags in every merge window

* tag 'x86_cleanups_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Clean up whitespace in a20.c
  x86/mm: Delete disabled debug code
  x86/{boot,mtrr}: Remove unused function declarations
  x86/percpu: Use BIT_WORD() and BIT_MASK() macros
  x86/cpufeatures: Correct LKGS feature flag description
  x86/idtentry: Add missing '*' to kernel-doc lines
2025-12-02 12:17:47 -08:00
Linus Torvalds
2ae20d6510 Merge tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 resource control updates from Borislav Petkov:

 - Add support for AMD's Smart Data Cache Injection feature which allows
   for direct insertion of data from I/O devices into the L3 cache, thus
   bypassing DRAM and saving its bandwidth; the resctrl side of the
   feature allows the size of the L3 used for data injection to be
   controlled

 - Add Intel Clearwater Forest to the list of CPUs which support
   Sub-NUMA clustering

 - Other fixes and cleanups

* tag 'x86_cache_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  fs/resctrl: Update bit_usage to reflect io_alloc
  fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
  fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
  fs/resctrl: Introduce interface to display io_alloc CBMs
  fs/resctrl: Add user interface to enable/disable io_alloc feature
  fs/resctrl: Introduce interface to display "io_alloc" support
  x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
  x86,fs/resctrl: Detect io_alloc feature
  x86/resctrl: Add SDCIAE feature in the command line options
  x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
  fs/resctrl: Consider sparse masks when initializing new group's allocation
  x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest
2025-12-02 11:55:58 -08:00
Linus Torvalds
2a47c26e55 Merge tag 'x86_microcode_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 microcode loading updates from Borislav Petkov:

 - Add microcode staging support on Intel: it moves the loading of the
   microcode blobs to a non-critical path so that microcode loading
   latencies are kept to a minimum. Actually directing the hardware to
   load the microcode is the only step which is done on the critical path.

   This scheme is also opportunistic as in: on a failure, the machinery
   falls back to normal loading

 - Add the capability to the AMD side of the loader to select one of two
   per-family/model/stepping patches: one is pre-Entrysign and the other
   is post-Entrysign; with the goal to take care of machines which
   haven't updated their BIOS yet - something they should absolutely do
   as this is the only proper Entrysign fix

 - Other small cleanups and fixlets

* tag 'x86_microcode_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode: Mark early_parse_cmdline() as __init
  x86/microcode/AMD: Select which microcode patch to load
  x86/microcode/intel: Enable staging when available
  x86/microcode/intel: Support mailbox transfer
  x86/microcode/intel: Implement staging handler
  x86/microcode/intel: Define staging state struct
  x86/microcode/intel: Establish staging control logic
  x86/microcode: Introduce staging step to reduce late-loading time
  x86/cpu/topology: Make primary thread mask available with SMP=n
2025-12-02 11:35:49 -08:00
Linus Torvalds
a61288200e Merge tag 'ras_core_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS updates from Borislav Petkov:

 - The second part of the AMD MCA interrupts rework after the
   last-minute show-stopper from the last merge window was sorted out.
   After this, the AMD MCA deferred errors, thresholding and corrected
   errors interrupt handlers use common MCA code and are tightly
   integrated into the core MCA code, thereby getting rid of
   considerable duplication. All culminating into allowing CMCI error
   thresholding storms to be detected at AMD too, using the common
   infrastructure

 - Add support for two new MCA bank bits on AMD Zen6 which denote
   whether the error address logged is a system physical address, which
   obviates the need for it to be translated before further error
   recovery can be done

* tag 'ras_core_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Handle AMD threshold interrupt storms
  x86/mce: Do not clear bank's poll bit in mce_poll_banks on AMD SMCA systems
  x86/mce: Add support for physical address valid bit
  x86/mce: Save and use APEI corrected threshold limit
  x86/mce/amd: Define threshold restart function for banks
  x86/mce/amd: Remove redundant reset_block()
  x86/mce/amd: Support SMCA Corrected Error Interrupt
  x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
  x86/mce: Unify AMD DFR handler with MCA Polling
  x86/mce: Unify AMD THR handler with MCA Polling
2025-12-02 11:04:37 -08:00
Linus Torvalds
49219bba01 Merge tag 'edac_updates_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:

 - imh_edac: Add a new EDAC driver for Intel Diamond Rapids and future
   incarnations of this memory controller architecture

 - amd64_edac: Remove the legacy csrow sysfs interface which has been
   deprecated and unused (we assume) for at least a decade

 - Add the capability to fall back to BIOS-provided address translation
   functionality (ACPI PRM) which can be used on systems unsupported by
   the current AMD address translation library

 - The usual fixes, fixlets, cleanups and improvements all over the
   place

* tag 'edac_updates_for_v6.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  RAS/AMD/ATL: Replace bitwise_xor_bits() with hweight16()
  EDAC/igen6: Fix error handling in igen6_edac driver
  EDAC/imh: Setup 'imh_test' debugfs testing node
  EDAC/{skx_comm,imh}: Detect 2-level memory configuration
  EDAC/skx_common: Extend the maximum number of DRAM chip row bits
  EDAC/{skx_common,imh}: Add EDAC driver for Intel Diamond Rapids servers
  EDAC/skx_common: Prepare for skx_set_hi_lo()
  EDAC/skx_common: Prepare for skx_get_edac_list()
  EDAC/{skx_common,skx,i10nm}: Make skx_register_mci() independent of pci_dev
  EDAC/ghes: Replace deprecated strcpy() in ghes_edac_report_mem_error()
  EDAC/ie31200: Fix error handling in ie31200_register_mci
  RAS/CEC: Replace use of system_wq with system_percpu_wq
  EDAC: Remove the legacy EDAC sysfs interface
  EDAC/amd64: Remove NUM_CONTROLLERS macro
  EDAC/amd64: Generate ctl_name string at runtime
  RAS/AMD/ATL: Require PRM support for future systems
  ACPI: PRM: Add acpi_prm_handler_available()
  RAS/AMD/ATL: Return error codes from helper functions
2025-12-02 10:45:50 -08:00
Linus Torvalds
7f8d5f70ff Merge tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core irq cleanup from Thomas Gleixner:
 "Tree wide cleanup of the remaining users of in_irq() which got
  replaced by in_hardirq() and marked deprecated in 2020"

* tag 'core-core-2025-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  treewide: Remove in_irq()
2025-12-02 10:18:49 -08:00
Linus Torvalds
d42e504a55 Merge tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:

 - Prevent a thundering herd problem when the timekeeper CPU is delayed
   and a large number of CPUs compete to acquire jiffies_lock to do the
   update. Limit it to one CPU with a separate "uncontended" atomic
   variable.

 - A set of improvements for the timer migration mechanism:

     - Support imbalanced NUMA trees correctly

     - Support dynamic exclusion of CPUs from the migrator duty to allow
       the cpuset/isolation mechanism to exclude them from handling
       timers of remote idle CPUs

 - The usual small updates, cleanups and enhancements
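
 [ Editor's sketch: the "limit it to one CPU" idea from the first item
   above, shown as a minimal compilable userspace analogy using C11
   atomics. The names (update_in_progress, do_timekeeping_update,
   maybe_update_jiffies) are illustrative and are not the kernel's
   actual tick/jiffies code. ]

        #include <stdatomic.h>
        #include <stdbool.h>

        /* Stand-in for taking jiffies_lock and advancing the time. */
        static void do_timekeeping_update(void)
        {
        }

        /* The "uncontended" gate: only one CPU at a time proceeds. */
        static atomic_bool update_in_progress;

        /* Called from every CPU's tick path. The first caller that wins
         * the flag goes on to do the lock-protected update; everyone
         * else returns immediately instead of piling up on the lock. */
        static void maybe_update_jiffies(void)
        {
                bool expected = false;

                if (!atomic_compare_exchange_strong(&update_in_progress,
                                                    &expected, true))
                        return;

                do_timekeeping_update();
                atomic_store(&update_in_progress, false);
        }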

* tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Exclude isolated cpus from hierarchy
  cpumask: Add initialiser to use cleanup helpers
  sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any
  cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
  timers/migration: Use scoped_guard on available flag set/clear
  timers/migration: Add mask for CPUs available in the hierarchy
  timers/migration: Rename 'online' bit to 'available'
  selftests/timers/nanosleep: Add tests for return of remaining time
  selftests/timers: Clean up kernel version check in posix_timers
  time: Fix a few typos in time[r] related code comments
  time: tick-oneshot: Add missing Return and parameter descriptions to kernel-doc
  hrtimer: Store time as ktime_t in restart block
  timers/migration: Remove dead code handling idle CPU checking for remote timers
  timers/migration: Remove unused "cpu" parameter from tmigr_get_group()
  timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy
  timers/migration: Fix imbalanced NUMA trees
  timers/migration: Remove locking on group connection
  timers/migration: Convert "while" loops to use "for"
  tick/sched: Limit non-timekeeper CPUs calling jiffies update
2025-12-02 09:58:33 -08:00
Linus Torvalds
5028f42416 Merge tag 'timers-clocksource-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource updates from Thomas Gleixner:
 "Updates for clocksource and clockevent drivers:

    - A new driver for the Realtek system timer

   - Prevent the unbinding of timers when the drivers do not support
     that

    - Expand the timer counter readout for the SPRD driver to 64 bits
      to allow IoT device suspend times of more than 36 hours, which
      is the current limit of the 32-bit readout

   - The usual small cleanups, fixes and enhancements all over the
     place"

* tag 'timers-clocksource-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource/drivers: Add Realtek system timer driver
  dt-bindings: timer: Add Realtek SYSTIMER
  clocksource/drivers/stm32-lp: Drop unused module alias
  clocksource/drivers/rda: Add sched_clock_register for RDA8810PL SoC
  clocksource/drivers/nxp-stm: Prevent driver unbind
  clocksource/drivers/nxp-pit: Prevent driver unbind
  clocksource/drivers/arm_arch_timer_mmio: Prevent driver unbind
  clocksource/drivers/nxp-stm: Fix section mismatches
  clocksource/drivers/sh_cmt: Always leave device running after probe
  clocksource/drivers/stm: Fix double deregistration on probe failure
  clocksource/drivers/ralink: Fix resource leaks in init error path
  clocksource/drivers/timer-sp804: Fix read_current_timer() issue when clock source is not registered
  clocksource/drivers/sprd: Enable register for timer counter from 32 bit to 64 bit
2025-12-02 09:54:27 -08:00
Linus Torvalds
9ce62ebbb7 Merge tag 'irq-msi-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull MSI updates from Thomas Gleixner:
 "Updates for [PCI] MSI related code:

   - Remove one variant of PCI/MSI management as all users have been
     converted to use per device domains. That reduces the variants to
     two:

     The modern and the real archaic legacy variant, which keeps the
     usual suspects in the museum category alive.

   - Rework the platform MSI device ID detection mechanism in the ARM
     GIC world to address resource leaks, duplicated code and other
     details. This requires a corresponding preparatory step in the
     PCI/iproc driver.

   - Trivial core code cleanups"

* tag 'irq-msi-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-its: Rework platform MSI deviceID detection
  PCI: iproc: Implement MSI controller node detection with of_msi_xlate()
  genirq/msi: Slightly simplify msi_domain_alloc()
  PCI/MSI: Delete pci_msi_create_irq_domain()
2025-12-02 09:35:59 -08:00
Linus Torvalds
15b87bec89 Merge tag 'irq-drivers-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq driver updates from Thomas Gleixner:
 "Boring updates for interrupt drivers:

   - Support for a couple of new ARM64 and RISCV SoC variants and their
     magic interrupt controllers which either can reuse existing code or
     require quirks due to a botched hardware implementation

   - More section mismatch fixes

   - The usual cleanups and fixes all over the place"

* tag 'irq-drivers-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  irqchip/meson-gpio: Add support for Amlogic S6 S7 and S7D SoCs
  dt-bindings: interrupt-controller: Add support for Amlogic S6 S7 and S7D SoCs
  dt-bindings: interrupt-controller: aspeed,ast2700: Correct #interrupt-cells and interrupts count
  irqchip/aclint-sswi: Add Nuclei UX900 support
  dt-bindings: interrupt-controller: Add Anlogic DR1V90 ACLINT SSWI
  dt-bindings: interrupt-controller: Add Anlogic DR1V90 ACLINT MSWI
  dt-bindings: interrupt-controller: Add Anlogic DR1V90 PLIC
  irqchip/irq-bcm7038-l1: Remove unused reg_mask_status()
  irqchip/sifive-plic: Fix call to __plic_toggle() in M-Mode code path
  irqchip/sifive-plic: Add support for UltraRISC DP1000 PLIC
  irqchip/sifive-plic: Cache the interrupt enable state
  dt-bindings: interrupt-controller: Add UltraRISC DP1000 PLIC
  dt-bindings: vendor-prefixes: Add UltraRISC
  irqchip/qcom-irq-combiner: Rename driver structure
  irqchip/riscv-imsic: Inline imsic_vector_from_local_id()
  irqchip/riscv-imsic: Embed the vector array in lpriv
  irqchip/riscv-imsic: Remove redundant irq_data lookups
  irqchip/ts4800: Drop unused module alias
  irqchip/mvebu-pic: Drop unused module alias
  irqchip/meson-gpio: Drop unused module alias
  ...
2025-12-02 09:32:53 -08:00
Linus Torvalds
6863c8385c Merge tag 'irq-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
 "Updates for the interrupt core and treewide cleanups:

   - Rework of the Per Processor Interrupt (PPI) management on ARM[64]

     PPI support was built under the assumption that the systems are
     homogenous so that the same CPU local device types are connected to
     them. That's unfortunately wishful thinking and created horrible
     workarounds.

     This rework provides affinity management for PPIs so that they can
     be individually configured in the firmware tables and mops up the
     related drivers all over the place.

    - Prevent CPUSET/isolation changes from arbitrarily affining interrupt
     threads to random CPUs, which ignores user or driver settings.

   - Plug a harmless race in the interrupt affinity proc interface,
      which allows a half-updated mask to be seen

   - Adjust the priority of secondary interrupt threads on RT, so that
     the combination of primary and secondary thread emulates the
     hardware interrupt plus thread scenario. Having them at the same
     priority can cause starvation issues in some drivers"

* tag 'irq-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  genirq: Remove cpumask availability check on kthread affinity setting
  genirq: Fix interrupt threads affinity vs. cpuset isolated partitions
  genirq: Prevent early spurious wake-ups of interrupt threads
  genirq: Use raw_spinlock_irq() in irq_set_affinity_notifier()
  genirq/manage: Reduce priority of forced secondary interrupt handler
  genirq/proc: Fix race in show_irq_affinity()
  genirq: Fix percpu_devid irq affinity documentation
  perf: arm_pmu: Kill last use of per-CPU cpu_armpmu pointer
  irqdomain: Kill of_node_to_fwnode() helper
  genirq: Kill irq_{g,s}et_percpu_devid_partition()
  irqchip: Kill irq-partition-percpu
  irqchip/apple-aic: Drop support for custom PMU irq partitions
  irqchip/gic-v3: Drop support for custom PPI partitions
  coresight: trbe: Request specific affinities for per CPU interrupts
  perf: arm_spe_pmu: Request specific affinities for per CPU interrupts
  perf: arm_pmu: Request specific affinities for per CPU NMIs/interrupts
  genirq: Add request_percpu_irq_affinity() helper
  genirq: Allow per-cpu interrupt sharing for non-overlapping affinities
  genirq: Update request_percpu_nmi() to take an affinity
  genirq: Add affinity to percpu_devid interrupt requests
  ...
2025-12-02 09:14:26 -08:00
Linus Torvalds
312f5b1866 Merge tag 'core-debugobjects-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull debugobjects update from Thomas Gleixner:
 "Two small updates for debugobjects:

   - Allow pool refill on RT enabled kernels before the scheduler is up
     and running to prevent pool exhaustion

   - Correct the lockdep override to prevent false positives"

* tag 'core-debugobjects-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  debugobjects: Use LD_WAIT_CONFIG instead of LD_WAIT_SLEEP
  debugobjects: Allow to refill the pool before SYSTEM_SCHEDULING
2025-12-02 09:07:48 -08:00
Linus Torvalds
2b09f480f0 Merge tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull rseq updates from Thomas Gleixner:
 "A large overhaul of the restartable sequences and CID management:

  The recent enablement of RSEQ in glibc resulted in regressions which
  are caused by the related overhead. It turned out that the decision to
  invoke the exit to user work was not really a decision. More or less
  each context switch caused that. There is a long list of small issues
  which sums up nicely and results in a 3-4% regression in I/O
  benchmarks.

  The other detail which caused issues due to extra work in context
  switch and task migration is the CID (memory context ID) management.
   It also requires using a task work to consolidate the CID space,
  which is executed in the context of an arbitrary task and results in
  sporadic uncontrolled exit latencies.

  The rewrite addresses this by:

   - Removing deprecated and long unsupported functionality

   - Moving the related data into dedicated data structures which are
     optimized for fast path processing.

   - Caching values so actual decisions can be made

    - Replacing the current implementation with an optimized inlined
     variant.

   - Separating fast and slow path for architectures which use the
     generic entry code, so that only fault and error handling goes into
     the TIF_NOTIFY_RESUME handler.

   - Rewriting the CID management so that it becomes mostly invisible in
     the context switch path. That moves the work of switching modes
     into the fork/exit path, which is a reasonable tradeoff. That work
     is only required when a process creates more threads than the
     cpuset it is allowed to run on or when enough threads exit after
      that. An artificial thread pool benchmark which triggers this did
      not degrade; it actually improved significantly.

     The main effect in migration heavy scenarios is that runqueue lock
     held time and therefore contention goes down significantly"

* tag 'core-rseq-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  sched/mmcid: Switch over to the new mechanism
  sched/mmcid: Implement deferred mode change
  irqwork: Move data struct to a types header
  sched/mmcid: Provide CID ownership mode fixup functions
  sched/mmcid: Provide new scheduler CID mechanism
  sched/mmcid: Introduce per task/CPU ownership infrastructure
  sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
  sched/mmcid: Provide precomputed maximal value
  sched/mmcid: Move initialization out of line
  signal: Move MMCID exit out of sighand lock
  sched/mmcid: Convert mm CID mask to a bitmap
  cpumask: Cache num_possible_cpus()
  sched/mmcid: Use cpumask_weighted_or()
  cpumask: Introduce cpumask_weighted_or()
  sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
  sched/mmcid: Move scheduler code out of global header
  sched: Fixup whitespace damage
  sched/mmcid: Cacheline align MM CID storage
  sched/mmcid: Use proper data structures
  sched/mmcid: Revert the complex CID management
  ...
2025-12-02 08:48:53 -08:00
Linus Torvalds
1dce50698a Merge tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scoped user access updates from Thomas Gleixner:
 "Scoped user mode access and related changes:

   - Implement the missing u64 user access function on ARM when
     CONFIG_CPU_SPECTRE=n.

     This makes it possible to access a 64bit value in generic code with
     [unsafe_]get_user(). All other architectures and ARM variants
     provide the relevant accessors already.

   - Ensure that ASM GOTO jump label usage in the user mode access
     helpers always goes through a local C scope label indirection
     inside the helpers.

      This is required because compilers do not support an ASM GOTO
      target leaving an auto cleanup scope. GCC silently fails to emit
      the cleanup invocation and CLANG fails the build.

     [ Editor's note: gcc-16 will have fixed the code generation issue
       in commit f68fe3ddda4 ("eh: Invoke cleanups/destructors in asm
       goto jumps [PR122835]"). But we obviously have to deal with clang
       and older versions of gcc, so.. - Linus ]

     This provides generic wrapper macros and the conversion of affected
     architecture code to use them.

   - Scoped user mode access with auto cleanup

     Access to user mode memory can be required in hot code paths, but
     if it has to be done with user controlled pointers, the access is
     shielded with a speculation barrier, so that the CPU cannot
     speculate around the address range check. Those speculation
     barriers impact performance quite significantly.

     This cost can be avoided by "masking" the provided pointer so it is
     guaranteed to be in the valid user memory access range and
     otherwise to point to a guaranteed unpopulated address space. This
     has to be done without branches so it creates an address dependency
     for the access, which the CPU cannot speculate ahead.

     This results in repeating and error prone programming patterns:

              if (can_do_masked_user_access())
                      from = masked_user_read_access_begin((from));
              else if (!user_read_access_begin(from, sizeof(*from)))
                      return -EFAULT;
              unsafe_get_user(val, from, Efault);
              user_read_access_end();
              return 0;
        Efault:
              user_read_access_end();
              return -EFAULT;

      which can be replaced with scopes and automatic cleanup:

              scoped_user_read_access(from, Efault)
                      unsafe_get_user(val, from, Efault);
              return 0;
         Efault:
              return -EFAULT;

   - Convert code which implements the above pattern over to
     scope_user.*.access(). This also corrects a couple of imbalanced
     masked_*_begin() instances which are harmless on most
     architectures, but prevent PowerPC from implementing the masking
     optimization.

   - Add a missing speculation barrier in copy_from_user_iter()"
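
 [ Editor's sketch: a complete helper using the scoped form quoted above.
   Only scoped_user_read_access() and unsafe_get_user() come from the
   text above; the function name, parameter names and the u32 type are
   illustrative, not taken from the tree. ]

        #include <linux/types.h>
        #include <linux/uaccess.h>

        /* Hedged sketch of a converted read helper, mirroring the second
         * pattern quoted in the pull message above. */
        static int read_user_val(u32 __user *from, u32 *dst)
        {
                u32 val;

                scoped_user_read_access(from, Efault)
                        unsafe_get_user(val, from, Efault);
                *dst = val;
                return 0;
        Efault:
                return -EFAULT;
        }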

* tag 'core-uaccess-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
  scm: Convert put_cmsg() to scoped user access
  iov_iter: Add missing speculation barrier to copy_from_user_iter()
  iov_iter: Convert copy_from_user_iter() to masked user access
  select: Convert to scoped user access
  x86/futex: Convert to scoped user access
  futex: Convert to get/put_user_inline()
  uaccess: Provide put/get_user_inline()
  uaccess: Provide scoped user access regions
  arm64: uaccess: Use unsafe wrappers for ASM GOTO
  s390/uaccess: Use unsafe wrappers for ASM GOTO
  riscv/uaccess: Use unsafe wrappers for ASM GOTO
  powerpc/uaccess: Use unsafe wrappers for ASM GOTO
  x86/uaccess: Use unsafe wrappers for ASM GOTO
  uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
  ARM: uaccess: Implement missing __get_user_asm_dword()
2025-12-02 08:01:39 -08:00
Borislav Petkov (AMD)
e2349c5811 Merge remote-tracking branches 'ras/edac-amd-atl', 'ras/edac-drivers' and 'ras/edac-misc' into edac-updates
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-12-01 12:06:08 +01:00
Harry Fellowes
d911fe6e94 x86/boot: Clean up whitespace in a20.c
Remove trailing whitespace on empty lines.

No functional changes.

  [ bp: Massage commit message. ]

Signed-off-by: Harry Fellowes <harryfellowes1@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20250825192832.6444-3-harryfellowes1@gmail.com
2025-11-28 20:29:52 +01:00
Catalin Marinas
edde060637 Merge branch 'for-next/set_memory' into for-next/core
* for-next/set_memory:
  : Fix + documentation for the arm64 change_memory_common()
  arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
  arm64/pageattr: Propagate return value from __change_memory_common
2025-11-28 15:48:03 +00:00
Catalin Marinas
52c4d1d624 Merge branch 'for-next/sysreg' into for-next/core
* for-next/sysreg:
  : arm64 sysreg updates/cleanups
  arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
  KVM: arm64: selftests: Consider all 7 possible levels of cache
  KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
  arm64/sysreg: Add ICH_VMCR_EL2
  arm64/sysreg: Move generation of RES0/RES1/UNKN to function
  arm64/sysreg: Support feature-specific fields with 'Prefix' descriptor
  arm64/sysreg: Fix checks for incomplete sysreg definitions
  arm64/sysreg: Replace TCR_EL1 field macros
2025-11-28 15:47:53 +00:00
Catalin Marinas
17c05cb0ef Merge branches 'for-next/misc', 'for-next/kselftest', 'for-next/efi-preempt', 'for-next/assembler-macro', 'for-next/typos', 'for-next/sme-ptrace-disable', 'for-next/local-tlbi-page-reused', 'for-next/mpam', 'for-next/acpi' and 'for-next/documentation', remote-tracking branch 'arm64/for-next/perf' into for-next/core
* arm64/for-next/perf:
  perf: arm_spe: Add support for filtering on data source
  perf: Add perf_event_attr::config4
  perf/imx_ddr: Add support for PMU in DB (system interconnects)
  perf/imx_ddr: Get and enable optional clks
  perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
  dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
  arch_topology: Provide a stub topology_core_has_smt() for !CONFIG_GENERIC_ARCH_TOPOLOGY
  perf/arm-ni: Fix and optimise register offset calculation
  perf: arm_pmuv3: Add new Cortex and C1 CPU PMUs
  perf: arm_cspmu: fix error handling in arm_cspmu_impl_unregister()
  perf/arm-ni: Add NoC S3 support
  perf/arm_cspmu: nvidia: Add pmevfiltr2 support
  perf/arm_cspmu: nvidia: Add revision id matching
  perf/arm_cspmu: Add pmpidr support
  perf/arm_cspmu: Add callback to reset filter config
  perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores

* for-next/misc:
  : Miscellaneous patches
  arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
  arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
  arm64: mm: use untagged address to calculate page index
  arm64: mm: make linear mapping permission update more robust for patial range
  arm64/mm: Elide TLB flush in certain pte protection transitions
  arm64/mm: Rename try_pgd_pgtable_alloc_init_mm
  arm64/mm: Allow __create_pgd_mapping() to propagate pgtable_alloc() errors
  arm64: add unlikely hint to MTE async fault check in el0_svc_common
  arm64: acpi: add newline to deferred APEI warning
  arm64: entry: Clean out some indirection
  arm64/mm: Ensure PGD_SIZE is aligned to 64 bytes when PA_BITS = 52
  arm64/mm: Drop cpu_set_[default|idmap]_tcr_t0sz()
  arm64: remove unused ARCH_PFN_OFFSET
  arm64: use SOFTIRQ_ON_OWN_STACK for enabling softirq stack
  arm64: Remove assertion on CONFIG_VMAP_STACK

* for-next/kselftest:
  : arm64 kselftest patches
  kselftest/arm64: Align zt-test register dumps

* for-next/efi-preempt:
  : arm64: Make EFI calls preemptible
  arm64/efi: Call EFI runtime services without disabling preemption
  arm64/efi: Move uaccess en/disable out of efi_set_pgd()
  arm64/efi: Drop efi_rt_lock spinlock from EFI arch wrapper
  arm64/fpsimd: Permit kernel mode NEON with IRQs off
  arm64/fpsimd: Don't warn when EFI execution context is preemptible
  efi/runtime-wrappers: Keep track of the efi_runtime_lock owner
  efi: Add missing static initializer for efi_mm::cpus_allowed_lock

* for-next/assembler-macro:
  : arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in headers
  arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-uapi headers
  arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in uapi headers

* for-next/typos:
  : Random typo/spelling fixes
  arm64: Fix double word in comments
  arm64: Fix typos and spelling errors in comments

* for-next/sme-ptrace-disable:
  : Support disabling streaming mode via ptrace on SME only systems
  kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace
  kselftst/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
  arm64/sme: Support disabling streaming mode via ptrace on SME only systems

* for-next/local-tlbi-page-reused:
  : arm64, mm: avoid TLBI broadcast if page reused in write fault
  arm64, tlbflush: don't TLBI broadcast if page reused in write fault
  mm: add spurious fault fixing support for huge pmd

* for-next/mpam: (34 commits)
  : Basic Arm MPAM driver (more to follow)
  MAINTAINERS: new entry for MPAM Driver
  arm_mpam: Add kunit tests for props_mismatch()
  arm_mpam: Add kunit test for bitmap reset
  arm_mpam: Add helper to reset saved mbwu state
  arm_mpam: Use long MBWU counters if supported
  arm_mpam: Probe for long/lwd mbwu counters
  arm_mpam: Consider overflow in bandwidth counter state
  arm_mpam: Track bandwidth counter state for power management
  arm_mpam: Add mpam_msmon_read() to read monitor value
  arm_mpam: Add helpers to allocate monitors
  arm_mpam: Probe and reset the rest of the features
  arm_mpam: Allow configuration to be applied and restored during cpu online
  arm_mpam: Use a static key to indicate when mpam is enabled
  arm_mpam: Register and enable IRQs
  arm_mpam: Extend reset logic to allow devices to be reset any time
  arm_mpam: Add a helper to touch an MSC from any CPU
  arm_mpam: Reset MSC controls from cpuhp callbacks
  arm_mpam: Merge supported features during mpam_enable() into mpam_class
  arm_mpam: Probe the hardware features resctrl supports
  arm_mpam: Add helpers for managing the locking around the mon_sel registers
  ...

* for-next/acpi:
  : arm64 acpi updates
  ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()

* for-next/documentation:
  : arm64 Documentation updates
  Documentation/arm64: Fix the typo of register names
2025-11-28 15:47:12 +00:00
Dev Jain
0c2988aaa4 arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
Consider the following code path:

(1) vmalloc -> (2) set_vm_flush_reset_perms -> (3) set_memory_ro/set_memory_rox
-> .... (4) use the mapping .... -> (5) vfree -> (6) vm_reset_perms
-> (7) set_area_direct_map.
Or, it may happen that we encounter failure at (3) and directly jump to (5).

In both cases, (7) may fail due to linear map split failure. But, we care
about its success *only* for the region which got successfully changed by
(3). Such a region is guaranteed to be pte-mapped.

The TLDR is that (7) will surely succeed for the regions we care about.
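
  [ Editor's sketch: steps (1)-(7) above as schematic code. The APIs named
    (vmalloc, set_vm_flush_reset_perms, set_memory_ro, vfree) are the real
    kernel interfaces mentioned in the text; the surrounding function is
    illustrative and not taken from any caller in the tree. ]

        #include <linux/vmalloc.h>
        #include <linux/set_memory.h>

        static void example_path(void)
        {
                void *p = vmalloc(PAGE_SIZE);            /* (1) */

                if (!p)
                        return;

                set_vm_flush_reset_perms(p);             /* (2) */
                if (set_memory_ro((unsigned long)p, 1))  /* (3) may fail ... */
                        goto out;                        /* ... jump to (5)  */

                /* (4) use the mapping */
        out:
                vfree(p);  /* (5), which ends in vm_reset_perms (6) and
                            * set_area_direct_map (7) */
        }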

Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-28 15:36:40 +00:00
Dev Jain
e5efd56fa1 arm64/pageattr: Propagate return value from __change_memory_common
The rodata=on security measure requires that any code path which does
vmalloc -> set_memory_ro/set_memory_rox must protect the linear map alias
too. Therefore, if such a call fails, we must abort set_memory_* and caller
must take appropriate action; currently we are suppressing the error, and
there is a real chance of such an error arising post commit a166563e7e
("arm64: mm: support large block mapping when rodata=full"). Therefore,
propagate any error to the caller.

Fixes: a166563e7e ("arm64: mm: support large block mapping when rodata=full")
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-28 15:36:40 +00:00
Ben Horgan
27abb1ee5a arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
The define ARM64_FEATURE_FIELD_BITS is now unused and feature id
fields don't always have 4 bits. Remove it.

Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-27 18:17:59 +00:00
Ben Horgan
4138cc63d3 KVM: arm64: selftests: Consider all 7 possible levels of cache
In test_clidr() if an empty cache level is not found then the TEST_ASSERT
will not fire. Fix this by considering all 7 possible levels when iterating
through the hierarchy. Found by inspection.

Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-27 18:16:46 +00:00
Ben Horgan
bf09ee9180 KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
ARM64_FEATURE_FIELD_BITS is set to 4 but not all ID register fields are 4
bits. See for instance ID_AA64SMFR0_EL1. The last user of this define,
ARM64_FEATURE_FIELD_BITS, is the set_id_regs selftest. Its logic assumes
the fields aren't a single bits; assert that's the case and stop using the
define. As there are no more users, ARM64_FEATURE_FIELD_BITS is removed
from the arm64 tools sysreg.h header. A separate commit removes this from
the kernel version of the header.

Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-27 18:16:46 +00:00
Seongsu Park
c86d9f8764 arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
The ATOMIC_FETCH_OP_AND and ATOMIC64_FETCH_OP_AND macros accept 'mb' and
'cl' parameters but never use them in their implementation. These macros
simply delegate to the corresponding andnot functions, which handle the
actual atomic operations and memory barriers.

Signed-off-by: Seongsu Park <sgsu.park@samsung.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-27 18:15:24 +00:00
Sebastian Andrzej Siewior
37de2dbc31 debugobjects: Use LD_WAIT_CONFIG instead of LD_WAIT_SLEEP
fill_pool_map is used to suppress nesting violations caused by acquiring
a spinlock_t (from within the memory allocator) while holding a
raw_spinlock_t. The used annotation is wrong.

LD_WAIT_SLEEP is for always sleeping lock types such as mutex_t.
LD_WAIT_CONFIG is for lock types which are sleeping while spinning on
PREEMPT_RT such as spinlock_t.

Use LD_WAIT_CONFIG as override.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251127153652.291697-3-bigeasy@linutronix.de
2025-11-27 16:55:34 +01:00
Sebastian Andrzej Siewior
06e0ae988f debugobjects: Allow to refill the pool before SYSTEM_SCHEDULING
The pool of free objects is refilled on several occasions such as object
initialisation. On PREEMPT_RT refilling is limited to preemptible
sections due to sleeping locks used by the memory allocator. The system
boots with disabled interrupts so the pool can not be refilled.

If too many objects are initialized and the pool gets empty then
debugobjects disables itself.

Refilling can also happen early in the boot with disabled interrupts as
long as the scheduler is not operational. If the scheduler can not
preempt a task then a sleeping lock can not be contended.

Additionally, allow refilling the pool if the scheduler is not
operational.
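
  [ Editor's sketch: the resulting refill condition, under the assumption
    that it can be expressed as a single predicate. pool_refill_allowed()
    is a hypothetical name, not the actual code structure of the patch. ]

        #include <linux/kernel.h>
        #include <linux/preempt.h>

        static bool pool_refill_allowed(void)
        {
                /* Outside PREEMPT_RT the allocator locks do not sleep. */
                if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                        return true;

                /* On PREEMPT_RT: preemptible sections, or early boot
                 * before the scheduler runs, where a sleeping lock
                 * cannot be contended. */
                return preemptible() || system_state < SYSTEM_SCHEDULING;
        }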

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251127153652.291697-2-bigeasy@linutronix.de
2025-11-27 16:55:34 +01:00
Brendan Jackman
3d1f108845 x86/mm: Delete disabled debug code
This code doesn't run. Since 2008:

  4f9c11dd49 ("x86, 64-bit: adjust mapping of physical pagetables to work with Xen")

the kernel has gained more flexible logging and tracing capabilities;
presumably if anyone wanted to take advantage of this log message they would
have got rid of the "if (0)" so they could use these capabilities.

Since they haven't, just delete it.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251003-x86-init-cleanup-v1-1-f2b7994c2ad6@google.com
2025-11-27 14:32:16 +01:00
Heiko Carstens
283f90b50d watchdog: diag288_wdt: Remove KMSG_COMPONENT macro
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel message
catalog" from 2008 [1] which never made it upstream.

The macro was added to s390 code to allow for an out-of-tree patch which
used this to generate unique message ids. Also this out-of-tree patch doesn't
exist anymore.

Remove the macro in order to get rid of a pointless indirection.

[1] https://lwn.net/Articles/292650/

Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-26 17:34:52 +01:00
Thomas Gleixner
2437f79880 Merge tag 'timers-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/daniel.lezcano/linux into timers/clocksource
Pull clocksource/event changes from Daniel Lezcano:

    - Use 64-bits for timer compensation for IoT usage where the suspend
      time is much longer than what 32-bits can provide (Enlin Mu)

    - Add delay support on sp804 for ARM32 platforms (Stephen Eta Zhou)

    - Fix missing resource release on error in the probe path of the
      ralink driver (Haotian Zhang)

    - Fix double deregistration on probe failure in the NXP STM driver
      (Johan Hovold)

    - Disable runtime PM for the Renesas SH CMT timer because it is
      incompatible with PREEMPT_RT=y (Niklas Söderlund)

    - Fix section mismatches in the NXP STM driver (Johan Hovold)

    - Prevent unbinding the NXP PIT, STM and MMIO ARM Arch timers as
      the code does not support bind/unbind (Johan Hovold)

    - Use the clocksource instead of ticks on the RDA8810PL platform
      (Enlin Mu)

    - Drop the unused module alias for the STM32-LP (Johan Hovold)

    - Add Realtek system timer driver (Hao-Wen Ting)

Link: https://lore.kernel.org/all/9303b790-28d4-4bd9-b01d-28fb05493596@linaro.org
2025-11-26 15:36:52 +01:00
Heiko Carstens
1c93edfd50 s390/entry: Use lay instead of aghik
Use the lay instruction instead of aghik. aghik is only available since
z196, therefore compiling the kernel for z10 results in this error:

   arch/s390/kernel/entry.S: Assembler messages:
   arch/s390/kernel/entry.S:165: Error: Unrecognized opcode: `aghik'

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511261518.nBbQN5h7-lkp@intel.com/
Fixes: f5730d44e0 ("s390: Add stackprotector support")
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-26 12:28:23 +01:00
Hao-Wen Ting
d1780dce95 clocksource/drivers: Add Realtek system timer driver
Add a system timer driver for Realtek SoCs.

This driver registers the 1 MHz global hardware counter on Realtek
platforms as a clock event device. Since this hardware counter starts
counting automatically after SoC power-on, no clock initialization is
required. Because the counter does not stop or get affected by CPU power
down, and it supports oneshot mode, it is typically used as a tick
broadcast timer.
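
  [ Editor's sketch: what registering a fixed 1 MHz oneshot clock event
    device generally looks like, using the standard clockevents API. All
    names, the rating and the delta limits are illustrative; this is not
    the actual Realtek driver code. ]

        #include <linux/clockchips.h>
        #include <linux/cpumask.h>

        /* Hypothetical hook: program 'now + delta' into the compare-match
         * register (hardware access omitted). */
        static int example_set_next_event(unsigned long delta,
                                          struct clock_event_device *ced)
        {
                return 0;
        }

        static struct clock_event_device example_ced = {
                .name           = "example-systimer",
                .features       = CLOCK_EVT_FEAT_ONESHOT,
                .rating         = 300,
                .set_next_event = example_set_next_event,
        };

        static void example_register(void)
        {
                example_ced.cpumask = cpu_possible_mask;
                /* Fixed 1 MHz counter; min/max programmable deltas made up. */
                clockevents_config_and_register(&example_ced, 1000000, 2,
                                                0xffffffff);
        }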

Signed-off-by: Hao-Wen Ting <haowen.ting@realtek.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251126060110.198330-3-haowen.ting@realtek.com
2025-11-26 11:25:15 +01:00
Hao-Wen Ting
40caba2bd0 dt-bindings: timer: Add Realtek SYSTIMER
The Realtek SYSTIMER (System Timer) is a 64-bit global hardware counter
operating at a fixed 1MHz frequency. Thanks to its compare match
interrupt capability, the timer natively supports oneshot mode for tick
broadcast functionality.

Signed-off-by: Hao-Wen Ting <haowen.ting@realtek.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Krzysztof Kozlowski <krzk@kernel.org>
Link: https://patch.msgid.link/20251126060110.198330-2-haowen.ting@realtek.com
2025-11-26 11:25:15 +01:00
Johan Hovold
ed92a968a9 clocksource/drivers/stm32-lp: Drop unused module alias
The driver cannot be built as a module so drop the unused platform
module alias.

Note that platform aliases are not needed for OF probing should it ever
become possible to build the driver as a module.

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://patch.msgid.link/20251111154516.1698-1-johan@kernel.org
2025-11-26 11:25:15 +01:00
Enlin Mu
627f3f3716 clocksource/drivers/rda: Add sched_clock_register for RDA8810PL SoC
The current system log timestamp accuracy is tick based, which can not
meet the usage requirements and needs to reach nanoseconds.
Therefore, the sched_clock_register function needs to be added.

[ dlezcano: Fixed typos ]

Signed-off-by: Enlin Mu <enlin.mu@unisoc.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://patch.msgid.link/20251107063347.3692-1-enlin.mu@linux.dev
2025-11-26 11:25:11 +01:00
Johan Hovold
6a2416892e clocksource/drivers/nxp-stm: Prevent driver unbind
Clockevents cannot be deregistered so suppress the bind attributes to
prevent the driver from being unbound and releasing the underlying
resources after registration.

Even if the driver can currently only be built-in, also switch to
builtin_platform_driver() to prevent it from being unloaded should
modular builds ever be enabled.

Fixes: cec32ac758 ("clocksource/drivers/nxp-timer: Add the System Timer Module for the s32gx platforms")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://patch.msgid.link/20251111153226.579-4-johan@kernel.org
2025-11-26 11:25:03 +01:00
Johan Hovold
e25f964cf4 clocksource/drivers/nxp-pit: Prevent driver unbind
The driver does not support unbinding (e.g. as clockevents cannot be
deregistered) so suppress the bind attributes to prevent the driver from
being unbound and rebound after registration (and disabling the timer
when reprobing fails).

Even if the driver can currently only be built-in, also switch to
builtin_platform_driver() to prevent it from being unloaded should
modular builds ever be enabled.

Fixes: bee33f22d7 ("clocksource/drivers/nxp-pit: Add NXP Automotive s32g2 / s32g3 support")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://patch.msgid.link/20251111153226.579-3-johan@kernel.org
2025-11-26 11:24:57 +01:00
Johan Hovold
6aa10f0e2e clocksource/drivers/arm_arch_timer_mmio: Prevent driver unbind
Clockevents cannot be deregistered so suppress the bind attributes to
prevent the driver from being unbound and releasing the underlying
resources after registration.

Fixes: 4891f01527 ("clocksource/drivers/arm_arch_timer: Add standalone MMIO driver")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20251111153226.579-2-johan@kernel.org
2025-11-26 11:24:47 +01:00
Johan Hovold
b452d2c97e clocksource/drivers/nxp-stm: Fix section mismatches
Platform drivers can be probed after their init sections have been
discarded (e.g. on probe deferral or manual rebind through sysfs) so the
probe function must not live in init. Device managed resource actions
similarly cannot be discarded.

The "_probe" suffix of the driver structure name prevents modpost from
warning about this so replace it to catch any similar future issues.

Fixes: cec32ac758 ("clocksource/drivers/nxp-timer: Add the System Timer Module for the s32gx platforms")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: stable@vger.kernel.org	# 6.16
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://patch.msgid.link/20251017054943.7195-1-johan@kernel.org
2025-11-26 11:24:44 +01:00
Niklas Söderlund
62524f285c clocksource/drivers/sh_cmt: Always leave device running after probe
The CMT device can be used as both a clocksource and a clockevent
provider. The driver tries to be smart and power itself on and off, as
well as enable and disable its clock, when it is not in operation.
This behavior is slightly altered if the CMT is used as an early
platform device, in which case the device is left powered on after probe
but the clock is still enabled and disabled at runtime.

This has worked for a long time, but recent improvements in PREEMPT_RT
and PROVE_LOCKING have highlighted an issue. As the CMT registers itself
as a clockevent provider via clockevents_register_device(), it needs to
use raw spinlocks internally, as this is the context in which the
clockevent framework interacts with the CMT driver. However, while holding
a raw spinlock the CMT driver cannot really manage its power state or
clock with calls to pm_runtime_*() and clk_*(), as these calls end up in
other platform drivers that use regular spinlocks to control power and
clocks.

This mix of spinlock contexts trips a lockdep warning.

    =============================
    [ BUG: Invalid wait context ]
    6.17.0-rc3-arm64-renesas-03071-gb3c4f4122b28-dirty #21 Not tainted
    -----------------------------
    swapper/1/0 is trying to lock:
    ffff00000898d180 (&dev->power.lock){-...}-{3:3}, at: __pm_runtime_resume+0x38/0x88
    ccree e6601000.crypto: ARM CryptoCell 630P Driver: HW version 0xAF400001/0xDCC63000, Driver version 5.0
    other info that might help us debug this:
    ccree e6601000.crypto: ARM ccree device initialized
    context-{5:5}
    2 locks held by swapper/1/0:
     #0: ffff80008173c298 (tick_broadcast_lock){-...}-{2:2}, at: __tick_broadcast_oneshot_control+0xa4/0x3a8
     #1: ffff0000089a5858 (&ch->lock){....}-{2:2}
    usbcore: registered new interface driver usbhid
    , at: sh_cmt_start+0x30/0x364
    stack backtrace:
    CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.17.0-rc3-arm64-renesas-03071-gb3c4f4122b28-dirty #21 PREEMPT
    Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT)
    Call trace:
     show_stack+0x14/0x1c (C)
     dump_stack_lvl+0x6c/0x90
     dump_stack+0x14/0x1c
     __lock_acquire+0x904/0x1584
     lock_acquire+0x220/0x34c
     _raw_spin_lock_irqsave+0x58/0x80
     __pm_runtime_resume+0x38/0x88
     sh_cmt_start+0x54/0x364
     sh_cmt_clock_event_set_oneshot+0x64/0xb8
     clockevents_switch_state+0xfc/0x13c
     tick_broadcast_set_event+0x30/0xa4
     __tick_broadcast_oneshot_control+0x1e0/0x3a8
     tick_broadcast_oneshot_control+0x30/0x40
     cpuidle_enter_state+0x40c/0x680
     cpuidle_enter+0x30/0x40
     do_idle+0x1f4/0x26c
     cpu_startup_entry+0x34/0x40
     secondary_start_kernel+0x11c/0x13c
     __secondary_switched+0x74/0x78

For non-PREEMPT_RT builds this is not really an issue, but on PREEMPT_RT
builds, where normal spinlocks can sleep, it might be. Be cautious and
always leave the power and clock running after probe.

Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20251016182022.1837417-1-niklas.soderlund+renesas@ragnatech.se
2025-11-26 11:24:40 +01:00
Johan Hovold
6b38a8b31e clocksource/drivers/stm: Fix double deregistration on probe failure
The purpose of the devm_add_action_or_reset() helper is to call the
action function in case adding an action ever fails so drop the clock
source deregistration from the error path to avoid deregistering twice.

Fixes: cec32ac758 ("clocksource/drivers/nxp-timer: Add the System Timer Module for the s32gx platforms")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://patch.msgid.link/20251017055039.7307-1-johan@kernel.org
2025-11-26 11:24:37 +01:00
Haotian Zhang
2ba8e2aae1 clocksource/drivers/ralink: Fix resource leaks in init error path
The ralink_systick_init() function does not release all acquired resources
on its error paths. If irq_of_parse_and_map() or a subsequent call fails,
the previously created I/O memory mapping and IRQ mapping are leaked.

Add goto-based error handling labels to ensure that all allocated
resources are correctly freed.
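
As a rough sketch of the unwinding pattern (not the actual ralink code;
the clockevent registration call is an illustrative stand-in):

  #include <linux/io.h>
  #include <linux/irqdomain.h>
  #include <linux/of.h>
  #include <linux/of_address.h>
  #include <linux/of_irq.h>

  static int __init systick_init_sketch(struct device_node *np)
  {
      void __iomem *base;
      int irq, ret;

      base = of_iomap(np, 0);
      if (!base)
          return -ENXIO;

      irq = irq_of_parse_and_map(np, 0);
      if (!irq) {
          ret = -EINVAL;
          goto err_unmap;
      }

      ret = register_systick_clockevent(base, irq); /* illustrative stand-in */
      if (ret)
          goto err_dispose;

      return 0;

  err_dispose:
      irq_dispose_mapping(irq);   /* undo irq_of_parse_and_map() */
  err_unmap:
      iounmap(base);              /* undo of_iomap() */
      return ret;
  }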

Fixes: 1f2acc5a8a ("MIPS: ralink: Add support for systick timer found on newer ralink SoC")
Signed-off-by: Haotian Zhang <vulab@iscas.ac.cn>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://patch.msgid.link/20251030090710.1603-1-vulab@iscas.ac.cn
2025-11-26 11:24:34 +01:00
Stephen Eta Zhou
640594a04f clocksource/drivers/timer-sp804: Fix read_current_timer() issue when clock source is not registered
Register a valid read_current_timer() function for the
SP804 timer on ARM32.

On ARM32 platforms, when the SP804 timer is selected as the clocksource,
the driver does not register a valid read_current_timer() function.
As a result, features that rely on this API—such as rdseed—consistently
return incorrect values.

To fix this, a delay_timer structure is registered during the SP804
driver's initialization. The read_current_timer() function is implemented
using the existing sp804_read() logic, and the timer frequency is reused
from the already-initialized clocksource.
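
For reference, the usual ARM32 shape of this is roughly the following
sketch; the sp804_read() accessor and the rate argument are assumptions
standing in for the driver's existing pieces:

  #include <linux/init.h>
  #include <asm/delay.h>  /* struct delay_timer, register_current_timer_delay() */

  static unsigned long sp804_read_current_timer(void)
  {
      return sp804_read();    /* assumed existing counter read helper */
  }

  static struct delay_timer sp804_delay_timer = {
      .read_current_timer = sp804_read_current_timer,
  };

  static void __init sp804_register_delay_timer(unsigned long rate)
  {
      sp804_delay_timer.freq = rate;  /* reuse the clocksource frequency */
      register_current_timer_delay(&sp804_delay_timer);
  }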

Signed-off-by: Stephen Eta Zhou <stephen.eta.zhou@gmail.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://patch.msgid.link/20250525-sp804-fix-read_current_timer-v4-1-87a9201fa4ec@gmail.com
2025-11-26 11:24:32 +01:00
Enlin Mu
576c564ec3 clocksource/drivers/sprd: Enable register for timer counter from 32 bit to 64 bit
Using a 32-bit counter for suspend compensation, the maximum compensation
time is 36 hours (with a 32 kHz working clock). In some IoT devices the
suspend time may be long, even exceeding 36 hours. Therefore, a 64-bit
timer counter is needed for counting.

Signed-off-by: Enlin Mu <enlin.mu@unisoc.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://patch.msgid.link/20251106021830.34846-1-enlin.mu@linux.dev
2025-11-26 11:24:26 +01:00
Thomas Gleixner
653fda7ae7 sched/mmcid: Switch over to the new mechanism
Now that all pieces are in place, change the implementations of
sched_mm_cid_fork() and sched_mm_cid_exit() to adhere to the new strict
ownership scheme and switch context_switch() over to use the new
mm_cid_schedin() functionality.

The common case is that there is no mode change required, which makes
fork() and exit() just update the user count and the constraints.

In case a new user would exceed the CID space limit, the fork() context
handles the transition to per CPU mode with mm::mm_cid::mutex held. exit()
handles the transition back to per task mode when the user count drops
below the switch back threshold. fork() might also be forced to handle a
deferred switch back to per task mode when an affinity change has increased
the number of allowed CPUs enough.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.280380631@linutronix.de
2025-11-25 19:45:42 +01:00
Thomas Gleixner
9da6ccbcea sched/mmcid: Implement deferred mode change
When affinity changes cause an increase in the number of CPUs allowed for
tasks which are related to an MM, that might result in a situation where
the ownership mode can go back from per CPU mode to per task mode.

As affinity changes happen with the runqueue lock held, there is no way to
do the actual mode change and the required fixup right there.

Add the infrastructure to defer it to a workqueue. The scheduled work can
race with a fork() or exit(). Whatever happens first takes care of it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.216484739@linutronix.de
2025-11-25 19:45:42 +01:00
Thomas Gleixner
c809f081fe irqwork: Move data struct to a types header
... to avoid header recursion hell.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.152813625@linutronix.de
2025-11-25 19:45:41 +01:00
Thomas Gleixner
fbd0e71dc3 sched/mmcid: Provide CID ownership mode fixup functions
CIDs are either owned by tasks or by CPUs. The ownership mode depends on
the number of tasks related to an MM and the number of CPUs on which these
tasks are theoretically allowed to run. Theoretically, because that number
is the superset of the CPU affinities of all tasks, which only grows and
never shrinks.

Switching to per CPU mode happens when the user count becomes greater than
the maximum number of CIDs, which is calculated by:

	opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users);
	max_cids = min(1.25 * opt_cids, nr_cpu_ids);

The +25% allowance is useful for tight CPU masks in scenarios where only a
few threads are created and destroyed, to avoid frequent mode switches.
This allowance shrinks, though, the closer opt_cids gets to nr_cpu_ids,
which is the (unfortunate) hard ABI limit.

At the point of switching to per CPU mode the new user is not yet visible
in the system, so the task which initiated the fork() runs the fixup
function: mm_cid_fixup_tasks_to_cpu() walks the thread list and either
transfers each task's owned CID to the CPU the task runs on or drops it
into the CID pool if a task is not on a CPU at that point in time. Tasks
which schedule in before the task walk reaches them do the handover in
mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it is
guaranteed that no task related to that MM owns a CID anymore.

Switching back to task mode happens when the user count goes below the
threshold which was recorded on the per CPU mode switch:

	pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2);

This threshold is updated when an affinity change increases the number of
allowed CPUs for the MM, which might cause a switch back to per task mode.
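
Purely as an arithmetic illustration of the two limits above (a toy user
space program with hypothetical numbers, treating the +25% as integer
opt_cids/4):

  #include <stdio.h>

  static unsigned int min_u(unsigned int a, unsigned int b)
  {
      return a < b ? a : b;
  }

  int main(void)
  {
      unsigned int nr_cpu_ids = 256;      /* hypothetical machine size */
      unsigned int nr_cpus_allowed = 8;   /* superset of all task affinities */
      unsigned int users = 6;             /* user space threads of the MM */

      unsigned int opt_cids  = min_u(nr_cpus_allowed, users);
      unsigned int max_cids  = min_u(opt_cids + opt_cids / 4, nr_cpu_ids);
      unsigned int pcpu_thrs = min_u(opt_cids - opt_cids / 4, nr_cpu_ids / 2);

      printf("opt_cids=%u max_cids=%u pcpu_thrs=%u\n",
             opt_cids, max_cids, pcpu_thrs);
      return 0;
  }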

If the switch back was initiated by an exiting task, then that task runs
the fixup function. If it was initiated by an affinity change, then it is
run either in the deferred update function in the context of a workqueue,
or by a task which forks a new one, or by a task which exits, whatever
happens first. mm_cid_fixup_cpus_to_task() walks through the possible CPUs
and either transfers the CPU owned CIDs to a related task which runs on the
CPU or drops them into the pool. Tasks which schedule in on a CPU which the
walk did not cover yet do the handover themselves.

This transition from CPU to per task ownership happens in two phases:

 1) mm::mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed into the task
    CID and denotes that the CID is only temporarily owned by the
    task. When it schedules out, the task drops the CID back into the
    pool if this bit is set.

 2) The initiating context walks the per CPU space and after completion
    clears mm::mm_cid.transit. After that point the CIDs are strictly
    task owned again.

This two-phase transition is required to prevent CID space exhaustion
during the transition as a direct transfer of ownership would fail if
two tasks are scheduled in on the same CPU before the fixup freed per
CPU CIDs.

When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID
related to that MM is owned by a CPU anymore.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.088189028@linutronix.de
2025-11-25 19:45:41 +01:00
Thomas Gleixner
9a723ed7fa sched/mmcid: Provide new scheduler CID mechanism
The MM CID management has two fundamental requirements:

  1) It has to guarantee that at no given point in time the same CID is
     used by concurrent tasks in userspace.

  2) The CID space must not exceed the number of possible CPUs in a
     system. While most allocators (glibc, tcmalloc, jemalloc) do not
     care about that, there seems to be at least some LTTng library
     depending on it.

The CID space compaction itself is not a functional correctness
requirement; it is only a useful optimization to reduce the memory
footprint of unused user space pools.

The optimal CID space is:

    min(nr_tasks, nr_cpus_allowed);

Where @nr_tasks is the number of actual user space threads associated with
the mm and @nr_cpus_allowed is the superset of all task affinities. The
latter is grow-only, as it would be insane to take a racy snapshot of all
task affinities when the affinity of one task changes, just to redo it 2
milliseconds later when the next task changes its affinity.

That means that as long as the number of tasks is less than or equal to
the number of CPUs allowed, each task owns a CID. If the number of tasks
exceeds the number of CPUs allowed, it switches to per CPU mode, where the
CPUs own the CIDs and the tasks borrow them as long as they are scheduled
in.

For transition periods CIDs can go beyond the optimal space as long as they
don't go beyond the number of possible CPUs.

The current upstream implementation adds overhead into task migration to
keep the CID with the task. It also has to do the CID space consolidation
work from a task work in the exit to user space path. As that work is
assigned to a random task related to an MM, this can inflict unwanted exit
latencies.

Implement the context switch parts of a strict ownership mechanism to
address this.

This removes most of the work from the task which schedules out. Only
during the transition from per CPU to per task ownership is it required to
drop the CID when leaving the CPU, to prevent CID space exhaustion. Other
than that, scheduling out is just a single check and branch.

The task which schedules in has to check whether:

    1) The ownership mode changed
    2) The CID is within the optimal CID space

In stable situations this results in zero work. The only short disruption
is when ownership mode changes or when the associated CID is not in the
optimal CID space. The latter only happens when tasks exit and therefore
the optimal CID space shrinks.

That mechanism is strictly optimized for the common case where no change
happens. The only case where it actually causes a temporary one-time spike
is on mode changes, when and only when a lot of tasks related to an MM
schedule at exactly the same time and eventually have to compete on
allocating a CID from the bitmap.

In the sysbench test case which triggered the spinlock contention in the
initial CID code, __schedule() drops significantly in perf top on a 128
Core (256 threads) machine when running sysbench with 255 threads, which
fits into the task mode limit of 256 together with the parent thread:

  Upstream  rseq/perf branch  +CID rework
  0.42%     0.37%             0.32%          [k] __schedule

Increasing the number of threads to 256, which puts the test process into
per CPU mode, looks about the same.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.023984859@linutronix.de
2025-11-25 19:45:41 +01:00
Thomas Gleixner
23343b6b09 sched/mmcid: Introduce per task/CPU ownership infrastructure
The MM CID management has two fundamental requirements:

  1) It has to guarantee that at no given point in time the same CID is
     used by concurrent tasks in userspace.

  2) The CID space must not exceed the number of possible CPUs in a
     system. While most allocators (glibc, tcmalloc, jemalloc) do not care
     about that, there seems to be at least librseq depending on it.

The CID space compaction itself is not a functional correctness
requirement; it is only a useful optimization to reduce the memory
footprint of unused user space pools.

The optimal CID space is:

    min(nr_tasks, nr_cpus_allowed);

Where @nr_tasks is the number of actual user space threads associated with
the mm and @nr_cpus_allowed is the superset of all task affinities. The
latter is grow-only, as it would be insane to take a racy snapshot of all
task affinities when the affinity of one task changes, just to redo it 2
milliseconds later when the next task changes its affinity.

That means that as long as the number of tasks is less than or equal to
the number of CPUs allowed, each task owns a CID. If the number of tasks
exceeds the number of CPUs allowed, it switches to per CPU mode, where the
CPUs own the CIDs and the tasks borrow them as long as they are scheduled
in.

For transition periods CIDs can go beyond the optimal space as long as they
don't go beyond the number of possible CPUs.

The current upstream implementation adds overhead into task migration to
keep the CID with the task. It also has to do the CID space consolidation
work from a task work in the exit to user space path. As that work is
assigned to a random task related to an MM, this can inflict unwanted exit
latencies.

This can be done differently by implementing a strict CID ownership
mechanism. Either the CIDs are owned by the tasks or by the CPUs. The
latter provides less locality when tasks are heavily migrating, but there
is no justification to optimize for overcommit scenarios and thereby
penalize everyone else.

Provide the basic infrastructure to implement this:

  - Change the UNSET marker to BIT(31) from ~0U
  - Add the ONCPU marker as BIT(30)
  - Add the TRANSIT marker as BIT(29)

That allows checking for ownership trivially and provides a simple check
for UNSET as well. The TRANSIT marker is required to prevent CID space
exhaustion when switching from per CPU to per task mode.
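
A minimal sketch of how such marker bits allow trivial checks (illustrative
definitions, not the kernel's actual ones):

  #include <stdbool.h>
  #include <stdint.h>

  #define MM_CID_UNSET    (1U << 31)  /* CID not assigned */
  #define MM_CID_ONCPU    (1U << 30)  /* CID owned by the CPU, not the task */
  #define MM_CID_TRANSIT  (1U << 29)  /* temporary ownership during mode switch */

  static inline bool cid_is_unset(uint32_t cid)   { return cid & MM_CID_UNSET; }
  static inline bool cid_cpu_owned(uint32_t cid)  { return cid & MM_CID_ONCPU; }
  static inline bool cid_in_transit(uint32_t cid) { return cid & MM_CID_TRANSIT; }

  static inline uint32_t cid_value(uint32_t cid)
  {
      return cid & ~(MM_CID_UNSET | MM_CID_ONCPU | MM_CID_TRANSIT);
  }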

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251119172549.960252358@linutronix.de
2025-11-25 19:45:41 +01:00
Thomas Gleixner
51dd92c71a sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutex
Prepare for the new CID management scheme, which puts the CID ownership
transition into the fork() and exit() slow path, by serializing
sched_mm_cid_fork()/exit() with a mutex so that task list and CPU mask
walks can be done in interruptible and preemptible code.

The contention on it is not worse than on other concurrency controls in the
fork()/exit() machinery.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.895826703@linutronix.de
2025-11-25 19:45:41 +01:00
Thomas Gleixner
b0c3d51b54 sched/mmcid: Provide precomputed maximal value
Reading mm::mm_users and mm::mm_cid::nr_cpus_allowed every time to compute
the maximal CID value is just wasteful, as that value only changes on
fork(), exit() and eventually when the affinity changes.

So it can be easily precomputed at those points and provided in mm::mm_cid
for consumption in the hot path.

But there is an issue with using mm::mm_users for accounting because that
does not necessarily reflect the number of user space tasks as other kernel
code can take temporary references on the MM which skew the picture.

Solve that by adding a users counter to struct mm_mm_cid, which is modified
by fork() and exit() and used for precomputing under mm_mm_cid::lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.832764634@linutronix.de
2025-11-25 19:45:40 +01:00
Thomas Gleixner
bf070520e3 sched/mmcid: Move initialization out of line
It's getting bigger soon, so just move it out of line to the rest of the
code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.769636491@linutronix.de
2025-11-25 19:45:40 +01:00
Thomas Gleixner
2b1642b881 signal: Move MMCID exit out of sighand lock
There is no need to keep this under sighand lock anymore, as neither the
current code nor the upcoming replacement depends on the exit state of a
task.

That allows using a mutex in the exit path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.706439391@linutronix.de
2025-11-25 19:45:40 +01:00
Thomas Gleixner
539115f08c sched/mmcid: Convert mm CID mask to a bitmap
This is truly a bitmap and just conveniently uses a cpumask because the
maximum size of the bitmap is nr_cpu_ids.

But that prevents searching for a zero bit in a limited range, which is
helpful for providing an efficient mechanism to consolidate the CID space
when the number of users decreases.
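
For illustration, a plain bitmap allows a bounded search like this (a
sketch against the generic bitmap API, not the actual scheduler code):

  #include <linux/bitmap.h>
  #include <linux/bitops.h>

  /* Grab the first free CID below 'limit', or fail. */
  static int pick_cid_below(unsigned long *cid_bitmap, unsigned int limit)
  {
      unsigned int cid = find_first_zero_bit(cid_bitmap, limit);

      if (cid >= limit)
          return -1;
      __set_bit(cid, cid_bitmap);
      return cid;
  }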

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Link: https://patch.msgid.link/20251119172549.642866767@linutronix.de
2025-11-25 19:45:40 +01:00
Thomas Gleixner
35a5c37cb9 cpumask: Cache num_possible_cpus()
Reevaluating num_possible_cpus() over and over does not make sense. It
becomes a constant after init, as cpu_possible_mask is marked ro_after_init.

Cache the value during initialization and provide that for consumption.
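
The caching pattern itself is simple; roughly (an illustrative sketch, not
the actual patch):

  #include <linux/cache.h>
  #include <linux/cpumask.h>
  #include <linux/init.h>

  static unsigned int nr_possible_cpus_cached __ro_after_init;

  static int __init cache_nr_possible_cpus(void)
  {
      /* cpu_possible_mask is ro_after_init, so this never changes again. */
      nr_possible_cpus_cached = num_possible_cpus();
      return 0;
  }
  early_initcall(cache_nr_possible_cpus);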

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251119172549.578653738@linutronix.de
2025-11-25 19:45:40 +01:00
Heiko Carstens
509c34924d s390/vdso: Get rid of -m64 flag handling
The compiler/assembler flag -m64 is added and removed at two locations.
This pointless exercise is a leftover to keep the 31 and 64 bit vdso
Makefiles as symmetrical as possible. Given that the 31 bit vdso code
does not exist anymore, remove the -m64 flag handling.

Suggested-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-25 15:28:08 +01:00
Heiko Carstens
c0087d807a s390/vdso: Rename vdso64 to vdso
Since compat is gone there is only a 64 bit vdso left.
Remove the superfluous "64" suffix everywhere.

Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-25 15:28:07 +01:00
Heiko Carstens
b3bdfdf1f9 s390: Rename head64.S to head.S
All the code is 64 bit, therefore remove the superfluous suffix.

Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-25 15:28:07 +01:00
Jens Remus
5e811b922e s390/vdso: Use common STABS_DEBUG and DWARF_DEBUG macros
This simplifies the vDSO linker script. The ELF_DETAILS macro was not
used in addition, as is done on arm64 and powerpc, because that would
introduce an empty .modinfo section.

Note that this rearranges the .comment section to follow after all of
the debug sections.

Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-25 15:28:07 +01:00
Zenon Xiu
4b7a59fa70 Documentation/arm64: Fix the typo of register names
The register names 'HWFGWTR_EL2' and 'HWFGRTR_EL2' are wrong; they should be 'HFGWTR_EL2' and 'HFGRTR_EL2'.
The register descriptions can be found on the Arm website:
https://developer.arm.com/documentation/ddi0601/2025-09/AArch64-Registers/HFGWTR-EL2--Hypervisor-Fine-Grained-Write-Trap-Register
https://developer.arm.com/documentation/ddi0601/2025-09/AArch64-Registers/HFGRTR-EL2--Hypervisor-Fine-Grained-Read-Trap-Register?lang=en

Signed-off-by: Zenon Xiu <zenonxiu@outlook.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-25 12:26:31 +00:00
Marc Zyngier
155f8d4ef0 ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
Since 0f67b56d84 ("clocksource/drivers/arm_arch_timer_mmio: Switch
over to standalone driver"), acpi_arch_timer_mem_init() is unused.

Remove it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-25 11:55:13 +00:00
Randy Dunlap
73029e73cc x86/cc: Fix enum spelling to fix kernel-doc warnings
Make the enum name in kernel-doc match the code to prevent kernel-doc warnings:

  Warning: include/linux/cc_platform.h:106 Enum value
   'CC_ATTR_GUEST_SEV_SNP' not described in enum 'cc_attr'
  Warning: include/linux/cc_platform.h:106 Excess enum value
   '%CC_ATTR_SEV_SNP' description in 'cc_attr'

Fixes: f742b90e61 ("x86/mm: Extend cc_attr to include AMD SEV-SNP")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251125022730.3163679-1-rdunlap@infradead.org
2025-11-25 09:17:13 +01:00
Nikolay Borisov
69acbdbbef RAS/AMD/ATL: Replace bitwise_xor_bits() with hweight16()
Doing hweight16() and checking whether the LSB is set is functionally
equivalent to what bitwise_xor_bits() does. In addition, it results in
better generated code, as before gcc would inline the function 4 times.
With hweight16(), the resulting code boils down to 2 instructions, POPCNT
and AND, and all relevant CPUs support POPCNT.
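
The equivalence is easy to see in isolation (a sketch, not the driver's
exact code):

  #include <linux/bitops.h>
  #include <linux/types.h>

  /* Parity of a 16-bit value: 1 if an odd number of bits are set. */
  static inline u8 parity16(u16 val)
  {
      return hweight16(val) & 1;
  }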

An alternative would have been to use the __builtin_parity() function
provided by both Clang and GCC; however, under some circumstances the
compiler can choose not to inline it but to generate a library call, which
is unsupported in the kernel.

No functional changes.

  [ bp: Massage commit message. ]

Signed-off-by: Nikolay Borisov <nik.borisov@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251124142517.1708451-1-nik.borisov@suse.com
2025-11-24 17:00:37 +01:00
James Clark
e6a27290d8 perf: arm_spe: Add support for filtering on data source
FEAT_SPE_FDS adds the ability to filter on the data source of packets.
Like the other existing filters, enable filtering with PMSFCR_EL1.FDS
when any of the filter bits are set.

Each bit position of the 64 bit filter maps to numerical data sources
0-63 described by bits[0:5] in the data source packet (although the full
range of data source is 16 bits so higher value data sources can't be
filtered on). The filter is an OR of all the filter bits, so for example
clearing filter bits 0 and 3 only includes packets from data sources 0
OR 3.

Invert the filter given by userspace so that the default value of 0 is
equivalent to including all values (no filtering). This allows us to
skip adding a new format bit to enable filtering and still support
excluding all data sources which would have been a filter value of 0 if
not for the inversion.

Tested-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-24 15:59:18 +00:00
James Clark
cbbfba4847 perf: Add perf_event_attr::config4
Arm FEAT_SPE_FDS adds the ability to filter on the data source of a
packet using another 64 bits of event filtering control. As the existing
perf_event_attr::configN fields are all used up for the SPE PMU, an
additional field is needed. Add a new 'config4' field.

Reviewed-by: Leo Yan <leo.yan@arm.com>
Tested-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: James Clark <james.clark@linaro.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-24 15:59:18 +00:00
Joakim Zhang
11abb4e87b perf/imx_ddr: Add support for PMU in DB (system interconnects)
There is a PMU in the DB which has the same function as the PMU in the
DDR subsystem; the difference is that the DB PMU only supports the
cycles, axid-read and axid-write events.

e.g.
perf stat -a -e imx8_db0/axid-read,axi_mask=0xMMMM,axi_id=0xDDDD,axi_port=0xPP,axi_channel=0xH/ cmd
perf stat -a -e imx8_db0/axid-write,axi_mask=0xMMMM,axi_id=0xDDDD,axi_port=0xPP,axi_channel=0xH/ cmd

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-24 15:39:05 +00:00
Frank Li
037e8cf671 perf/imx_ddr: Get and enable optional clks
Get and enable optional clks because fsl,imx8dxl-db-pmu has two clocks.

Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-24 15:39:05 +00:00
Frank Li
66db99ffdf perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
Move ida_alloc() from the helper ddr_perf_init() into ddr_perf_probe() to
clarify why ida_free() must be called in the error path.

Add a return value check for ida_alloc().

Rename label 'cpuhp_state_err' to 'idr_free' to make the code clearer,
since two error paths now jump to this label.

Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-24 15:39:05 +00:00
Frank Li
de8209e554 dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
Add compatible strings fsl,imx8qm-ddr-pmu and fsl,imx8qxp-ddr-pmu, which
fall back to fsl,imx8-ddr-pmu, and fsl,imx8dxl-db-pmu (for the data bus
fabric).

Add clocks and clock-names for fsl,imx8dxl-db-pmu and keep the same
restriction for the existing compatible strings.

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-24 15:39:05 +00:00
Heiko Carstens
f5730d44e0 s390: Add stackprotector support
Stackprotector support was previously unavailable on s390 because by
default compilers generate code which is not suitable for the kernel:
the canary value is accessed via thread local storage, where the address
of thread local storage is within access registers 0 and 1.

Using those registers also for the kernel would come with a significant
performance impact and more complicated kernel entry/exit code, since
access register contents would have to be exchanged on every kernel entry
and exit.

With the upcoming gcc 16 release, new compiler options will become available
which allow generating code suitable for the kernel. [1]

Compiler option -mstack-protector-guard=global instructs gcc to generate
stackprotector code that refers to a global stackprotector canary value via
symbol __stack_chk_guard. Access to this value is guaranteed to occur via
larl and lgrl instructions.

Furthermore, compiler option -mstack-protector-guard-record generates a
section containing all code addresses that reference the canary value.

To allow for per task canary values the instructions which load the address
of __stack_chk_guard are patched so they access a lowcore field instead: a
per task canary value is available within the task_struct of each task, and
is written to the per-cpu lowcore location on each context switch.

Also add sanity checks and a debugging option to be consistent with other
kernel code patching mechanisms.

Full debugging output can be enabled with the following kernel command line
options:

debug_stackprotector
bootdebug
ignore_loglevel
earlyprintk
dyndbg="file stackprotector.c +p"

Example debug output:

stackprot: 0000021e402d4eda: c010005a9ae3 -> c01f00070240

where "<insn address>: <old insn> -> <new insn>".

[1] gcc commit 0cd1f03939d5 ("s390: Support global stack protector")

Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:45:21 +01:00
Heiko Carstens
1d7764cfe3 s390/modules: Simplify module_finalize() slightly
Preinitialize the return value, and break out of the for loop in
module_finalize() in case of an error, to get rid of an ifdef.

This makes it easier to add additional code, which may also depend
on config options.

Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:45:21 +01:00
Heiko Carstens
c3d17464f0 s390: Remove KMSG_COMPONENT macro
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel
message catalog" which never made it upstream.

Remove the macro in order to get rid of a pointless indirection. Replace
all users with the string it defines. In almost all cases this leads to a
simple replacement like this:

 - #define KMSG_COMPONENT "appldata"
 - #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
 + #define pr_fmt(fmt) "appldata: " fmt

Except for some special cases this is just mechanical/scripted work.

Acked-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:45:21 +01:00
Heiko Carstens
e950d1f84d s390/percpu: Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU
Since the rework of the kernel virtual address space [1] the module area
and the kernel image are within the same 4GB area. Therefore there is no
need for the weak per cpu workaround for modules anymore. Remove it.

[1] commit c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address spaces")

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:45:20 +01:00
Heiko Carstens
f555d885bf Merge branch 'ap-driver-override' into features
Harald Freudenberger says:

====================

Support for driver override on AP queues.

Add a new sysfs attribute driver_override to the AP queue's directory.
Writing a string to it overrides the default driver determination and the
drivers are matched against this string instead. This overrules the driver
binding determined by the apmask/aqmask bitmask fields. On a write to the
attribute a check is done whether the queue is in use by an mdev device.
If it is, the write is aborted and EBUSY is returned.

As there is existing tooling for this kind of driver_override (see the
driverctl package), the AP bus behavior for re-binding should be
compatible with it. The steps for a driver_override are:
 1) unbind the current driver from the device. For example
    echo "17.0005" > /sys/devices/ap/card17/17.0005/driver/unbind
 2) set the new driver for this device in the sysfs
    driver_override attribute. For example
    echo "vfio_ap" > /sys/devices/ap/card17/17.0005/driver_override
 3) trigger a bus reprobe of this device. For example
    echo "17.0005" > /sys/bus/ap/drivers_probe
With the driverctl package this is more convenient and
the settings are persisted:
  driverctl -b ap set-override 17.0005 vfio_ap
and unset with
  driverctl -b ap unset-override 17.0005

====================

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:43:19 +01:00
Harald Freudenberger
46030379f1 s390/ap: Restrict driver_override versus apmask and aqmask use
Introduce a restriction for the driver_override feature versus apmask
and aqmask:
- driver_override is only allowed when the apmask and aqmask values
  are both the default (=0xffff..ffff).
- apmask and aqmask modifications are only allowed when there is no
  driver_override active on any AP device.
So in the end the user has to choose to either use apmask/aqmask to
divide the AP devices into host owned and vfio owned, or use the
driver_override feature, but not mix these two approaches.

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:43:06 +01:00
Harald Freudenberger
8babcc2b6a s390/ap: Rename mutex ap_perms_mutex to ap_attr_mutex
The mutex ap_perms_mutex was already used not only for protection
of the struct ap_perms ap_perms variable but also for a consistent
update of the AP bus sysfs attributes apmask and aqmask.

So rename this mutex to ap_attr_mutex which better reflects the
current use. This is also a preparation for an upcoming patch which
will use this mutex to lock updates on a new sysfs attribute.

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:43:06 +01:00
Harald Freudenberger
d38a87d7c0 s390/ap: Support driver_override for AP queue devices
Add a new sysfs attribute driver_override to the AP queue's
directory. Writing a string to it overrides the default driver
determination and the drivers are matched against this string
instead. This overrules the driver binding determined by the
apmask/aqmask bitmask fields.

According to the common understanding of how the driver_override
behavior shall work, there is no further checking done, neither of
the string which is given as the override driver nor of whether this
device is currently in use by an mdev device. Another patch may limit
this behavior to refuse mixed usage of the driver_override and
apmask/aqmask features.

As there exists some tooling for this kind of driver_override
(see package driverctl) the AP bus behavior for re-binding
should be compatible to this. The steps for a driver_override are:
 1) unbind the current driver from the device. For example
    echo "17.0005" > /sys/devices/ap/card17/17.0005/driver/unbind
 2) set the new driver for this device in the sysfs
    driver_override attribute. For example
    echo "vfio_ap" > /sys/devices/ap/card17/17.0005/driver_override
 3) trigger a bus reprobe of this device. For example
    echo "17.0005" > /sys/bus/ap/drivers_probe
With the driverctl package this is more convenient and
the settings are persisted:
  driverctl -b ap set-override 17.0005 vfio_ap
and unset with
  driverctl -b ap unset-override 17.0005

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:43:05 +01:00
Harald Freudenberger
6917f434fd s390/ap: Use all-bits-one apmask/aqmask for vfio in_use() checks
For the in_use() check of an updated apmask the host's aqmask
was provided to the vfio function. Similarly, on an update of the
aqmask the host's apmask was provided to the vfio in_use()
function. This led to false results in the check for apmask or
aqmask updates. For example, with only one APQN, when exactly
this card was to be re-assigned back to the host, the in_use()
check did not complain.

The correct behavior is achieved by providing a full aqmask when
an adapter is to be checked and, similarly, a full apmask when a
domain is to be checked for usage.

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-24 11:43:05 +01:00
Geert Uytterhoeven
aaf4e92341 m68k: defconfig: Update defconfigs for v6.18-rc1
  - Drop CONFIG_SCTP_COOKIE_HMAC_SHA1=y (removed in commit
    2f3dd6ec90 ("sctp: Convert cookie authentication to use
    HMAC-SHA256")),
  - Drop CONFIG_BATMAN_ADV_NC=y (removed in commit 87b95082db
    ("batman-adv: remove network coding support")),
  - Enable modular build of the SHA-1 secure hash algorithm (no longer
    auto-enabled since commit 2f3dd6ec90 ("sctp: Convert cookie
    authentication to use HMAC-SHA256")).

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://patch.msgid.link/65e00bcb7b2980278bb087986ee405627aa32d8b.1760360254.git.geert@linux-m68k.org
2025-11-24 11:03:50 +01:00
Yue Haibing
e6a11a526e x86/{boot,mtrr}: Remove unused function declarations
Commits

  28be1b454c ("x86/boot: Remove unused copy_*_gs() functions")
  34d2819f20 ("x86, mtrr: Remove unused mtrr/state.c")

removed the functions but left the prototypes. Remove them.

  [ bp: Merge into a single patch. ]

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251120121037.1479334-1-yuehaibing@huawei.com
2025-11-22 21:26:36 +01:00
Lorenzo Pieralisi
9c1fbc56ca irqchip/gic-its: Rework platform MSI deviceID detection
Current code retrieving platform devices' MSI devID in the GIC ITS MSI
parent helpers suffers from some minor issues:

- It leaks a struct device_node reference
- It is duplicated between GICv3 and GICv5 for no good reason
- It does not use the OF phandle iterator code that simplifies
  the msi-parent property parsing

Consolidate GIC v3 and v5 deviceID retrieval in a function that addresses
the full set of issues in one go by merging GIC v3 and v5 code and
converting the msi-parent parsing loop to the more modern OF phandle
iterator API, fixing the struct device_node reference leak in the process.

Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20251021124103.198419-6-lpieralisi@kernel.org
2025-11-22 17:09:03 +01:00
Lorenzo Pieralisi
4f32612f6a PCI: iproc: Implement MSI controller node detection with of_msi_xlate()
The functionality implemented in the iproc driver in order to detect an
OF MSI controller node is now fully implemented in of_msi_xlate().

Replace the current msi-map/msi-parent parsing code with of_msi_xlate().

Since of_msi_xlate() is also a deviceID mapping API, pass in a fictitious
0 as deviceID; the driver only requires detecting the OF MSI controller
node, not the deviceID mapping per se (the of_msi_xlate() return value is
ignored for the same reason).

Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20251021124103.198419-5-lpieralisi@kernel.org
2025-11-22 17:09:03 +01:00
Thomas Gleixner
ebb922c920 Merge tag 'v6.18-rc3' into irq/msi
Pick up OF changes to resolve dependencies
2025-11-22 17:07:57 +01:00
Babu Moger
ac7de456a3 fs/resctrl: Update bit_usage to reflect io_alloc
The "shareable_bits" and "bit_usage" resctrl files associated with cache
resources give insight into how instances of a cache are used.

Update the annotated capacity bitmasks displayed by "bit_usage" to include the
cache portions allocated for I/O via the "io_alloc" feature. "shareable_bits"
is a global bitmask of cache shareable with I/O and thus cannot present the
per-domain I/O allocations possible with the "io_alloc" feature. Revise the
"shareable_bits" documentation to direct users to "bit_usage" for accurate
cache usage information.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/e02a0d424129fd7f3e45822a559b1c614ae4652a.1762995456.git.babu.moger@amd.com
2025-11-22 14:30:34 +01:00
Babu Moger
28fa2cce7a fs/resctrl: Introduce interface to modify io_alloc capacity bitmasks
The io_alloc feature in resctrl enables system software to configure the
portion of the cache allocated for I/O traffic. When supported, the
io_alloc_cbm file in resctrl provides access to capacity bitmasks (CBMs)
allocated for I/O devices.

Enable users to modify io_alloc CBMs by writing to the io_alloc_cbm resctrl
file when the io_alloc feature is enabled.

Mirror the CBMs between CDP_CODE and CDP_DATA when CDP is enabled to present
consistent I/O allocation information to user space.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/67609641b03ccfba18a8ee0bf9dbd1f3dcbecda3.1762995456.git.babu.moger@amd.com
2025-11-22 14:28:31 +01:00
Babu Moger
af1242eeca fs/resctrl: Modify struct rdt_parse_data to pass mode and CLOSID
parse_cbm() requires resource group mode and CLOSID to validate the capacity
bitmask (CBM). It is passed via struct rdtgroup in struct rdt_parse_data.

The io_alloc feature also uses CBMs to indicate which portions of cache are
allocated for I/O traffic. The CBMs are provided by user space and need to be
validated the same as CBMs provided for general (CPU) cache allocation.
parse_cbm() cannot be used as-is since io_alloc does not have rdtgroup context.

Pass the resource group mode and CLOSID directly to parse_cbm() via struct
rdt_parse_data, instead of through the rdtgroup struct, to facilitate calling
parse_cbm() to verify the CBM of the io_alloc feature.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/f8ec6ab5cf594d906a3fe75f56793d5fbd63f38f.1762995456.git.babu.moger@amd.com
2025-11-22 13:10:12 +01:00
Babu Moger
77b6623262 fs/resctrl: Introduce interface to display io_alloc CBMs
Introduce the "io_alloc_cbm" resctrl file to display the capacity bitmasks
(CBMs) that represent the portions of each cache instance allocated
for I/O traffic on a cache resource that supports the "io_alloc" feature.

io_alloc_cbm resides in the info directory of a cache resource, for example,
/sys/fs/resctrl/info/L3/. Since the resource name is part of the path, it
is not necessary to display the resource name as done in the schemata file.

When CDP is enabled, io_alloc routes traffic using the highest CLOSID
associated with the CDP_CODE resource and that CLOSID becomes unusable for
the CDP_DATA resource. The highest CLOSID of CDP_CODE and CDP_DATA resources
will be kept in sync to ensure consistent user interface. In preparation for
this, access the CBMs for I/O traffic through highest CLOSID of either
CDP_CODE or CDP_DATA resource.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/55a3ff66a70e7ce8239f022e62b334e9d64af604.1762995456.git.babu.moger@amd.com
2025-11-22 11:37:21 +01:00
Frederic Weisbecker
3de5e46e50 genirq: Remove cpumask availability check on kthread affinity setting
Failing to allocate the affinity mask of an interrupt descriptor fails the
whole descriptor initialization. It is then guaranteed that the cpumask is
always available whenever the related interrupt objects are alive, such as
the kthread handler.

Therefore remove the superfluous check since it is merely a historical
leftover. Get rid also of the comments above it that are obsolete and
useless.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251121143500.42111-4-frederic@kernel.org
2025-11-22 09:26:18 +01:00
Frederic Weisbecker
801afdfbfc genirq: Fix interrupt threads affinity vs. cpuset isolated partitions
When a cpuset isolated partition is created, updated or destroyed, the
interrupt threads are blindly affined to all the non-isolated CPUs. This
happens without taking into account the interrupt threads' initial
affinity, which is then ignored.

For example in a system with 8 CPUs, if an interrupt and its kthread are
initially affine to CPU 5, creating an isolated partition with only CPU 2
inside will eventually end up affining the interrupt kthread to all CPUs
but CPU 2 (that is CPUs 0,1,3-7), losing the kthread preference for CPU 5.

Besides the blind re-affining, this doesn't take care of the actual low
level interrupt, which isn't migrated. As of today the only way to isolate
non-managed interrupts, along with their kthreads, is to override their
affinity separately, for example through /proc/irq/

To avoid doing that manually, future development should focus on updating
the interrupt's affinity whenever cpuset isolated partitions are updated.

In the meantime, cpuset shouldn't fiddle with interrupt threads directly.
To prevent from that, set the PF_NO_SETAFFINITY flag to them.

This is done through kthread_bind_mask() by affining them initially to all
possible CPUs as at that point the interrupt is not started up which means
the affinity of the hard interrupt is not known. The thread will adjust
that once it reaches the handler, which is guaranteed to happen after the
initial affinity of the hard interrupt is established.
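
A rough sketch of that binding pattern (the thread function and naming are
illustrative, not the exact genirq change); kthread_bind_mask() is also
what sets PF_NO_SETAFFINITY:

  #include <linux/cpumask.h>
  #include <linux/err.h>
  #include <linux/kthread.h>

  static struct task_struct *create_irq_thread_sketch(int (*threadfn)(void *),
                                                      void *data, int irq)
  {
      struct task_struct *t;

      t = kthread_create(threadfn, data, "irq/%d-sketch", irq);
      if (IS_ERR(t))
          return t;

      /* Hard interrupt affinity is not known yet: bind to all possible CPUs. */
      kthread_bind_mask(t, cpu_possible_mask);
      return t;
  }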

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251121143500.42111-3-frederic@kernel.org
2025-11-22 09:26:18 +01:00
Frederic Weisbecker
68775ca79a genirq: Prevent early spurious wake-ups of interrupt threads
During initialization, the interrupt thread is created before the interrupt
is enabled. The interrupt enablement happens before the actual kthread wake
up point. Once the interrupt is enabled the hardware can raise an interrupt
and once setup_irq() drops the descriptor lock an interrupt wake-up can
happen.

Even when such an interrupt can be considered premature, this is not a
problem in general because at the point where the descriptor lock is
dropped and the wakeup can happen, the data which is used by the thread is
fully initialized.

Though from the perspective of least surprise, the initial wakeup really
should be performed by the setup code and not randomly by a premature
interrupt.

Prevent this by performing a wake-up only if the target is in state
TASK_INTERRUPTIBLE, which the thread uses in wait_for_interrupt().

If the thread is still in state TASK_UNINTERRUPTIBLE, the wake-up is not
lost because after the setup code completed the initial wake-up the thread
will observe the IRQTF_RUNTHREAD and proceed with the handling.
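
In code terms this is roughly the difference between an unconditional
wake-up and a state-qualified one (a sketch, not the actual irq manage
code):

  #include <linux/sched.h>

  /*
   * Only wake the thread if it already sleeps in wait_for_interrupt()
   * (TASK_INTERRUPTIBLE). A thread still in TASK_UNINTERRUPTIBLE during
   * setup is left alone; it observes IRQTF_RUNTHREAD once the setup code
   * performs the initial wake-up.
   */
  static void wake_irq_thread_sketch(struct task_struct *t)
  {
      wake_up_state(t, TASK_INTERRUPTIBLE);
  }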

[ tglx: Simplified the changes and extended the changelog. ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251121143500.42111-2-frederic@kernel.org
2025-11-22 09:26:18 +01:00
Babu Moger
9445c7059c fs/resctrl: Add user interface to enable/disable io_alloc feature
AMD's SDCIAE forces all SDCI lines to be placed into the L3 cache portions
identified by the highest-supported L3_MASK_n register, where n is the maximum
supported CLOSID.

To support this, when the io_alloc resctrl feature is enabled, reserve the
highest CLOSID exclusively for I/O allocation traffic, making it no longer
available for general CPU cache allocation.

Introduce a user interface to enable/disable the io_alloc feature and
encourage users to enable io_alloc only when running workloads that can
benefit from this functionality. On enable, initialize the io_alloc CLOSID
with all usable CBMs across all the domains.

Since CLOSIDs are managed by resctrl fs, it is least invasive to make "io_alloc
is supported by maximum supported CLOSID" part of the initial resctrl fs
support for io_alloc. Take care to minimally (only in error messages) expose
this use of CLOSID for io_alloc to user space so that this is not required from
other architectures that may support io_alloc differently in the future.

When resctrl is mounted with "-o cdp" to enable code/data prioritization,
there are two L3 resources that can support I/O allocation: L3CODE and
L3DATA. From the resctrl fs perspective the two resources share a CLOSID and
the architecture's available CLOSIDs are halved to support this.

The architecture's underlying CLOSID used by SDCIAE when CDP is enabled is the
CLOSID associated with the CDP_CODE resource, but from resctrl's perspective
there is only one CLOSID for both CDP_CODE and CDP_DATA. CDP_DATA is thus not
usable for general (CPU) cache allocation nor I/O allocation.

Keep the CDP_CODE and CDP_DATA I/O alloc status in sync to avoid any confusion
to user space.  That is, enabling io_alloc on CDP_CODE does so on CDP_DATA and
vice-versa, and keep the I/O allocation CBMs of CDP_CODE and CDP_DATA in sync.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/c7d3037795e653e22b02d8fc73ca80d9b075031c.1762995456.git.babu.moger@amd.com
2025-11-21 23:01:54 +01:00
Babu Moger
48068e5650 fs/resctrl: Introduce interface to display "io_alloc" support
Introduce the "io_alloc" resctrl file to the "info" area of a cache resource,
for example /sys/fs/resctrl/info/L3/io_alloc. "io_alloc" indicates support for
the "io_alloc" feature that allows direct insertion of data from I/O
devices into the cache.

Restrict exposing support for "io_alloc" to the L3 resource that is the only
resource where this feature can be backed by AMD's L3 Smart Data Cache
Injection Allocation Enforcement (SDCIAE). With that, the "io_alloc" file is
only visible to user space if the L3 resource supports "io_alloc".

Doing so makes the file visible for all cache resources though, for example
also L2 cache (if it supports cache allocation). As a consequence, add
capability for file to report expected "enabled" and "disabled", as well as
"not supported".

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/e8b116a8f424128b227734bb1d433c14af478d90.1762995456.git.babu.moger@amd.com
2025-11-21 22:49:42 +01:00
Babu Moger
556d2892aa x86,fs/resctrl: Implement "io_alloc" enable/disable handlers
"io_alloc" is the generic name of the new resctrl feature that enables system
software to configure the portion of cache allocated for I/O traffic. On AMD
systems, "io_alloc" resctrl feature is backed by AMD's L3 Smart Data Cache
Injection Allocation Enforcement (SDCIAE).

Introduce the architecture-specific functions that resctrl fs should call to
enable, disable, or check status of the "io_alloc" feature. Change SDCIAE state
by setting (to enable) or clearing (to disable) bit 1 of
MSR_IA32_L3_QOS_EXT_CFG on all logical processors within the cache domain.
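
A minimal sketch of such a toggle (the MSR name and bit come from the
description above; the helper name and the use of msr_set_bit()/
msr_clear_bit() are illustrative, not the actual implementation):

  #include <asm/msr.h>

  /* Run on each CPU of the cache domain, e.g. via on_each_cpu_mask(). */
  static void sdciae_write_local(void *enable)
  {
      if (enable)
          msr_set_bit(MSR_IA32_L3_QOS_EXT_CFG, 1);
      else
          msr_clear_bit(MSR_IA32_L3_QOS_EXT_CFG, 1);
  }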

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/9e9070100c320eab5368e088a3642443dee95ed7.1762995456.git.babu.moger@amd.com
2025-11-21 22:35:22 +01:00
Babu Moger
7923ae7698 x86,fs/resctrl: Detect io_alloc feature
AMD's SDCIAE (SDCI Allocation Enforcement) PQE feature enables system software
to control the portions of L3 cache used for direct insertion of data from I/O
devices into the L3 cache.

Introduce a generic resctrl cache resource property "io_alloc_capable" as the
first part of the new "io_alloc" resctrl feature that will support AMD's
SDCIAE. Any architecture can set a cache resource as "io_alloc_capable" if
a portion of the cache can be allocated for I/O traffic.

Set the "io_alloc_capable" property for the L3 cache resource on x86 (AMD)
systems that support SDCIAE.

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/df85a9a6081674fd3ef6b4170920485512ce2ded.1762995456.git.babu.moger@amd.com
2025-11-21 22:04:59 +01:00
Babu Moger
4d4840b125 x86/resctrl: Add SDCIAE feature in the command line options
Add a kernel command-line parameter to enable or disable the exposure of
the L3 Smart Data Cache Injection Allocation Enforcement (SDCIAE) hardware
feature to resctrl.
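
The parameter's exact spelling is not given here; assuming it follows the
existing rdt= feature-list convention (an assumption, not confirmed above),
usage would look something like:

  rdt=sdciae       # force the feature on
  rdt=!sdciae      # force the feature off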

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://patch.msgid.link/c623edf7cb369ba9da966de47d9f1b666778a40e.1762995456.git.babu.moger@amd.com
2025-11-21 22:03:23 +01:00
Babu Moger
3767def18f x86/cpufeatures: Add support for L3 Smart Data Cache Injection Allocation Enforcement
Smart Data Cache Injection (SDCI) is a mechanism that enables direct insertion
of data from I/O devices into the L3 cache. By directly caching data from I/O
devices rather than first storing the I/O data in DRAM, SDCI reduces demands on
DRAM bandwidth and reduces latency to the processor consuming the I/O data.

The SDCIAE (SDCI Allocation Enforcement) PQE feature allows system software to
control the portion of the L3 cache used for SDCI.

When enabled, SDCIAE forces all SDCI lines to be placed into the L3 cache
partitions identified by the highest-supported L3_MASK_n register, where n is
the maximum supported CLOSID.

Add CPUID feature bit that can be used to configure SDCIAE.

The SDCIAE feature details are documented in:

  AMD64 Architecture Programmer's Manual Volume 2: System Programming
  Publication # 24593 Revision 3.41 section 19.4.7 L3 Smart Data Cache
  Injection Allocation Enforcement (SDCIAE).

available at https://bugzilla.kernel.org/show_bug.cgi?id=206537

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/83ca10d981c48e86df2c3ad9658bb3ba3544c763.1762995456.git.babu.moger@amd.com
2025-11-21 22:03:07 +01:00
Smita Koralahalli
5c4663ed1e x86/mce: Handle AMD threshold interrupt storms
Extend the logic of handling CMCI storms to AMD threshold interrupts.

Rely on an approach similar to Intel's CMCI to mitigate storms per CPU and
per bank. But, unlike CMCI, do not set thresholds to reduce the interrupt rate
during a storm. Rather, disable the interrupt on the corresponding CPU and bank.
Re-enable the interrupts once enough consecutive polls of the bank show no
corrected errors (30, the same count used for Intel).

Turning off the threshold interrupts would be a better solution on AMD systems
as other error severities will still be handled even if the threshold
interrupts are disabled.

  [ Tony: Small tweak because mce_handle_storm() isn't a pointer now ]
  [ Yazen: Rebase and simplify ]
  [ Avadhut: Remove check to not clear bank's bit in mce_poll_banks and fix
    checkpatch warnings. ]

Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251121190542.2447913-3-avadhut.naik@amd.com
2025-11-21 20:41:10 +01:00
Avadhut Naik
d7ac083f09 x86/mce: Do not clear bank's poll bit in mce_poll_banks on AMD SMCA systems
Currently, when a CMCI storm detected on a Machine Check bank subsides, the
bank's corresponding bit in the mce_poll_banks per-CPU variable is cleared
unconditionally by cmci_storm_end().

On AMD SMCA systems, this essentially disables polling on that particular bank
on that CPU. Consequently, any subsequent correctable errors or storms will not
be logged.

Since AMD SMCA systems allow banks to be managed by both polling and
interrupts, the polling banks bitmap for a CPU, i.e., mce_poll_banks, should
not be modified when a storm subsides.

Fixes: 7eae17c4ad ("x86/mce: Add per-bank CMCI storm mitigation")
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251121190542.2447913-2-avadhut.naik@amd.com
2025-11-21 20:33:12 +01:00
Ma Ke
ef1b6d9049 EDAC/igen6: Fix error handling in igen6_edac driver
The igen6_edac driver calls device_initialize() for all memory
controllers in igen6_register_mci(), but misses corresponding
put_device() calls in error paths and during normal shutdown in
igen6_unregister_mcis().

Adding the missing put_device() calls improves code readability and
ensures proper reference counting for the device structure.

Found by code review.

Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251105090244.23327-1-make24@iscas.ac.cn
2025-11-21 10:20:51 -08:00
Qiuxu Zhuo
5f40ea7f41 EDAC/imh: Setup 'imh_test' debugfs testing node
Set up the following debugfs testing node to enable fake memory error
address decoding tests for the imh_edac driver.

  /sys/kernel/debug/edac/imh_test/addr
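
Assuming the node mirrors the fake-address injection used by the existing
Intel EDAC test nodes (an assumption, not stated above), a decode test would
look something like:

  # echo 0x12345678 > /sys/kernel/debug/edac/imh_test/addr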

Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-8-qiuxu.zhuo@intel.com
2025-11-21 10:20:51 -08:00
Qiuxu Zhuo
f619613f30 EDAC/{skx_comm,imh}: Detect 2-level memory configuration
Detect 2-level memory configurations and notify the 'skx_common' library
to enable ADXL 2-level memory error decoding.

Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-7-qiuxu.zhuo@intel.com
2025-11-21 10:20:51 -08:00
Qiuxu Zhuo
39abdcbdad EDAC/skx_common: Extend the maximum number of DRAM chip row bits
The maximum number of row bits for DRAM chips in the Diamond Rapids
server processor is 19. Extend the current maximum number of row bits
from 18 to 19.

Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-6-qiuxu.zhuo@intel.com
2025-11-21 10:20:51 -08:00
Qiuxu Zhuo
9fc67b1170 EDAC/{skx_common,imh}: Add EDAC driver for Intel Diamond Rapids servers
Intel Diamond Rapids CPUs include Integrated Memory and I/O Hubs (IMH).
The memory controllers within the IMHs provide memory stacks to the
processor. Create a new driver for these IMH-based memory controllers
rather than applying additional patches to the existing i10nm_edac.c
for the following reasons:

1) The memory controllers are not presented as PCI devices; instead,
   the detection and all their registers have been transitioned to
   MMIO-based memory spaces.

2) Validation processes are costly. Modifications to i10nm_edac would
   require extensive validation checks against multiple platforms,
   including Ice Lake, Sapphire Rapids, Emerald Rapids, Granite Rapids,
   Sierra Forest, and Grand Ridge.

3) Future Intel CPUs will likely only need patches on top of this new
   EDAC driver. Validation can be limited to Diamond Rapids servers
   and future Intel CPU generations.

[Tony: Fix kerneldoc for struct local_reg]
[randconfig: Added dependencies on NFIT and DMI]

Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-5-qiuxu.zhuo@intel.com
2025-11-21 10:19:43 -08:00
Avadhut Naik
821f5fe4db x86/mce: Add support for physical address valid bit
Starting with Zen6, AMD's Scalable MCA systems will incorporate two new bits in
MCA_STATUS and MCA_CONFIG MSRs. These bits will indicate if a valid System
Physical Address (SPA) is present in MCA_ADDR.

PhysAddrValidSupported bit (MCA_CONFIG[11]) serves as the architectural
indicator and states if PhysAddrV bit (MCA_STATUS[54]) is Reserved or if it
indicates validity of SPA in MCA_ADDR.

PhysAddrV bit (MCA_STATUS[54]) advertises if MCA_ADDR contains valid SPA or if
it is implementation specific.

Use and prefer MCA_STATUS[PhysAddrV] when checking for a usable address.
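
A sketch of the check implied above (the macro and function names are
illustrative; only the bit positions come from this description):

  #define MCI_CONFIG_PHYSADDRV_SUP   BIT_ULL(11)   /* MCA_CONFIG[11] */
  #define MCI_STATUS_PHYSADDRV       BIT_ULL(54)   /* MCA_STATUS[54] */

  /* Prefer PhysAddrV when the CPU advertises support for it. */
  static bool mce_addr_is_spa(u64 mca_config, u64 mca_status)
  {
          return (mca_config & MCI_CONFIG_PHYSADDRV_SUP) &&
                 (mca_status & MCI_STATUS_PHYSADDRV);
  }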

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251118191731.181269-1-avadhut.naik@amd.com
2025-11-21 10:32:28 +01:00
Yazen Ghannam
eeb3f76d73 x86/mce: Save and use APEI corrected threshold limit
The MCA threshold limit generally is not something that needs to change during
runtime. It is common for a system administrator to decide on a policy for
their managed systems.

If MCA thresholding is OS-managed, then the threshold limit must be set at
every boot. However, many systems allow the user to set a value in their BIOS.
And this is reported through an APEI HEST entry even if thresholding is not in
FW-First mode.

Use this value, if available, to set the OS-managed threshold limit.  Users
can still override it through sysfs if desired for testing or debug.

APEI is parsed after MCE is initialized. So reset the thresholding blocks
later to pick up the threshold limit.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-21 10:32:28 +01:00
Ard Biesheuvel
a3e6907128 x86/boot: Drop unused sev_enable() fallback
The misc.h header is not included by the EFI stub, which is the only
C caller of sev_enable(). This means the fallback for cases where
CONFIG_AMD_MEM_ENCRYPT is not set is never used, so it can be dropped.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://patch.msgid.link/20250909080631.2867579-6-ardb+git@google.com
2025-11-20 21:12:48 +01:00
Gabriele Monaco
7dec062cfc timers/migration: Exclude isolated cpus from hierarchy
The timer migration mechanism allows active CPUs to pull timers from
idle ones to improve the overall idle time. This is however undesired
when CPU intensive workloads run on isolated cores, as the algorithm
would move the timers from housekeeping to isolated cores, negatively
affecting the isolation.

Exclude isolated cores from the timer migration algorithm by extending the
concept of unavailable cores, currently used for offline ones, to
isolated ones:
* A core is unavailable if isolated or offline;
* A core is available if non isolated and online;

A core is considered unavailable as isolated if it belongs to:
* the isolcpus (domain) list
* an isolated cpuset
Except if it is:
* in the nohz_full list (already idle for the hierarchy)
* the nohz timekeeper core (must be available to handle global timers)

CPUs are added to the hierarchy during late boot, excluding isolated
ones; the hierarchy is also adapted when the cpuset isolation changes.

Due to how the timer migration algorithm works, any CPU that is part of the
hierarchy can have its global timers pulled by remote CPUs and has to
pull remote timers itself; only skipping the pulling of remote timers would
break the logic.
For this reason, prevent isolated CPUs from pulling remote global
timers, but also the other way around: any global timer started on an
isolated CPU will run there. This does not break the concept of
isolation (global timers don't come from outside the CPU) and, if
considered inappropriate, can usually be mitigated with other isolation
techniques (e.g. IRQ pinning).
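
The classification above roughly reduces to a predicate like the following
sketch (the helpers are existing kernel isolation APIs; the timekeeper-CPU
exception is left out and the function name is made up):

  static bool tmigr_cpu_isolated(int cpu)
  {
          /* nohz_full CPUs are already treated as idle by the hierarchy */
          if (tick_nohz_full_cpu(cpu))
                  return false;

          /* isolcpus (domain) list or an isolated cpuset */
          return !housekeeping_cpu(cpu, HK_TYPE_DOMAIN) ||
                 cpuset_cpu_is_isolated(cpu);
  }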

This effect was noticed on a 128-core machine running oslat on the
isolated cores (1-31,33-63,65-95,97-127). The tool monopolises CPUs,
and the CPU with the lowest count in a timer migration hierarchy (here 1
and 65) appears as always active and continuously pulls global timers
from the housekeeping CPUs. This ends up moving driver work (e.g.
delayed work) to isolated CPUs and causes latency spikes:

before the change:

 # oslat -c 1-31,33-63,65-95,97-127 -D 62s
 ...
  Maximum:     1203 10 3 4 ... 5 (us)

after the change:

 # oslat -c 1-31,33-63,65-95,97-127 -D 62s
 ...
  Maximum:      10 4 3 4 3 ... 5 (us)

The same behaviour was observed with rtla-osnoise-top on a machine with as
few as 20 cores / 40 threads, with isolcpus set to 1-9,11-39.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: John B. Wyatt IV <jwyatt@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20251120145653.296659-8-gmonaco@redhat.com
2025-11-20 20:17:32 +01:00
Yury Norov
b56651007f cpumask: Add initialiser to use cleanup helpers
Now we can simplify code that allocates cpumasks for local needs.

Automatic variables have to be initialized at declaration, or at least
before any possibility for the logic to return, so that the compiler
doesn't call the associated destructor function on a random stack
value.

Because cpumask_var_t, depending on the CPUMASK_OFFSTACK config, is
either a pointer or an array, we have to have a macro for initialization.

So define a CPUMASK_VAR_NULL macro, which initialises the struct cpumask
pointer to NULL when CPUMASK_OFFSTACK is enabled, and is effectively a
no-op when CPUMASK_OFFSTACK is disabled (the initialisation is optimised
out with -O2).
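
Illustrative use with the cleanup helpers (a sketch, assuming a cleanup class
paired with free_cpumask_var() is available):

  cpumask_var_t mask __free(free_cpumask_var) = CPUMASK_VAR_NULL;

  if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
          return -ENOMEM;
  /* use 'mask'; it is freed automatically on every return path */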

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20251120145653.296659-7-gmonaco@redhat.com
2025-11-20 20:17:32 +01:00
Gabriele Monaco
185bccc797 sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any
Currently the user can set up isolcpus and nohz_full in such a way that
leaves no housekeeping CPU (i.e. no CPU that is neither domain isolated
nor nohz full). This can be a problem for other subsystems (e.g. the
timer wheel migration).

Prevent this configuration by invalidating the last setting in case the
union of isolcpus (domain) and nohz_full covers all CPUs.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://patch.msgid.link/20251120145653.296659-6-gmonaco@redhat.com
2025-11-20 20:17:31 +01:00
Gabriele Monaco
22f8e41680 cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks()
update_unbound_workqueue_cpumask() updates unbound workqueue settings
when there's a change in isolated CPUs, but it can also be used for other
subsystems requiring updates when isolated CPUs change.

Generalise the name to update_isolation_cpumasks() to prepare for other
functions unrelated to workqueues to be called in that spot.

[longman: Change the function name to update_isolation_cpumasks()]

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20251120145653.296659-5-gmonaco@redhat.com
2025-11-20 20:17:31 +01:00
Gabriele Monaco
4c2374ed86 timers/migration: Use scoped_guard on available flag set/clear
Cleanup tmigr_clear_cpu_available() and tmigr_set_cpu_available() to
prepare for easier checks on the available flag.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251120145653.296659-4-gmonaco@redhat.com
2025-11-20 20:17:31 +01:00
Gabriele Monaco
a048ca5f00 timers/migration: Add mask for CPUs available in the hierarchy
Keep track of the CPUs available for timer migration in a cpumask. This
prepares the ground to generalise the concept of unavailable CPUs.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251120145653.296659-3-gmonaco@redhat.com
2025-11-20 20:17:31 +01:00
Gabriele Monaco
8312cab5ff timers/migration: Rename 'online' bit to 'available'
The timer migration hierarchy excludes offline CPUs via the
tmigr_is_not_available function, which is essentially checking the
online bit for the CPU.

Rename the online bit to available, and all references in function names
and tracepoints, to generalise the concept of available CPUs.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251120145653.296659-2-gmonaco@redhat.com
2025-11-20 20:17:31 +01:00
Cai Xinchen
f20810157f arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
Commit e7bafbf717 ("arm64: mm: Add top-level dispatcher for
internal mem_encrypt API") adds ARCH_HAS_MEM_ENCRYPT. Then
commit 42be24a417 ("arm64: Enable memory encrypt for Realms") adds
the config a second time. Just remove the duplicate.

Fixes: 42be24a417 ("arm64: Enable memory encrypt for Realms")
Signed-off-by: Cai Xinchen <caixinchen1@huawei.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-20 17:57:23 +00:00
Yang Shi
a06494adb7 arm64: mm: use untagged address to calculate page index
Nathan Chancellor reported the below bug:

[    0.149929] BUG: KASAN: invalid-access in change_memory_common+0x258/0x2d0
[    0.151006] Read of size 8 at addr f96680000268a000 by task swapper/0/1
[    0.152031]
[    0.152274] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-rc1-00012-g37cb0aab9068 #1 PREEMPT
[    0.152288] Hardware name: linux,dummy-virt (DT)
[    0.152292] Call trace:
[    0.152295]  show_stack+0x18/0x30 (C)
[    0.152309]  dump_stack_lvl+0x60/0x80
[    0.152320]  print_report+0x480/0x498
[    0.152331]  kasan_report+0xac/0xf0
[    0.152343]  kasan_check_range+0x90/0xb0
[    0.152353]  __hwasan_load8_noabort+0x20/0x34
[    0.152364]  change_memory_common+0x258/0x2d0
[    0.152375]  set_memory_ro+0x18/0x24
[    0.152386]  bpf_prog_pack_alloc+0x200/0x2e8
[    0.152397]  bpf_jit_binary_pack_alloc+0x78/0x188
[    0.152409]  bpf_int_jit_compile+0xa4c/0xc74
[    0.152420]  bpf_prog_select_runtime+0x1c0/0x2bc
[    0.152430]  bpf_prepare_filter+0x5a4/0x7c0
[    0.152443]  bpf_prog_create+0xa4/0x100
[    0.152454]  ptp_classifier_init+0x80/0xd0
[    0.152465]  sock_init+0x12c/0x178
[    0.152474]  do_one_initcall+0xa0/0x260
[    0.152484]  kernel_init_freeable+0x2d8/0x358
[    0.152495]  kernel_init+0x20/0x140
[    0.152510]  ret_from_fork+0x10/0x20

This is because the KASAN-tagged address was used when calculating the page
index. The untagged address should be used instead.
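
The fix in essence, as a sketch (the variable names are illustrative;
untagged_addr() is the existing arm64 helper):

  /* wrong: the address still carries the KASAN/MTE tag */
  idx = (addr - base) >> PAGE_SHIFT;

  /* fixed: strip the tag before computing the page index */
  idx = (untagged_addr(addr) - base) >> PAGE_SHIFT;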

Fixes: 37cb0aab90 ("arm64: mm: make linear mapping permission update more robust for patial range")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-20 17:57:19 +00:00
Thomas Gleixner
79c11fb3da sched/mmcid: Use cpumask_weighted_or()
Use cpumask_weighted_or() instead of cpumask_or() plus cpumask_weight() on
the result, which walks the same bitmap twice. This results in 10-20% fewer
cycles, which reduces the runqueue lock hold time.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Link: https://patch.msgid.link/20251119172549.511736272@linutronix.de
2025-11-20 12:14:54 +01:00
Thomas Gleixner
437cb3ded2 cpumask: Introduce cpumask_weighted_or()
CID management ORs two cpumasks and then calculates the weight of the
result. That's inefficient as it has to walk the same bitmap twice. As
this is done with the runqueue lock held, there is a real benefit in speeding
this up. Depending on the system this results in 10-20% fewer cycles spent
with the runqueue lock held for a 4K cpumask.

Provide cpumask_weighted_or() and the corresponding bitmap functions which
return the weight of the OR result right away.
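
The intended call pattern, as a sketch (assuming the signature mirrors
cpumask_or() with the weight returned):

  /* before: walks the destination bitmap a second time */
  cpumask_or(dst, a, b);
  weight = cpumask_weight(dst);

  /* after: one walk, weight returned directly */
  weight = cpumask_weighted_or(dst, a, b);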

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.448263340@linutronix.de
2025-11-20 12:14:54 +01:00
Thomas Gleixner
0d032a43eb sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()
mm_update_cpus_allowed() is not required to be invoked for affinity changes
due to migrate_disable() and migrate_enable().

migrate_disable() restricts the task temporarily to a CPU on which the task
was already allowed to run, so nothing changes. migrate_enable() restores
the actual task affinity mask.

If that mask changed between migrate_disable() and migrate_enable() then
that change was already accounted for.

Move the invocation to the proper place to avoid that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.385208276@linutronix.de
2025-11-20 12:14:54 +01:00
Thomas Gleixner
b08ef5fc8f sched/mmcid: Move scheduler code out of global header
This is only used in the scheduler core code, so there is no point to have
it in a global header.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Link: https://patch.msgid.link/20251119172549.321259077@linutronix.de
2025-11-20 12:14:53 +01:00
Thomas Gleixner
925b7847bb sched: Fixup whitespace damage
With whitespace checks enabled in the editor this makes eyes bleed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.258651925@linutronix.de
2025-11-20 12:14:53 +01:00
Thomas Gleixner
be4463fa2c sched/mmcid: Cacheline align MM CID storage
Both the per CPU storage and the data in mm_struct are heavily used in
context switch. As they can end up next to other frequently modified data,
they are subject to false sharing.

Make them cache line aligned.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.194111661@linutronix.de
2025-11-20 12:14:53 +01:00
Thomas Gleixner
8cea569ca7 sched/mmcid: Use proper data structures
Having a lot of CID functionality specific members in struct task_struct
and struct mm_struct is not really making the code easier to read.

Encapsulate the CID specific parts in data structures and keep them
separate from the stuff they are embedded in.

No functional change.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.131573768@linutronix.de
2025-11-20 12:14:52 +01:00
Thomas Gleixner
77d7dc8bef sched/mmcid: Revert the complex CID management
The CID management is a complex beast, which affects both scheduling and
task migration. The compaction mechanism forces random tasks of a process
into task work on exit to user space, causing latency spikes.

Revert to the initial simple bitmap allocation mechanics, which are
known to have scalability issues, as that allows a replacement to be built
up gradually in a reviewable way.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.068197830@linutronix.de
2025-11-20 12:14:52 +01:00
Qiuxu Zhuo
d4839582bc EDAC/skx_common: Prepare for skx_set_hi_lo()
The upcoming imh_edac driver for Intel Diamond Rapids servers cannot
use skx_get_hi_lo() in skx_common to retrieve the TOHM (Top of High
Memory) and TOLM (Top of Low Memory) parameters. Instead, it obtains
these parameters within its own EDAC driver. To accommodate this,
prepare skx_set_hi_lo() to allow the driver to notify skx_common of
these parameters.

Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-4-qiuxu.zhuo@intel.com
2025-11-19 12:11:40 -08:00
Qiuxu Zhuo
9529e69773 EDAC/skx_common: Prepare for skx_get_edac_list()
The Intel EDAC library 'skx_common' maintains the Intel server EDAC device
list for {skx, i10nm}_edac drivers, which use skx_get_all_bus_mappings()
to build and retrieve the EDAC device list.

However, the upcoming Intel EDAC driver, imh_edac, for Diamond Rapids
servers is designed for memory controllers that are MMIO-based devices
rather than PCI devices. Consequently, it can't use
skx_get_all_bus_mappings() due to the absence of a PCI bus. To accommodate
this, prepare skx_get_edac_list() to enable the upcoming imh_edac driver
to obtain the EDAC device list from the skx_common library and build the
EDAC device list independently.

Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-3-qiuxu.zhuo@intel.com
2025-11-19 12:11:40 -08:00
Qiuxu Zhuo
b3d70059cb EDAC/{skx_common,skx,i10nm}: Make skx_register_mci() independent of pci_dev
Memory controllers in the new Intel server CPUs, such as Diamond Rapids,
are presented as MMIO-based devices rather than PCI devices.
Modify skx_register_mci() to be independent of 'pci_dev' and use a generic
'dev' of 'struct device' to prepare for support of such MMIO-based memory
controllers.

Tested-by: Yi Lai <yi1.lai@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://patch.msgid.link/20251119134132.2389472-2-qiuxu.zhuo@intel.com
2025-11-19 12:11:40 -08:00
Ben Horgan
ce1e1421f8 MAINTAINERS: new entry for MPAM Driver
Create a maintainer entry for the new MPAM Driver. Add myself and
James Morse as maintainers. James created the driver and I have
taken up the later versions of his series.

Cc: James Morse <james.morse@arm.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:24 +00:00
James Morse
2557e0eafe arm_mpam: Add kunit tests for props_mismatch()
When features are mismatched between MSCs, the way features are combined
into the class determines whether resctrl can support the SoC.

Add some tests to illustrate the sort of combinations that are expected to
work, and those that must be removed.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:24 +00:00
James Morse
e3565d1fd4 arm_mpam: Add kunit test for bitmap reset
The bitmap reset code has been a source of bugs. Add a unit test.

This currently has to be built in, as the rest of the driver is
builtin.

Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:24 +00:00
James Morse
201d96ca4c arm_mpam: Add helper to reset saved mbwu state
resctrl expects to reset the bandwidth counters when the filesystem
is mounted.

To allow this, add a helper that clears the saved mbwu state. Instead
of cross-calling to each CPU that can access the component MSC to
write to the counter, set a flag that causes it to be zeroed on the
next read. This is easily done by forcing a configuration update.

Signed-off-by: James Morse <james.morse@arm.com>
Cc: Peter Newman <peternewman@google.com>
Reviewed-by: Fenghua Yu <fenghuay@nvdia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:24 +00:00
Rohit Mathew
9e5afb7c32 arm_mpam: Use long MBWU counters if supported
Now that the larger counter sizes are probed, make use of them.

Callers of mpam_msmon_read() may not know (or care!) about the different
counter sizes. Allow them to specify mpam_feat_msmon_mbwu and have the
driver pick the counter to use.

Only 32bit accesses to the MSC are required to be supported by the
spec, but these registers are 64bits. The lower half may overflow
into the higher half between two 32bit reads. To avoid this, use
a helper that reads the top half multiple times to check for overflow.
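
The classic split-read pattern implied above, as a sketch (not the driver's
actual helper; readl_relaxed() is the standard MMIO accessor):

  static u64 mpam_read_counter64(void __iomem *reg)
  {
          u32 hi, lo, chk;

          do {
                  hi  = readl_relaxed(reg + 4);
                  lo  = readl_relaxed(reg);
                  chk = readl_relaxed(reg + 4);
          } while (hi != chk);   /* lower half carried into the upper half */

          return ((u64)hi << 32) | lo;
  }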

Signed-off-by: Rohit Mathew <rohit.mathew@arm.com>
[morse: merged multiple patches from Rohit, added explicit counter selection ]
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Peter Newman <peternewman@google.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:24 +00:00
Rohit Mathew
fdc29a141d arm_mpam: Probe for long/lwd mbwu counters
MPAM v0.1 and versions above v1.0 support optional long counters for
memory bandwidth monitoring. The MPAMF_MBWUMON_IDR register has fields
indicating support for long counters.

Probe these feature bits.

The mpam_feat_msmon_mbwu feature is used to indicate that bandwidth
monitors are supported. Instead of muddling this with the size of the
bandwidth monitors, add an explicit 31-bit counter feature.

Signed-off-by: Rohit Mathew <rohit.mathew@arm.com>
[ morse: Added 31bit counter feature to simplify later logic ]
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:23 +00:00
Ben Horgan
b353637932 arm_mpam: Consider overflow in bandwidth counter state
Use the overflow status bit to track overflow on each bandwidth counter
read and add the counter size to the correction when overflow is detected.

This assumes that only a single overflow has occurred since the last read
of the counter. Overflow interrupts, on hardware that supports them could
be used to remove this limitation.

Cc: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:23 +00:00
James Morse
41e8a14950 arm_mpam: Track bandwidth counter state for power management
Bandwidth counters need to run continuously to correctly reflect the
bandwidth.

Save the counter state when the hardware is reset due to CPU hotplug.
Add struct mbwu_state to track the bandwidth counter. Support for
tracking overflow with the same structure will be added in a
subsequent commit.

Cc: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Zeng Heng <zengheng4@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:23 +00:00
James Morse
823e7c3712 arm_mpam: Add mpam_msmon_read() to read monitor value
Reading a monitor involves configuring what you want to monitor, and
reading the value. Components made up of multiple MSC may need values
from each MSC. MSCs may take time to configure, returning 'not ready'.
The maximum 'not ready' time should have been provided by firmware.

Add mpam_msmon_read() to hide all this. If (one of) the MSC returns
not ready, then wait the full timeout value before trying again.

CC: Shanker Donthineni <sdonthineni@nvidia.com>
Cc: Shaopeng Tan (Fujitsu) <tan.shaopeng@fujitsu.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:23 +00:00
James Morse
c891bae664 arm_mpam: Add helpers to allocate monitors
MPAM's MSC support a number of monitors, each of which supports
bandwidth counters, or cache-storage-utilisation counters. To use
a counter, a monitor needs to be configured. Add helpers to allocate
and free CSU or MBWU monitors.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:22 +00:00
James Morse
880df85d86 arm_mpam: Probe and reset the rest of the features
MPAM supports more features than are going to be exposed to resctrl.
For partids other than 0, the reset values of these controls aren't
known.

Discover the rest of the features so they can be reset to avoid any
side effects when resctrl is in use.

PARTID narrowing allows MSC/RIS to support less configuration space than
is usable. If this feature is found on a class of device we are likely
to use, then reduce the partid_max to make it usable. This allows us
to map a PARTID to itself.

CC: Rohit Mathew <Rohit.Mathew@arm.com>
CC: Zeng Heng <zengheng4@huawei.com>
CC: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:22 +00:00
James Morse
09b89d2a72 arm_mpam: Allow configuration to be applied and restored during cpu online
When CPUs come online the MSC's original configuration should be restored.

Add struct mpam_config to hold the configuration. For each component, this
has a bitmap of features that have been changed from the reset values. The
mpam_config is also used on RIS reset where all bits are set to ensure all
features are reset.

Once the maximum partid is known, allocate a configuration array for each
component, and reprogram each RIS configuration from this.

CC: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Cc: Peter Newman <peternewman@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:22 +00:00
James Morse
3796f75aa7 arm_mpam: Use a static key to indicate when mpam is enabled
Once all the MSC have been probed, the system wide usable number of
PARTID is known and the configuration arrays can be allocated.

After this point, checking all the MSC have been probed is pointless,
and the cpuhp callbacks should restore the configuration, instead of
just resetting the MSC.

Add a static key to enable this behaviour. This will also allow MPAM
to be disabled in response to an error, and the architecture code to
enable/disable the context switch of the MPAM system registers.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:22 +00:00
James Morse
49aa621c4d arm_mpam: Register and enable IRQs
Register and enable error IRQs. All the MPAM error interrupts indicate a
software bug, e.g. out of range partid. If the error interrupt is ever
signalled, attempt to disable MPAM.

Only the irq handler accesses the MPAMF_ESR register, so no locking is
needed. The work to disable MPAM after an error needs to happen in process
context as it takes a mutex. It also unregisters the interrupts, meaning
it can't be done from the threaded part of a threaded interrupt.
Instead, mpam_disable() gets scheduled.

Enabling the IRQs in the MSC may involve cross calling to a CPU that
can access the MSC.

Once the IRQ is requested, the mpam_disable() path can be called
asynchronously, which will walk structures sized by max_partid. Ensure
this size is fixed before the interrupt is requested.

CC: Rohit Mathew <rohit.mathew@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Rohit Mathew <rohit.mathew@arm.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:22 +00:00
James Morse
3bd04fe7d8 arm_mpam: Extend reset logic to allow devices to be reset any time
cpuhp callbacks aren't the only time the MSC configuration may need to
be reset. Resctrl has an API call to reset a class.
If an MPAM error interrupt arrives it indicates the driver has
misprogrammed an MSC. The safest thing to do is reset all the MSCs
and disable MPAM.

Add a helper to reset RIS via their class. Call this from mpam_disable(),
which can be scheduled from the error interrupt handler.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:22 +00:00
James Morse
475228d15d arm_mpam: Add a helper to touch an MSC from any CPU
Resetting RIS entries from the cpuhp callback is easy as the
callback occurs on the correct CPU. This won't be true for any other
caller that wants to reset or configure an MSC.

Add a helper that schedules the provided function if necessary.

Callers should take the cpuhp lock to prevent the cpuhp callbacks from
changing the MSC state.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:21 +00:00
James Morse
f188a36ca2 arm_mpam: Reset MSC controls from cpuhp callbacks
When a CPU comes online, it may bring a newly accessible MSC with
it. Only the default partid has its value reset by hardware, and
even then the MSC might not have been reset since its config was
previously dirtied, e.g. by kexec.

Any in-use partid must have its configuration restored, or reset.
In-use partids may be held in caches and evicted later.

MSC are also reset when CPUs are taken offline to cover cases where
firmware doesn't reset the MSC over reboot using UEFI, or kexec
where there is no firmware involvement.

If the configuration for a RIS has not been touched since it was
brought online, it does not need resetting again.

To reset, write the maximum values for all discovered controls.

CC: Rohit Mathew <Rohit.Mathew@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:21 +00:00
James Morse
c10ca83a77 arm_mpam: Merge supported features during mpam_enable() into mpam_class
To make a decision about whether to expose an mpam class as
a resctrl resource we need to know its overall supported
features and properties.

Once we've probed all the resources, we can walk the tree
and produce overall values by merging the bitmaps. This
eliminates features that are only supported by some MSC
that make up a component or class.

If bitmap properties are mismatched within a component we
cannot support the mismatched feature.

Care has to be taken as vMSC may hold mismatched RIS.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:21 +00:00
James Morse
8c90dc68a5 arm_mpam: Probe the hardware features resctrl supports
Expand the probing support with the control and monitor types
we can use with resctrl.

CC: Dave Martin <Dave.Martin@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:21 +00:00
James Morse
d02beb06ca arm_mpam: Add helpers for managing the locking around the mon_sel registers
The MSC MON_SEL register needs to be accessed from hardirq for the overflow
interrupt, and when taking an IPI to access these registers on platforms
where MSC are not accessible from every CPU. This makes an irqsave
spinlock the obvious lock to protect these registers. On systems with SCMI
or PCC mailboxes it must be able to sleep, meaning a mutex must be used.
The SCMI or PCC platforms can't support an overflow interrupt, and
can't access the registers from hardirq context.

Clearly these two can't exist for one MSC at the same time.

Add helpers for the MON_SEL locking. For now, use an irqsave spinlock and
only support 'real' MMIO platforms.

In the future this lock will be split in two allowing SCMI/PCC platforms
to take a mutex. Because there are contexts where the SCMI/PCC platforms
can't make an access, mpam_mon_sel_lock() needs to be able to fail. Do
this now, so that all the error handling on these paths is present. This
allows the relevant paths to fail if they are needed on a platform where
this isn't possible, instead of having to make explicit checks of the
interface type.
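
The resulting call pattern, as a sketch (mpam_mon_sel_lock() is named above;
the unlock helper and the error code are assumptions):

  if (!mpam_mon_sel_lock(msc))
          return -EOPNOTSUPP;   /* interface not reachable from this context */

  /* ... program MON_SEL and read the monitor registers ... */

  mpam_mon_sel_unlock(msc);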

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:21 +00:00
James Morse
bd221f9f82 arm_mpam: Probe hardware to find the supported partid/pmg values
CPUs can generate traffic with a range of PARTID and PMG values,
but each MSC may also have its own maximum size for these fields.
Before MPAM can be used, the driver needs to probe each RIS on
each MSC, to find the system-wide smallest value that can be used.
The limits from requestors (e.g. CPUs) also need taking into account.

While doing this, RIS entries that firmware didn't describe are created
under MPAM_CLASS_UNKNOWN.

This adds the low level MSC write accessors.

While we're here, implement the mpam_register_requestor() call
for the arch code to register the CPU limits. Future callers of this
will tell us about the SMMU and ITS.
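
The expected use from the arch code, as a sketch (the exact signature is an
assumption; only the function name and its purpose are given above):

  /* Tell the driver the PARTID/PMG range the CPUs can generate. */
  mpam_register_requestor(cpu_partid_max, cpu_pmg_max);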

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:20 +00:00
James Morse
8f8d0ac1da arm_mpam: Add cpuhp callbacks to probe MSC hardware
Because an MSC can only be accessed from the CPUs in its cpu-affinity
set, we need to be running on one of those CPUs to probe the MSC
hardware.

Do this work in the cpuhp callback. Probing the hardware will only
happen before MPAM is enabled: walk all the MSCs and probe those we can
reach that haven't already been probed as each CPU's online call is made.

This adds the low-level MSC register read accessors.

Once all MSCs reported by the firmware have been probed from a CPU in
their respective cpu-affinity set, the probe-time cpuhp callbacks are
replaced.  The replacement callbacks will ultimately need to handle
save/restore of the runtime MSC state across power transitions, but for
now there is nothing to do in them: so do nothing.

The architecture's context switch code will be enabled by a static key;
this can be set by mpam_enable(), but must be done from process context,
not a cpuhp callback, because both take the cpuhp lock.
Whenever a new MSC has been probed, the mpam_enable() work is scheduled
to test if all the MSCs have been probed. If probing fails, mpam_disable()
is scheduled to unregister the cpuhp callbacks and free memory.

CC: Lecopzer Chen <lecopzerc@nvidia.com>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:20 +00:00
James Morse
aa64b9e110 arm_mpam: Add MPAM MSC register layout definitions
Memory Partitioning and Monitoring (MPAM) has memory mapped devices
(MSCs) with an identity/configuration page.

Add the definitions for these registers as offset within the page(s).

Link: https://developer.arm.com/documentation/ihi0099/aa/
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:20 +00:00
James Morse
01fb4b8224 arm_mpam: Add the class and component structures for firmware described ris
An MSC is a container of resources, each identified by their RIS index.
Some RIS are described by firmware to provide their position in the system.
Others are discovered when the driver probes the hardware.

To configure a resource it needs to be found by its class, e.g. 'L2'.
There are two kinds of grouping: a class is a set of components, which
are visible to user-space as there are likely to be multiple instances
of the L2 cache (e.g. one per cluster or package).

Add support for creating and destroying structures to allow a hierarchy
of resources to be created.
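
As a rough illustration of the hierarchy described above (field and type
names are placeholders, not the driver's actual definitions):

    struct mpam_class {                     /* e.g. all the L2 caches */
            struct list_head components;
            u8 level;
            enum mpam_class_types type;
    };

    struct mpam_component {                 /* one instance, e.g. per cluster/package */
            struct list_head class_list;    /* membership of the class */
            struct list_head msc_ris;       /* the RIS that implement it */
            u32 comp_id;
    };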

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:20 +00:00
James Morse
f04046f257 arm_mpam: Add probe/remove for mpam msc driver and kbuild boiler plate
Probing MPAM is convoluted. MSCs that are integrated with a CPU may
only be accessible from those CPUs, and they may not be online.
Touching the hardware early is pointless as MPAM can't be used until
the system-wide common values for num_partid and num_pmg have been
discovered.

Start with driver probe/remove and mapping the MSC.
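
A minimal sketch of the probe side, assuming a conventional platform
driver and the devm helpers (the private field names are illustrative):

    static int mpam_msc_drv_probe(struct platform_device *pdev)
    {
            struct mpam_msc *msc;

            msc = devm_kzalloc(&pdev->dev, sizeof(*msc), GFP_KERNEL);
            if (!msc)
                    return -ENOMEM;

            msc->mapped_hwpage = devm_platform_ioremap_resource(pdev, 0);
            if (IS_ERR(msc->mapped_hwpage))
                    return PTR_ERR(msc->mapped_hwpage);

            platform_set_drvdata(pdev, msc);
            return 0;
    }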

Cc: Carl Worth <carl@os.amperecomputing.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:20 +00:00
James Morse
115c5325be ACPI / MPAM: Parse the MPAM table
Add code to parse the arm64 specific MPAM table, looking up the cache
level from the PPTT and feeding the end result into the MPAM driver.

This happens in two stages. Platform devices are created first for the
MSC devices. Once the driver probes, it calls acpi_mpam_parse_resources()
to discover the RIS entries the MSC contains.

For now the MPAM hook mpam_ris_create() is stubbed out, but will update
the MPAM driver with optional discovered data about the RIS entries.

CC: Carl Worth <carl@os.amperecomputing.com>
Link: https://developer.arm.com/documentation/den0065/3-0bet/?lang=en
Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:20 +00:00
Ben Horgan
96f4a4d53e ACPI: Define acpi_put_table cleanup handler and acpi_get_table_pointer() helper
Define a cleanup helper for use with __free to release the acpi table when
the pointer goes out of scope. Also, introduce the helper
acpi_get_table_pointer() to simplify a commonly used pattern involving
acpi_get_table().

These are first used in a subsequent commit.
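
Roughly, the pattern this enables looks like the following (a sketch
based on linux/cleanup.h, not the exact implementation):

    DEFINE_FREE(acpi_table, struct acpi_table_header *,
                if (!IS_ERR_OR_NULL(_T)) acpi_put_table(_T))

    /* Caller side: the table is released when 'table' goes out of scope */
    struct acpi_table_header *table __free(acpi_table) =
                    acpi_get_table_pointer(ACPI_SIG_MPAM, 0);
    if (IS_ERR(table))
            return PTR_ERR(table);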

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:19 +00:00
Ben Horgan
f5915600cc platform: Define platform_device_put cleanup handler
Define a cleanup helper for use with __free to destroy platform devices
automatically when the pointer goes out of scope. This is only intended to
be used in error cases and so should be used with return_ptr() or
no_free_ptr() directly to avoid the automatic destruction on success.

A first use of this is introduced in a subsequent commit.
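
The intended usage pattern is roughly the following (illustrative; the
device name is made up):

    struct platform_device *pdev __free(platform_device_put) =
                    platform_device_alloc("example-msc", 0);
    if (!pdev)
            return ERR_PTR(-ENOMEM);

    if (platform_device_add(pdev))
            return ERR_PTR(-ENODEV);        /* pdev is put automatically here */

    return_ptr(pdev);                       /* success: keep the device alive */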

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:19 +00:00
James Morse
d8bf01d809 arm64: kconfig: Add Kconfig entry for MPAM
The bulk of the MPAM driver lives outside the arch code because it
largely manages MMIO devices that generate interrupts. The driver
needs a Kconfig symbol to enable it. As MPAM is only found on arm64
platforms, the arm64 tree is the most natural home for the Kconfig
option.

This Kconfig option will later be used by the arch code to enable
or disable the MPAM context-switch code, and to register properties
of CPUs with the MPAM driver.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
CC: Dave Martin <dave.martin@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:19 +00:00
James Morse
a39a723a6f ACPI / PPTT: Add a helper to fill a cpumask from a cache_id
MPAM identifies CPUs by the cache_id in the PPTT cache structure.

The driver needs to know which CPUs are associated with the cache.
The CPUs may not all be online, so cacheinfo does not have the
information.

Add a helper to pull this information out of the PPTT.
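
The shape of such a helper, sketched (the exact name and the
PPTT-walking internals are illustrative):

    void acpi_pptt_get_cpumask_from_cache_id(u32 cache_id, cpumask_t *cpus)
    {
            int cpu;

            cpumask_clear(cpus);
            for_each_possible_cpu(cpu) {
                    /* walk this CPU's PPTT cache nodes looking for cache_id */
                    if (pptt_cpu_has_cache_id(cpu, cache_id))   /* placeholder */
                            cpumask_set_cpu(cpu, cpus);
            }
    }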

CC: Rohit Mathew <Rohit.Mathew@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:14 +00:00
James Morse
41a7bb39fe ACPI / PPTT: Find cache level by cache-id
The MPAM table identifies caches by id. The MPAM driver also wants to know
the cache level to determine if the platform is of the shape that can be
managed via resctrl. Cacheinfo has this information, but only for CPUs that
are online.

Waiting for all CPUs to come online is a problem for platforms where
CPUs are brought online late by user-space.

Add a helper that walks every possible cache until it finds the one
identified by the cache-id, then returns the level.

Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:09 +00:00
Ben Horgan
cfc085af83 ACPI / PPTT: Add acpi_pptt_cache_v1_full to use pptt cache as one structure
In actbl2.h, acpi_pptt_cache describes the fields in the original
Cache Type Structure. In PPTT table version 3 a new field was added at the
end, cache_id. This is described in acpi_pptt_cache_v1 but rather than
including all v1 fields it just includes this one.

In lieu of this being fixed in acpica, introduce acpi_pptt_cache_v1_full to
contain all the fields of the Cache Type Structure. Update the existing
code to use this new struct. This simplifies the code and removes a
non-standard use of ACPI_ADD_PTR.
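
For reference, the full Cache Type Structure layout that the new struct
mirrors is approximately (see the PPTT specification and actbl2.h for
the authoritative definition):

    struct acpi_pptt_cache_v1_full {
            struct acpi_subtable_header header;
            u16 reserved;
            u32 flags;
            u32 next_level_of_cache;
            u32 size;
            u32 number_of_sets;
            u8 associativity;
            u8 attributes;
            u16 line_size;
            u32 cache_id;                   /* added in PPTT table version 3 */
    };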

Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:34:01 +00:00
James Morse
eeec7845e9 ACPI / PPTT: Stop acpi_count_levels() expecting callers to clear levels
In acpi_count_levels(), the initial value of *levels passed by the
caller is really an implementation detail of acpi_count_levels(), so it
is unreasonable to expect the callers of this function to know what to
pass in for this parameter.  The only sensible initial value is 0,
which is what the only upstream caller (acpi_get_cache_info()) passes.

Use a local variable for the starting cache level in acpi_count_levels(),
and pass the result back to the caller via the function return value.

Get rid of the levels parameter, which has no remaining purpose.

Fix acpi_get_cache_info() to match.
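
In other words, the interface changes roughly as follows (argument lists
abridged for illustration):

    /* before: the caller had to pre-initialise *levels to 0 */
    void acpi_count_levels(..., unsigned int *levels, unsigned int *split_levels);

    /* after: the level count is simply the return value */
    int acpi_count_levels(..., unsigned int *split_levels);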

Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:33:56 +00:00
James Morse
796e29b857 ACPI / PPTT: Add a helper to fill a cpumask from a processor container
The ACPI MPAM table uses the UID of a processor container specified in
the PPTT to indicate the subset of CPUs and cache topology that can
access each MPAM System Component (MSC).

This information is not directly useful to the kernel. The equivalent
cpumask is needed instead.

Add a helper to find the processor container by its id, then walk
the possible CPUs to fill a cpumask with the CPUs that have this
processor container as a parent.

CC: Dave Martin <dave.martin@arm.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Fenghua Yu <fenghuay@nvidia.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Tested-by: Fenghua Yu <fenghuay@nvidia.com>
Tested-by: Shaopeng Tan <tan.shaopeng@jp.fujitsu.com>
Tested-by: Peter Newman <peternewman@google.com>
Tested-by: Carl Worth <carl@os.amperecomputing.com>
Tested-by: Gavin Shan <gshan@redhat.com>
Tested-by: Zeng Heng <zengheng4@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 18:33:36 +00:00
Huang Ying
cb1fa2e999 arm64, tlbflush: don't TLBI broadcast if page reused in write fault
A multi-threaded customer workload with a large memory footprint uses
fork()/exec() to run some external programs every tens of seconds.  When
running the workload on an arm64 server machine, it is observed that
quite a few CPU cycles are spent in the TLB flushing functions, whereas
this is not the case on an x86_64 server machine.  This causes the
performance on arm64 to be much worse than that on x86_64.

During the workload running, after fork()/exec() write-protects all
pages in the parent process, memory writing in the parent process
will cause a write protection fault.  Then the page fault handler
will make the PTE/PDE writable if the page can be reused, which is
almost always true in the workload.  On arm64, to avoid the write
protection fault on other CPUs, the page fault handler flushes the TLB
globally with TLBI broadcast after changing the PTE/PDE.  However, this
isn't always necessary.  Firstly, it's safe to leave some stale
read-only TLB entries as long as they will be flushed finally.
Secondly, it's quite possible that the original read-only PTE/PDEs
aren't cached in remote TLB at all if the memory footprint is large.
In fact, on x86_64, the page fault handler doesn't flush the remote
TLB in this situation, which benefits the performance a lot.

To improve the performance on arm64, make the write protection fault
handler flush the TLB locally instead of globally via TLBI broadcast
after making the PTE/PDE writable.  If there are stale read-only TLB
entries in the remote CPUs, the page fault handler on these CPUs will
regard the page fault as spurious and flush the stale TLB entries.
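
Conceptually the change is the difference between the two TLBI forms
below; the real code goes through the arm64 tlbflush helpers rather than
open-coded assembly:

    static inline void flush_one_entry(unsigned long va_and_asid, bool broadcast)
    {
            if (broadcast)
                    /* invalidate on every CPU in the inner-shareable domain */
                    asm volatile("tlbi vale1is, %0" : : "r" (va_and_asid));
            else
                    /* invalidate on this CPU only; other CPUs rely on the
                     * spurious-fault fixup to drop stale read-only entries */
                    asm volatile("tlbi vale1, %0" : : "r" (va_and_asid));
            asm volatile("dsb ish\n\tisb" : : : "memory");
    }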

To test the patchset, modify usemem.c from
vm-scalability (https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git)
to support calling fork()/exec() periodically.  To mimic the behavior of
the customer workload, run usemem with 4 threads, access 100GB memory,
and call fork()/exec() every 40 seconds.  Test results show that with
the patchset the score of usemem improves ~40.6%.  The cycles% of TLB
flush functions reduces from ~50.5% to ~0.3% in perf profile.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 16:01:48 +00:00
Huang Ying
79301c7d60 mm: add spurious fault fixing support for huge pmd
The page faults may be spurious because of the racy access to the page
table.  For example, a non-populated virtual page is accessed on 2
CPUs simultaneously, thus the page faults are triggered on both CPUs.
However, it's possible that one CPU (say CPU A) cannot find the reason
for the page fault if the other CPU (say CPU B) has changed the page
table before the PTE is checked on CPU A.  Most of the time, the
spurious page faults can be ignored safely.  However, if the page
fault is for the write access, it's possible that a stale read-only
TLB entry exists in the local CPU and needs to be flushed on some
architectures.  This is called the spurious page fault fixing.

In the current kernel, there is spurious fault fixing support for pte,
but not for huge pmd because no architectures need it. But in the
next patch in the series, we will change the write protection fault
handling logic on arm64, so that some stale huge pmd entries may
remain in the TLB. These entries need to be flushed via the huge pmd
spurious fault fixing mechanism.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Yin Fengwei <fengwei_yin@linux.alibaba.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-19 16:01:48 +00:00
Reinette Chatre
5a88a6e92b fs/resctrl: Consider sparse masks when initializing new group's allocation
A new resource group is intended to be created with sane defaults. For a cache
resource this means all cache portions the new group could possibly allocate
into. This includes unused cache portions and shareable cache portions used by
other groups and hardware.

New resource group creation does not take sparse masks into account. After
determining the bitmask reflecting the new group's possible allocations, the
bitmask is forced to be contiguous even if the system supports sparse masks.
For example, a new group could by default allocate into a large portion of
cache represented by 0xff0f, but it is instead created with a mask of 0xf.

Do not force a contiguous allocation range if the system supports sparse masks.
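
A sketch of the intended behaviour (the helper names below are
placeholders, not the resctrl code):

    unsigned long avail = full_mask & ~used_by_others;          /* e.g. 0xff0f */

    if (!r->cache.arch_has_sparse_bitmasks)
            avail = largest_contiguous_run(avail);              /* e.g. 0xf */

    new_group_cbm = avail;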

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/abbbb008bc09d982d715e79d3b885c10f92c64e0.1763426240.git.reinette.chatre@intel.com
2025-11-18 21:10:56 +01:00
Sohil Mehta
d5cb957439 x86/cpu: Enable LASS during CPU initialization
Linear Address Space Separation (LASS) mitigates a class of side-channel
attacks that rely on speculative access across the user/kernel boundary.
Enable LASS along with similar security features if the platform
supports it.
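
A minimal sketch of the setup step, assuming the usual cpufeature/CR4
helpers and the feature/CR4 bit names introduced by this series:

    if (cpu_feature_enabled(X86_FEATURE_LASS))
            cr4_set_bits(X86_CR4_LASS);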

While at it, remove the comment above the SMAP/SMEP/UMIP/LASS setup
instead of updating it, as the whole sequence is quite self-explanatory.

Some EFI runtime and boot services may rely on 1:1 mappings in the lower
half during early boot and even after SetVirtualAddressMap(). To avoid
tripping LASS, the initial CR4 programming would need to be delayed
until EFI has completely finished entering virtual mode (including
efi_free_boot_services()). Also, LASS would need to be temporarily
disabled while switching to efi_mm to avoid potential faults on stray
runtime accesses.

Similarly, legacy vsyscall page accesses are flagged by LASS resulting
in a #GP (instead of a #PF). Without LASS, the #PF handler emulates the
accesses and returns the appropriate values. Equivalent emulation
support is required in the #GP handler with LASS enabled. In case of
vsyscall XONLY (execute only) mode, the faulting address is readily
available in the RIP which would make it easier to reuse the #PF
emulation logic.

For now, keep it simple and disable LASS if either of those is compiled
in. Though not ideal, this makes it easier to start testing LASS support
in some environments. In future, LASS support can easily be expanded to
support EFI and legacy vsyscalls.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-9-sohil.mehta%40intel.com
2025-11-18 10:38:27 -08:00
Sohil Mehta
c9129cf0f0 selftests/x86: Update the negative vsyscall tests to expect a #GP
Some of the vsyscall selftests expect a #PF when vsyscalls are disabled.
However, with LASS enabled, an invalid access results in a SIGSEGV due
to a #GP instead of a #PF. One such negative test fails because it is
expecting X86_PF_INSTR to be set.

Update the failing test to expect either a #GP or a #PF. Also, update
the printed messages to show the trap number (denoting the type of
fault) instead of assuming a #PF.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-8-sohil.mehta%40intel.com
2025-11-18 10:38:26 -08:00
Alexander Shishkin
42fea0a3a7 x86/traps: Communicate a LASS violation in #GP message
A LASS violation typically results in a #GP. With LASS active, any
invalid access to user memory (including the first page frame) would be
reported as a #GP, instead of a #PF.

Unfortunately, the #GP error messages provide limited information about
the cause of the fault. This could be confusing for kernel developers
and users who are accustomed to the friendly #PF messages.

To make the transition easier, enhance the #GP Oops message to include a
hint about LASS violations. Also, add a special hint for kernel NULL
pointer dereferences to match with the existing #PF message.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-7-sohil.mehta%40intel.com
2025-11-18 10:38:26 -08:00
Sohil Mehta
731d43750c x86/kexec: Disable LASS during relocate kernel
The relocate kernel mechanism uses an identity mapping to copy the new
kernel, which leads to a LASS violation when executing from a low
address.

LASS must be disabled after the original CR4 value is saved because
kexec paths that preserve context need to restore CR4.LASS. But,
disabling it along with CET during identity_mapped() is too late. So,
disable LASS immediately after saving CR4, along with PGE, and before
jumping to the identity-mapped page.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-6-sohil.mehta%40intel.com
2025-11-18 10:38:26 -08:00
Sohil Mehta
b3a7e973ab x86/alternatives: Disable LASS when patching kernel code
For patching, the kernel initializes a temporary mm area in the lower
half of the address range. LASS blocks these accesses because its
enforcement relies on bit 63 of the virtual address as opposed to SMAP
which depends on the _PAGE_BIT_USER bit in the page table. Disable LASS
enforcement by toggling the RFLAGS.AC bit during patching to avoid
triggering a #GP fault.

Introduce LASS-specific STAC/CLAC helpers to set the AC bit only on
platforms that need it. Name the wrappers as lass_stac()/_clac() instead
of lass_disable()/_enable() because they only control the kernel data
access enforcement. The entire LASS mechanism (including instruction
fetch enforcement) is controlled by the CR4.LASS bit.
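
The helpers are expected to look roughly like this (a sketch using the
x86 alternative() machinery; see the patch for the real definitions):

    static __always_inline void lass_stac(void)
    {
            alternative("", "stac", X86_FEATURE_LASS);
    }

    static __always_inline void lass_clac(void)
    {
            alternative("", "clac", X86_FEATURE_LASS);
    }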

Describe the usage of the new helpers in comparison to the ones used for
SMAP. Also, add comments to explain when the existing stac()/clac()
should be used. While at it, move the duplicated "barrier" comment to
the same block.

The text poking functions use the standard memcpy()/memset() while patching
kernel code. However, objtool complains about calling such dynamic
functions within an AC=1 region. See warning #9, regarding function
calls with UACCESS enabled, in tools/objtool/Documentation/objtool.txt.

To pacify objtool, one option is to add memcpy() and memset() to the
list of allowed-functions. However, that would provide a blanket
exemption for all usages of memcpy() and memset(). Instead, replace the
standard calls in the text poking functions with their unoptimized,
always-inlined versions. Considering that patching is usually small,
there is no performance impact expected.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-5-sohil.mehta%40intel.com
2025-11-18 10:38:26 -08:00
Peter Zijlstra (Intel)
d9a96cc18b x86/asm: Introduce inline memcpy and memset
Provide inline memcpy and memset functions that can be used instead of
the GCC builtins when necessary. The immediate use case is for the text
poking functions to avoid the standard memcpy()/memset() calls because
objtool complains about such dynamic calls within an AC=1 region. See
tools/objtool/Documentation/objtool.txt, warning #9, regarding function
calls with UACCESS enabled.

Some user copy functions such as copy_user_generic() and __clear_user()
have similar rep_{movs,stos} usages. But, those are highly specialized
and hard to combine or reuse for other things. Define these new helpers
for all other usages that need a completely unoptimized, strictly inline
version of memcpy() or memset().
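
Such a strictly-inline copy is essentially a rep movsb wrapper, roughly
(names illustrative):

    static __always_inline void *__inline_memcpy(void *to, const void *from, size_t len)
    {
            void *ret = to;

            asm volatile("rep movsb"
                         : "+D" (to), "+S" (from), "+c" (len)
                         : : "memory");
            return ret;
    }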

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-4-sohil.mehta%40intel.com
2025-11-18 10:38:26 -08:00
Sohil Mehta
e39c5387ad x86/cpu: Add an LASS dependency on SMAP
With LASS enabled, any kernel data access to userspace typically results
in a #GP, or a #SS in some stack-related cases. When the kernel needs to
access user memory, it can suspend LASS enforcement by toggling the
RFLAGS.AC bit. Most of these cases are already covered by the
stac()/clac() pairs used to avoid SMAP violations.

Even though LASS could potentially be enabled independently, it would be
very painful without SMAP and the related stac()/clac() calls. There is
no reason to support such a configuration because all future hardware
with LASS is expected to have SMAP as well. Also, the STAC/CLAC
instructions are architected to:
	#UD - If CPUID.(EAX=07H, ECX=0H):EBX.SMAP[bit 20] = 0.

So, make LASS depend on SMAP to conveniently reuse the existing AC bit
toggling already in place.
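
In practice the dependency would be a single entry in the cpuid
dependency table, along the lines of:

    { X86_FEATURE_LASS,             X86_FEATURE_SMAP },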

Note: Additional STAC/CLAC would still be needed for accesses such as
text poking which are not flagged by SMAP. This is because such mappings
are in the lower half but do not have the _PAGE_USER bit set which SMAP
uses for enforcement.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-3-sohil.mehta%40intel.com
2025-11-18 10:38:26 -08:00
Sohil Mehta
7baadd463e x86/cpufeatures: Enumerate the LASS feature bits
Linear Address Space Separation (LASS) is a security feature that
mitigates a class of side-channel attacks relying on speculative access
across the user/kernel boundary.

Privilege mode based access protection already exists today with paging
and features such as SMEP and SMAP. However, to enforce these
protections, the processor must traverse the paging structures in
memory. An attacker can use timing information resulting from this
traversal to determine details about the paging structures, and to
determine the layout of the kernel memory.

LASS provides the same mode-based protections as paging but without
traversing the paging structures. Because the protections are enforced
prior to page-walks, an attacker will not be able to derive paging-based
timing information from the various caching structures such as the TLBs,
mid-level caches, page walker, data caches, etc.

LASS enforcement relies on the kernel implementation to divide the
64-bit virtual address space into two halves:
  Addr[63]=0 -> User address space
  Addr[63]=1 -> Kernel address space

Any data access or code execution across address spaces typically
results in a #GP fault, with an #SS generated in some rare cases. The
LASS enforcement for kernel data accesses is dependent on CR4.SMAP being
set. The enforcement can be disabled by toggling the RFLAGS.AC bit
similar to SMAP.

Define the CPU feature bits to enumerate LASS. Also, disable the feature
at compile time on 32-bit kernels. Use a direct dependency on X86_32
(instead of !X86_64) to make it easier to combine with similar 32-bit
specific dependencies in the future.

LASS mitigates a class of side-channel speculative attacks, such as
Spectre LAM, described in the paper, "Leaky Address Masking: Exploiting
Unmasked Spectre Gadgets with Noncanonical Address Translation".

Add the "lass" flag to /proc/cpuinfo to indicate that the feature is
supported by hardware and enabled by the kernel. This allows userspace
to determine if the system is secure against such attacks.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251118182911.2983253-2-sohil.mehta%40intel.com
2025-11-18 10:38:26 -08:00
Thorsten Blum
cdf5ecc3f6 EDAC/ghes: Replace deprecated strcpy() in ghes_edac_report_mem_error()
strcpy() has been deprecated¹ because it performs no bounds checking on the
destination buffer, which can lead to buffer overflows. Use the safer
strscpy() instead.
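
strscpy() bounds the copy by the destination size and guarantees
NUL-termination, e.g. (buffer and string here are illustrative):

    strscpy(dimm_label, "unknown", sizeof(dimm_label));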

¹ https://www.kernel.org/doc/html/latest/process/deprecated.html#strcpy

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://patch.msgid.link/20251118135621.101148-2-thorsten.blum@linux.dev
2025-11-18 16:50:32 +01:00
Chengkaitao
9d3faec60b genirq: Use raw_spinlock_irq() in irq_set_affinity_notifier()
Since irq_set_affinity_notifier() may sleep, it is always called with
interrupts enabled, so raw_spinlock_irqsave() can be replaced with
raw_spinlock_irq().

Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251118012754.61805-1-pilgrimtao@gmail.com
2025-11-18 16:19:40 +01:00
Dan Carpenter
80adaccf0e rseq: Delete duplicate if statement in rseq_virt_userspace_exit()
This if statement is indented weirdly.  It's a duplicate and doesn't
affect runtime (beyond wasting a little time).  Delete it.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/aRxP3YcwscrP1BU_@stanley.mountain
2025-11-18 15:56:55 +01:00
Christophe Leroy
4322c8f81c lib/strn*,uaccess: Use masked_user_{read/write}_access_begin when required
Properly use masked_user_read_access_begin() and
masked_user_write_access_begin() instead of masked_user_access_begin() in
order to match user_read_access_end() and user_write_access_end().  This is
important for architectures like PowerPC that enable user reads and user
writes separately.

That means masked_user_read_access_begin() is used when user memory is
exclusively read during the window and masked_user_write_access_begin()
is used when user memory is exclusively written during the window.
masked_user_access_begin() remains and is used when both reads and
writes are performed during the open window. Each of them is expected
to be terminated by the matching user_read_access_end(),
user_write_access_end() and user_access_end().
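
A typical read-only window then looks roughly like this (sketch; compare
strncpy_from_user()):

    if (can_do_masked_user_access())
            src = masked_user_read_access_begin(src);
    else if (!user_read_access_begin(src, sizeof(*src)))
            return -EFAULT;

    unsafe_get_user(val, src, Efault);
    user_read_access_end();
    return 0;

    Efault:
    user_read_access_end();
    return -EFAULT;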

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/cb5e4b0fa49ea9c740570949d5e3544423389757.1763396724.git.christophe.leroy@csgroup.eu
2025-11-18 15:27:35 +01:00
Christophe Leroy
1c204914bc scm: Convert put_cmsg() to scoped user access
Replace the open coded implementation with the scoped user access guard.

That also corrects the imbalance between masked_user_access_begin() and
user_write_access_end(), which would affect PowerPC when it gains masked
user access support.

No functional change intended.

[ tglx: Amend change log ]

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/793219313f641eda09a892d06768d2837246bf9f.1763396724.git.christophe.leroy@csgroup.eu
2025-11-18 15:27:34 +01:00
Christophe Leroy
803abedbd5 iov_iter: Add missing speculation barrier to copy_from_user_iter()
The results of "access_ok()" can be mis-speculated.  The result is that
the CPU can end up speculatively here:

	if (access_ok(from, size))
		// Right here

For the same reason as done in copy_from_user() in commit 74e19ef0ff
("uaccess: Add speculation barrier to copy_from_user()"), add a speculation
barrier to copy_from_user_iter().
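
The shape of the fix is simply (illustrative):

    if (access_ok(from, size)) {
            barrier_nospec();       /* don't run the copy on a mis-speculated "yes" */
            n = raw_copy_from_user(to, from, size);
    }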

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/6b73e69cc7168c89df4eab0a216e3ed4cca36b0a.1763396724.git.christophe.leroy@csgroup.eu
2025-11-18 15:27:34 +01:00
Christophe Leroy
4db1df7a72 iov_iter: Convert copy_from_user_iter() to masked user access
copy_from_user_iter() lacks a speculation barrier. Naively adding one
would degrade performance on some architectures like x86, which would be
unfortunate as copy_from_user_iter() is a critical hotpath function.

Convert copy_from_user_iter() to use masked user access on architectures
that support it. This allows adding the speculation barrier without
impacting performance.

This is similar to what was done for copy_from_user() in commit
0fc810ae3a ("x86/uaccess: Avoid barrier_nospec() in 64-bit
copy_from_user()")

[ tglx: Massage change log ]

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/58e4b07d469ca68a2b9477fe2c1ccc8a44cef131.1763396724.git.christophe.leroy@csgroup.eu
2025-11-18 15:27:34 +01:00
Mark Brown
a0245b42f8 kselftest/arm64: Cover disabling streaming mode without SVE in fp-ptrace
On a system which support SME but not SVE we can now disable streaming mode
via ptrace by writing FPSIMD formatted data through NT_ARM_SVE with a VL of
0. Extend fp-ptrace to cover rather than skip these cases: relax the check
for SVE writes of FPSIMD format data so it does not skip if SME is supported,
and accept 0 as the VL when performing the ptrace write.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-17 20:11:54 +00:00
Mark Brown
eb9df6d69a kselftest/arm64: Test NT_ARM_SVE FPSIMD format writes on non-SVE systems
In order to allow exiting streaming mode on systems with SME but not SVE
we allow writes of FPSIMD format data via NT_ARM_SVE even when SVE is not
supported. Add a test case that covers this to sve-ptrace.

We do not support reads.

Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-17 20:11:54 +00:00
Mark Brown
472800cd5e arm64/sme: Support disabling streaming mode via ptrace on SME only systems
Currently it is not possible to disable streaming mode via ptrace on SME
only systems; the interface for doing this is to write via NT_ARM_SVE, but
such writes will be rejected on a system without SVE support. Enable this
functionality by allowing userspace to write SVE_PT_REGS_FPSIMD format data
via NT_ARM_SVE with the vector length set to 0 on SME only systems. Such
writes currently error since we require that a vector length is specified
which should minimise the risk that existing software is relying on current
behaviour.
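
From a debugger's point of view such a write could look roughly like
this (sketch; error handling omitted and the traced pid is assumed to be
in 'child'):

    struct user_fpsimd_state fpsimd = { };      /* desired V0-V31/FPSR/FPCR */
    char buf[SVE_PT_SIZE(0, SVE_PT_REGS_FPSIMD)] = { };
    struct user_sve_header *sve = (void *)buf;
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    sve->size = sizeof(buf);
    sve->flags = SVE_PT_REGS_FPSIMD;
    sve->vl = 0;                 /* accepted on SME-only systems after this patch */
    memcpy(buf + SVE_PT_FPSIMD_OFFSET, &fpsimd, sizeof(fpsimd));
    ptrace(PTRACE_SETREGSET, child, NT_ARM_SVE, &iov);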

Reads are not supported since I am not aware of any use case for this and
there is some risk that an existing userspace application may be confused if
it reads NT_ARM_SVE on a system without SVE. Existing kernels will return
FPSIMD formatted register state from NT_ARM_SVE if full SVE state is not
stored, for example if the task has not used SVE. Returning a vector length
of 0 would create a risk that software would try to do things like allocate
space for register state with zero sizes, while returning a vector length of
128 bits would look like SVE is supported. It seems safer to just not make
the changes to add read support.

It remains possible for userspace to detect an SME only system via the ptrace
interface alone, since reads of NT_ARM_SSVE and NT_ARM_ZA will succeed while
reads of NT_ARM_SVE will fail. Read/write access to the FPSIMD registers in
non-streaming mode is available via REGSET_FPR.

sve_set_common() already avoids allocating SVE storage when doing a FPSIMD
formatted write and allocating SME storage when doing a NT_ARM_SVE write so
we change the function to validate the new case and skip setting a vector
length for it.

The aim is to make a minimally invasive change: no operation that would
previously have succeeded will be affected, and we use a previously
defined interface in new circumstances rather than define completely new
ABI.

Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: David Spickett <david.spickett@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-17 20:11:54 +00:00
Peter Oberparleiter
2a2153a2ba s390/debug: Update description of resize operation
With commit 1204777867 ("s390/debug: keep debug data on resize")
the behavior of a debug area resize operation was changed. Update the
associated documentation to reflect this change.

Fixes: 1204777867 ("s390/debug: keep debug data on resize")
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:55:16 +01:00
Heiko Carstens
0d79affa31 Merge branch 'compat-removal'
Heiko Carstens says:

====================

Remove s390 compat support to allow for code simplification and especially
reduced test effort. To the best of our knowledge there aren't any 31 bit
binaries out in the world anymore that would matter for newer kernels or
newer distributions.

Distributions do not provide compat packages since quite some time or even
have CONFIG_COMPAT disabled.

Instead of adding deprecation warnings to config option, or adding kernel
messages, just remove the code. Deprecation warnings haven't proven to be
useful. If it turns out there is still a reason to keep the compat support
this series can be reverted at any time in the future.

====================

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:11:48 +01:00
Heiko Carstens
4ac286c4a8 s390/syscalls: Switch to generic system call table generation
The s390 syscall.tbl format differs slightly from most others, and
therefore requires an s390 specific system call table generation
script.

With compat support gone use the opportunity to switch to generic
system call table generation. The abi for all 64 bit system calls is
now common, since there is no need to specify if system call entry
points are only for 64 bit anymore.

Furthermore create the system call table in C instead of assembler
code in order to get type checking for all system call functions
contained within the table.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:10:39 +01:00
Heiko Carstens
f4e1f1b137 s390/syscalls: Remove system call table pointer from thread_struct
With compat support gone there is only one system call table
left. Therefore remove the sys_call_table pointer from
thread_struct and use the sys_call_table directly.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:10:39 +01:00
Heiko Carstens
3db5cf9354 s390/uapi: Remove 31 bit support from uapi header files
Since the kernel does not support running 31 bit / compat binaries
anymore, remove also the corresponding 31 bit support from uapi header
files.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:10:38 +01:00
Heiko Carstens
8e0b986c59 s390: Remove compat support
There shouldn't be any 31 bit code around anymore that matters.
Remove the compat layer support required to run 31 bit code.

Reason for removal is code simplification and reduced test effort.

Note that this comes without any deprecation warnings added to config
options, or kernel messages, since most likely those would be ignored
anyway.

If it turns out there is still a reason to keep the compat layer this
can be reverted at any time in the future.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:10:38 +01:00
Heiko Carstens
169ebcbb90 tools: Remove s390 compat support
Remove s390 compat support from everything within tools, since s390 compat
support will be removed from the kernel.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Weißschuh <linux@weissschuh.net> # tools/nolibc selftests/nolibc
Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> # selftests/vDSO
Acked-by: Alexei Starovoitov <ast@kernel.org> # bpf bits
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:10:38 +01:00
Heiko Carstens
7afb095df3 s390/syscalls: Add pt_regs parameter to SYSCALL_DEFINE0() syscall wrapper
All system call wrappers should match the sys_call_ptr_t type. This is not
the case for system calls without parameters. Add the missing pt_regs
parameter there too.

Note: this is currently not a problem, since the parameter is unused.
However it prevents creating a correctly typed system call table in
C. With the current assembler implementation this works because of
missing type checking.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:10:38 +01:00
Heiko Carstens
b2da5f6400 s390/kvm: Use psw32_t instead of psw_compat_t
kvm_s390_handle_lpsw() makes use of the psw_compat_t type even though
the code has nothing to do with CONFIG_COMPAT, for which the type is
supposed to be used. Use psw32_t instead.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:10:37 +01:00
Heiko Carstens
8c633c78c2 s390/ptrace: Rename psw_t32 to psw32_t
Use a standard "_t" suffix for psw_t32 and rename it to psw32_t.

Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-17 11:10:37 +01:00
Sean Christopherson
f2f22721ac x86/sgx: Fix a typo in the kernel-doc comment for enum sgx_attribute
Use the exact enum name when documenting "enum sgx_attribute" to fix a
warning if the file is fed into kernel-doc processing:

  WARNING: ./arch/x86/include/asm/sgx.h:139 expecting prototype for enum
           sgx_attributes. Prototype was for enum sgx_attribute instead

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112160708.1343355-6-seanjc%40google.com
2025-11-14 15:30:32 -08:00
Sean Christopherson
55bf13b612 x86/sgx: Remove superfluous asterisk from copyright comment in asm/sgx.h
Drop an asterisk from a file-level copyright comment so that the comment
isn't interpreted as a kernel-doc comment.

E.g. if arch/x86/include/asm/sgx.h is fed into kernel-doc processing:

  WARNING: ./arch/x86/include/asm/sgx.h:2 This comment starts with '/**',
  but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112160708.1343355-5-seanjc%40google.com
2025-11-14 15:30:28 -08:00
Sean Christopherson
905885fdb1 x86/sgx: Document structs and enums with '@', not '%'
Use '@' to document structure members and enum values in kernel-doc markup,
as per Documentation/doc-guide/kernel-doc.rst and flagged by make htmldocs.

  WARNING: arch/x86/include/uapi/asm/sgx.h:17 Enum value 'SGX_PAGE_MEASURE'
           not described in enum 'sgx_page_flags'

Opportunistically add a missing ':' for SGX_CHILD_PRESENT.

Closes: https://lore.kernel.org/all/20251106145506.145fc620@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112160708.1343355-4-seanjc%40google.com
2025-11-14 15:30:26 -08:00
Sean Christopherson
243ea511fe x86/sgx: Add kernel-doc descriptions for params passed to vDSO user handler
Add kernel-doc markup for the register parameters passed by the vDSO blob
to the user handler to suppress build warnings, e.g.

  WARNING: arch/x86/include/uapi/asm/sgx.h:157 function parameter 'r8' not
           described in 'sgx_enclave_user_handler_t'

Call out that except for RSP, the registers are undefined on asynchronous
exits as far as the vDSO ABI is concerned.  E.g. the vDSO's exception
handler clobbers RDX, RDI, and RSI, and the kernel doesn't guarantee that
R8 or R9 will be zero (the synthetic value loaded by the CPU).

Closes: https://lore.kernel.org/all/20251106145506.145fc620@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112160708.1343355-3-seanjc%40google.com
2025-11-14 15:30:22 -08:00
Sean Christopherson
75801ca620 x86/sgx: Add a missing colon in kernel-doc markup for "struct sgx_enclave_run"
Add a missing ':' for the description of sgx_enclave_run.reserved so that
documentation for the member is correctly generated:

  WARNING: arch/x86/include/uapi/asm/sgx.h:184 struct member 'reserved' not
  described in 'sgx_enclave_run'

Closes: https://lore.kernel.org/all/20251106145506.145fc620@canb.auug.org.au
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112160708.1343355-2-seanjc%40google.com
2025-11-14 15:30:13 -08:00
Thomas Weißschuh
308bc2e338 selftests/timers/nanosleep: Add tests for return of remaining time
If interrupted by a signal, clock_nanosleep() returns the remaining time
into the structure pointed to by the rmtp parameter. So far this
functionality was not tested by the timer selftests.

Extend the nanosleep selftest to cover this feature.
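
A sketch of the kind of check added (assuming the selftest's usual ksft
helpers; the details differ in the actual test):

    static void wakeup(int sig)
    {
    }

    static void check_remaining_time(void)
    {
            struct timespec req = { .tv_sec = 5 }, rem = { };
            int err;

            signal(SIGALRM, wakeup);
            alarm(1);                       /* interrupt the sleep after ~1s */
            err = clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &rem);
            if (err == EINTR && (rem.tv_sec || rem.tv_nsec) && rem.tv_sec <= 4)
                    ksft_test_result_pass("rmtp updated on EINTR\n");
            else
                    ksft_test_result_fail("rmtp updated on EINTR\n");
    }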

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251106-nanosleep-rtmp-selftest-v1-1-f9212fb295fe@linutronix.de
2025-11-14 20:34:50 +01:00
Wake Liu
05d89fe7e4 selftests/timers: Clean up kernel version check in posix_timers
Several tests in the posix_timers selftest which test timer behavior
related to SIG_IGN fail on kernels older than 6.13. This is due to
a refactoring of signal handling in commit caf77435dd ("signal:
Handle ignored signals in do_sigaction(action != SIG_IGN)").

A previous attempt to fix this by adding a kernel version check to each
of the nine affected tests was suboptimal, as it resulted in emitting
the same skip message nine times.

Following the suggestion from Thomas Gleixner, this is refactored to
perform a single version check in main(). To satisfy the kselftest
framework's requirement for the test count to match the declared plan,
the plan is now conditionally set to 10 (for older kernels) or 19.

While setting the plan conditionally may seem complex, it is the
better approach to avoid the alternatives: either running tests on
unsupported kernels that are known to fail, or emitting a noisy series
of nine identical skip messages. A single informational message is now
printed instead when the tests are skipped.

Signed-off-by: Wake Liu <wakel@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250807085042.1690931-1-wakel@google.com/
Link: https://patch.msgid.link/20251103114502.584940-1-wakel@google.com
2025-11-14 20:34:50 +01:00
Jianyun Gao
4518767be9 time: Fix a few typos in time[r] related code comments
Signed-off-by: Jianyun Gao <jianyungao89@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20250927093411.1509275-1-jianyungao89@gmail.com
2025-11-14 20:34:50 +01:00
Borislav Petkov (AMD)
e67997021f x86/bugs: Get rid of the forward declarations
Get rid of the forward declarations of the mitigation functions by
moving their single caller below them.

No functional changes.

Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20251105200447.GBaQut3w4dLilZrX-z@fat_crate.local
2025-11-14 20:32:21 +01:00
Sunday Adelodun
e54dd0474c time: tick-oneshot: Add missing Return and parameter descriptions to kernel-doc
Several functions in kernel/time/tick-oneshot.c are missing parameter and
return value descriptions in their kernel-doc comments. This causes
warnings during doc generation.

Update the kernel-doc blocks to include detailed @param and Return:
descriptions for better clarity and to fix kernel-doc warnings.  No
functional code changes are made.

Signed-off-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251106113938.34693-3-adelodunolaoluwa@yahoo.com
2025-11-14 20:17:44 +01:00
Thomas Weißschuh
4702f4eceb hrtimer: Store time as ktime_t in restart block
The hrtimer core uses ktime_t to represent times, use that also for the
restart block. CPU timers internally use nanoseconds instead of ktime_t
but use the same restart block, so use the correct accessors for those.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251110-restart-block-expiration-v1-3-5d39cc93df4f@linutronix.de
2025-11-14 16:31:19 +01:00
Heiko Carstens
52a1f73d17 s390/fault: Print unmodified PSW address on protection exception
In case of a kernel crash caused by a protection exception, print the
unmodified PSW address as reported by the CPU. The protection exception
handler modifies the PSW address in order to keep fault handling easy;
however, that leads to misleading call traces.

Therefore restore the original PSW address before printing it.

Before this change the output in case of a protection exception looks like
this:

 Oops: 0004 ilc:2 [#1]SMP
 Krnl PSW : 0704c00180000000 000003ffe0b40d78 (sysrq_handle_crash+0x28/0x40)
            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
...
 Krnl Code: 000003ffe0b40d66: e3e0f0980024        stg     %r14,152(%r15)
            000003ffe0b40d6c: c010fffffff2        larl    %r1,000003ffe0b40d50
           #000003ffe0b40d72: c0200046b6bc        larl    %r2,000003ffe1417aea
           >000003ffe0b40d78: 92021000            mvi     0(%r1),2
            000003ffe0b40d7c: c0e5ffae03d6        brasl   %r14,000003ffe0101528

With this change it looks like this:

 Oops: 0004 ilc:2 [#1]SMP
 Krnl PSW : 0704c00180000000 000003ffe0b40dfc (sysrq_handle_crash+0x2c/0x40)
            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
...
 Krnl Code: 000003ffe0b40dec: c010fffffff2        larl    %r1,000003ffe0b40dd0
            000003ffe0b40df2: c0200046b67c        larl    %r2,000003ffe1417aea
           *000003ffe0b40df8: 92021000            mvi     0(%r1),2
           >000003ffe0b40dfc: c0e5ffae03b6        brasl   %r14,000003ffe0101568
            000003ffe0b40e02: 0707                bcr     0,%r7

Note that with this change the PSW address points to the instruction behind
the instruction which caused the exception, as is expected for
protection exceptions.

This also replaces the '#' marker in the disassembly with '*', which makes
it possible to distinguish between new and old behavior.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:34:28 +01:00
Heiko Carstens
a603a00399 s390/uprobes: Use __forward_psw() instead of private implementation
With adjust_psw_addr() the uprobes code contains more or less a private
__forward_psw() implementation. Switch it to use __forward_psw(), and
remove adjust_psw_addr().

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:34:28 +01:00
Heiko Carstens
37450e0994 s390/processor: Add __forward_psw() helper
Similar to __rewind_psw() add the counter part __forward_psw(). This
helps to make code more readable if a PSW address has to be forwarded,
since it is more natural to write

addr = __forward_psw(psw, ilen);

instead of

addr = __rewind_psw(psw, -ilen);

This also renames the ilc parameter of __rewind_psw() to ilen, since
the parameter reflects an instruction length, and not an instruction
length code. Also change the type of ilen from unsigned long to long
so it reflects that lengths can be negative or positive.
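
The obvious shape of the new helper, sketched in terms of the existing
one (see the patch for the real definition):

    static inline unsigned long __forward_psw(psw_t psw, long ilen)
    {
            return __rewind_psw(psw, -ilen);
    }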

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:34:28 +01:00
Aleksei Nikiforov
14e4e4175b s390/fpu: Fix false-positive kmsan report in fpu_vstl()
A false-positive kmsan report is detected when running the ping command.

The inline assembly instruction 'vstl' can write a varying number of bytes
depending on the value of the 'index' argument. If 'index' > 0, 'vstl' writes
at least 2 bytes.

clang generates the kmsan write helper call based on the inline assembly
constraints. Constraints are evaluated at compile time, but the value of the
'index' argument is known only at runtime.

clang currently generates a call to __msan_instrument_asm_store with a size
of 1 byte. Manually call the kmsan function to indicate the correct number of
bytes written and fix the false-positive report.
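
The idea of the fix, sketched (assuming kmsan_unpoison_memory() is the
helper used; 'dest' stands for the vector store destination):

    /* 'vstl' stores bytes 0..index, i.e. index + 1 bytes */
    kmsan_unpoison_memory(dest, index + 1);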

This change fixes following kmsan reports:

[   36.563119] =====================================================
[   36.563594] BUG: KMSAN: uninit-value in virtqueue_add+0x35c6/0x7c70
[   36.563852]  virtqueue_add+0x35c6/0x7c70
[   36.564016]  virtqueue_add_outbuf+0xa0/0xb0
[   36.564266]  start_xmit+0x288c/0x4a20
[   36.564460]  dev_hard_start_xmit+0x302/0x900
[   36.564649]  sch_direct_xmit+0x340/0xea0
[   36.564894]  __dev_queue_xmit+0x2e94/0x59b0
[   36.565058]  neigh_resolve_output+0x936/0xb40
[   36.565278]  __neigh_update+0x2f66/0x3a60
[   36.565499]  neigh_update+0x52/0x60
[   36.565683]  arp_process+0x1588/0x2de0
[   36.565916]  NF_HOOK+0x1da/0x240
[   36.566087]  arp_rcv+0x3e4/0x6e0
[   36.566306]  __netif_receive_skb_list_core+0x1374/0x15a0
[   36.566527]  netif_receive_skb_list_internal+0x1116/0x17d0
[   36.566710]  napi_complete_done+0x376/0x740
[   36.566918]  virtnet_poll+0x1bae/0x2910
[   36.567130]  __napi_poll+0xf4/0x830
[   36.567294]  net_rx_action+0x97c/0x1ed0
[   36.567556]  handle_softirqs+0x306/0xe10
[   36.567731]  irq_exit_rcu+0x14c/0x2e0
[   36.567910]  do_io_irq+0xd4/0x120
[   36.568139]  io_int_handler+0xc2/0xe8
[   36.568299]  arch_cpu_idle+0xb0/0xc0
[   36.568540]  arch_cpu_idle+0x76/0xc0
[   36.568726]  default_idle_call+0x40/0x70
[   36.568953]  do_idle+0x1d6/0x390
[   36.569486]  cpu_startup_entry+0x9a/0xb0
[   36.569745]  rest_init+0x1ea/0x290
[   36.570029]  start_kernel+0x95e/0xb90
[   36.570348]  startup_continue+0x2e/0x40
[   36.570703]
[   36.570798] Uninit was created at:
[   36.571002]  kmem_cache_alloc_node_noprof+0x9e8/0x10e0
[   36.571261]  kmalloc_reserve+0x12a/0x470
[   36.571553]  __alloc_skb+0x310/0x860
[   36.571844]  __ip_append_data+0x483e/0x6a30
[   36.572170]  ip_append_data+0x11c/0x1e0
[   36.572477]  raw_sendmsg+0x1c8c/0x2180
[   36.572818]  inet_sendmsg+0xe6/0x190
[   36.573142]  __sys_sendto+0x55e/0x8e0
[   36.573392]  __s390x_sys_socketcall+0x19ae/0x2ba0
[   36.573571]  __do_syscall+0x12e/0x240
[   36.573823]  system_call+0x6e/0x90
[   36.573976]
[   36.574017] Byte 35 of 98 is uninitialized
[   36.574082] Memory access of size 98 starts at 0000000007aa0012
[   36.574218]
[   36.574325] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G    B            N  6.17.0-dirty #16 NONE
[   36.574541] Tainted: [B]=BAD_PAGE, [N]=TEST
[   36.574617] Hardware name: IBM 3931 A01 703 (KVM/Linux)
[   36.574755] =====================================================

[   63.532541] =====================================================
[   63.533639] BUG: KMSAN: uninit-value in virtqueue_add+0x35c6/0x7c70
[   63.533989]  virtqueue_add+0x35c6/0x7c70
[   63.534940]  virtqueue_add_outbuf+0xa0/0xb0
[   63.535861]  start_xmit+0x288c/0x4a20
[   63.536708]  dev_hard_start_xmit+0x302/0x900
[   63.537020]  sch_direct_xmit+0x340/0xea0
[   63.537997]  __dev_queue_xmit+0x2e94/0x59b0
[   63.538819]  neigh_resolve_output+0x936/0xb40
[   63.539793]  ip_finish_output2+0x1ee2/0x2200
[   63.540784]  __ip_finish_output+0x272/0x7a0
[   63.541765]  ip_finish_output+0x4e/0x5e0
[   63.542791]  ip_output+0x166/0x410
[   63.543771]  ip_push_pending_frames+0x1a2/0x470
[   63.544753]  raw_sendmsg+0x1f06/0x2180
[   63.545033]  inet_sendmsg+0xe6/0x190
[   63.546006]  __sys_sendto+0x55e/0x8e0
[   63.546859]  __s390x_sys_socketcall+0x19ae/0x2ba0
[   63.547730]  __do_syscall+0x12e/0x240
[   63.548019]  system_call+0x6e/0x90
[   63.548989]
[   63.549779] Uninit was created at:
[   63.550691]  kmem_cache_alloc_node_noprof+0x9e8/0x10e0
[   63.550975]  kmalloc_reserve+0x12a/0x470
[   63.551969]  __alloc_skb+0x310/0x860
[   63.552949]  __ip_append_data+0x483e/0x6a30
[   63.553902]  ip_append_data+0x11c/0x1e0
[   63.554912]  raw_sendmsg+0x1c8c/0x2180
[   63.556719]  inet_sendmsg+0xe6/0x190
[   63.557534]  __sys_sendto+0x55e/0x8e0
[   63.557875]  __s390x_sys_socketcall+0x19ae/0x2ba0
[   63.558869]  __do_syscall+0x12e/0x240
[   63.559832]  system_call+0x6e/0x90
[   63.560780]
[   63.560972] Byte 35 of 98 is uninitialized
[   63.561741] Memory access of size 98 starts at 0000000005704312
[   63.561950]
[   63.562824] CPU: 3 UID: 0 PID: 192 Comm: ping Tainted: G    B            N  6.17.0-dirty #16 NONE
[   63.563868] Tainted: [B]=BAD_PAGE, [N]=TEST
[   63.564751] Hardware name: IBM 3931 A01 703 (KVM/Linux)
[   63.564986] =====================================================

Fixes: dcd3e1de9d ("s390/checksum: provide csum_partial_copy_nocheck()")
Signed-off-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:34:27 +01:00
Thomas Richter
d17901e8e8 s390/pai: Calculate size of reserved PAI extension control block area
The PAI extension 1 control block area is 512 bytes in total.
It currently contains three address pointers which refer to counter
memory blocks, followed by a reserved area.
Calculate the size of the reserved area instead of hardcoding it. This
makes the code more readable and maintainable.
No functional change.
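
A rough illustration of the idea (member names are assumptions, not the real
layout):

#include <linux/types.h>

/*
 * 512-byte PAI extension 1 control block area: size the reserved tail from
 * the preceding members instead of hardcoding it.
 */
struct paiext_cb_sketch {
	u64 acc[3];				/* three counter area address pointers */
	u8  reserved[512 - 3 * sizeof(u64)];	/* remaining reserved area */
};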

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Suggested-by: Jan Polensky <japo@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:34:27 +01:00
Heiko Carstens
b60d126c8e s390/mm: Let dump_fault_info() print additional information
Let dump_fault_info() print additional information to make debugging
easier:

Print "FSI" if the access-exception-fetch/store-indication facility is
installed. If it is installed the TEID may also indicate if an exception
happened because of a fetch or a store operation.

Print "SOP", "ESOP-1", or "ESOP-2" depending on the type of the installed
Suppression-on-Protection facility. This also gives additional information
about the validity and meaning of the TEID bits.

The output is changed from something like:

Failing address: 0000000000000000 TEID: 0000000000000803

to

Failing address: 0000000000000000 TEID: 0000000000000803 ESOP-2 FSI

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:34:27 +01:00
Heiko Carstens
76502abca2 s390/mm: Change comment and die() message if teid.b61 is zero
The comments in do_protection() give the impression that a TEID, where bit
61 is zero, indicates a low address protection exception. This is not
necessarily true; what it means depends on the type of Suppression-on-
Protection facility of the machine (see Principles of Operation).

Rework the comments and the die() message to reflect this. This may also
help to avoid confusion.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:34:27 +01:00
Heiko Carstens
02310adcc6 s390/mm: Remove unused flush_tlb()
flush_tlb() exists for historic reasons and was never used. Remove it.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:34:27 +01:00
Heiko Carstens
f518d469fe Merge branch 'pai-pmu-merge'
Thomas Richter says:

====================

The PAI PMUs pai_crypto and pai_ext both operate on memory
mapped counters supported by z16 and follow on machines.
These memory mapped counters have a lot in common, like:
 - validation, installing and removing events
 - starting and stopping events
 - retrieving counter values
 - collecting sample data.

However both PMU drivers have slightly different parameters,
for example:
 - different mapped memory size
 - different number of supported counters
 - different counter numbers and names
 - different bits in the CR0 register
 - different anchor address in lowcore

Due to these different parameters, two independent
PMUs have been developed. However both PMU drivers
have very much in common and most of the PMU call back
functions look very similar and are sometimes identical.

This patch set combines both independent PMU device drivers
perf_pai_crypto.c and perf_pai_ext.c into one device driver.
The new device driver operates on a table which contains
the different parameters and uses common functions for
event operations.

The result is one PAI PMU driver which supports both PMUs.
It is also extendable to support new PAI PMUs.

====================

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:50 +01:00
Thomas Richter
492578d3a2 s390/pai: Rename perf_pai_crypto.c to perf_pai.c
Rename perf_pai_crypto.c to perf_pai.c. The new perf_pai.c
contains both PAI device drivers:
 - pai_crypto for PAI crypto counter set
 - pai_ext for PAI NNPA counter set
The rename reflects this common driver supporting both PMUs.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:08 +01:00
Thomas Richter
8b65b0ba35 s390/pai_crypto: Merge pai_ext PMU into pai_crypto
Combine PAI cryptography and PAI extension (NNPA) PMUs in one driver.
Remove file perf_pai_ext.c and registration of PMU "pai_ext"
from perf_pai_crypto.c.

Includes:
- Shared alloc/free and sched_task handling
- NNPA events with exclude_kernel enforced, exclude_user rejected
- Setup CR0 bits for both PMUs

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:07 +01:00
Thomas Richter
3abb6b1675 s390/pai_crypto: Introduce PAI crypto specific event delete function
Introduce PAI crypto specific event delete function to handle
additional actions to be done at event removal.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:07 +01:00
Thomas Richter
35a27bad07 s390/pai_crypto: Make pai_root per-PMU and unify naming
Prepare the common PAI PMU driver to handle multiple PMUs.

Convert pai_root into an array indexed by PAI_PMU_IDX(event)
so that per-CPU state becomes per-PMU. Adjust all call sites
accordingly. Rename KMSG_COMPONENT and the s390dbf buffer from
"pai_crypto" to "pai" for consistent naming.

No functional change intended beyond log identifiers.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:07 +01:00
Thomas Richter
f124735413 s390/pai_crypto: Rename paicrypt_copy() to pai_copy()
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Rename paicrypt_copy() to pai_copy() to indicate its common usage.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:07 +01:00
Thomas Richter
42e6a0f6d2 s390/pai_crypto: Add common pai_del() function
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Add a commonly usable function pai_del() to delete the event on a CPU.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:07 +01:00
Thomas Richter
ac03223f07 s390/pai_crypto: Add common pai_stop() function
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.

Add a commonly usable function pai_stop() to stop the event on a CPU.
Call this common pai_stop() from paicrypt_del().

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:07 +01:00
Thomas Richter
a65a4d7e80 s390/pai_crypto: Add common pai_add() function
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.

Add a commonly usable function pai_add() to add the event on a CPU.
Call this common pai_add() from paicrypt_add().

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:06 +01:00
Thomas Richter
6fe66b2157 s390/pai_crypto: Add common pai_start() function
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.

Add a commonly usable function pai_start() to start the event on a CPU.
The function expects a PAI PMU specific read function as its second
parameter to read out the start value for an event.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:06 +01:00
Thomas Richter
8f6116fd49 s390/pai_crypto: Add common pai_read() function
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.

Add a commonly usable function pai_read() to read counter values.
The function expects a PAI PMU specific read function as its second
parameter.
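
A minimal sketch of such a parameterized read helper (the delta accounting
shown is an assumption for illustration, not the actual driver code):

#include <linux/perf_event.h>

static void pai_read_sketch(struct perf_event *event,
			    u64 (*getctr)(struct perf_event *event))
{
	u64 new, prev;

	new = getctr(event);		/* PMU specific counter read */
	prev = local64_xchg(&event->hw.prev_count, new);
	local64_add(new - prev, &event->count);
}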

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:06 +01:00
Thomas Richter
74466e87e7 s390/pai_crypto: Unify sample push logic and update context handling
Unify naming and logic for PAI PMU drivers to support both PMUs
pai_crypto and pai_ext.

Rename paicrypt_push_sample() to pai_push_sample() to reflect
its common usage.  Add detailed comments about invocation context
and scheduler callbacks.  Use struct pai_pmu to determine area_size
instead of PAGE_SIZE for counter backup.
Remove obsolete variable paicrypt_cnt.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:06 +01:00
Thomas Richter
0f1c0d754a s390/pai_crypto: Rename paicrypt_have_samples() to pai_have_samples()
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Rename paicrypt_have_samples() to pai_have_samples() to reflect
its common usage. No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:06 +01:00
Thomas Richter
360e180d8b s390/pai_crypto: Rename paicrypt_getctr() to pai_getctr()
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.

Rename paicrypt_getctr() to pai_getctr() to reflect its common
purpose. pai_getctr() now uses the pai_pmu table to extract PAI PMU
characteristics such as kernel_offset inside the counter area page.
Also rename paicrypt_have_sample() to pai_have_sample().

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:06 +01:00
Thomas Richter
42cd0c8242 s390/pai_crypto: Rename paicrypt_getdata() to pai_getdata()
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Rename paicrypt_getdata() to pai_getdata(). Use the PAI PMU
characteristics in the pai_pmu table to determine the number
of counters to be extracted.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:05 +01:00
Thomas Richter
65b9831bd3 s390/pai_crypto: Rename some function for common usage.
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Rename functions
 - paicrypt_free() -> pai_free()
 - paicrypt_destroy_event() -> pai_destroy_event()
 - paicrypt_destroy_event_cpu() -> pai_destroy_event_cpu()
to reflect their future common usage.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:05 +01:00
Thomas Richter
413957980a s390/pai_crypto: Introduce generic event init using pai_pmu[]
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.

Rework PAI crypto event initialization. Add a common
function for event initialization. It uses the PAI characteristics
stored in the pai_pmu table instead of hardcoded values.
Enlarge pai_event_valid() to check all event validation aspects.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:05 +01:00
Thomas Richter
a3f8423622 s390/pai_crypto: Add PAI crypto characteristics table for parameters
Create and add a PMU characteristics table to store the parameters
of the PAI crypto PMU. This table contains PMU details such as
 - number of available counters
 - name of these counters to export to /sysfs
 - Size of the memory mapped counter area
 - base number of first counter
 - etc

Also define a PMU specific initialization function to be called when
a PAI PMU feature is supported. At device driver initialization, test
these features and, if available, use the qpaci instruction to
retrieve the number of available counters. Also export these counter
names to /sysfs and register this PMU.
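
A sketch of what such a per-PMU characteristics table entry could look like
(structure and field names are assumptions, not the real layout):

struct pai_pmu_sketch {
	const char	*name;		/* PMU name, e.g. "pai_crypto" */
	unsigned int	num_cntr;	/* number of available counters */
	unsigned int	base;		/* base number of the first counter */
	unsigned long	area_size;	/* size of the mapped counter area */
	const char * const *cntr_names;	/* counter names exported to /sysfs */
	int		(*init)(void);	/* PMU specific initialization */
};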

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:05 +01:00
Thomas Richter
387c7b5f04 s390/pai_crypto: Rename paicrypt_root_alloc() and paicrypt_root_free()
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Rename functions paicrypt_root_alloc() and paicrypt_root_free()
to pai_root_alloc() and pai_root_free().
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:05 +01:00
Thomas Richter
3f082c2e47 s390/pai_crypto: Rename structure paicrypt_root
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Rename structure paicrypt_root to pai_root.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:04 +01:00
Thomas Richter
a626e0d46a s390/pai_crypto: Rename structure paicrypt_map to pai_map
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Rename structure paicrypt_map to pai_map.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:04 +01:00
Thomas Richter
2706ea193a s390/pai_crypto: Rename structure paicrypt_mapptr to pai_mapptr
To support one common PAI PMU device driver which handles
both PMUs pai_crypto and pai_ext, use a common naming scheme
for structures and variables suitable for both device drivers.
Rename structure paicrypt_mapptr to pai_mapptr.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:04 +01:00
Thomas Richter
c124208b74 s390/pai_crypto: Rename member paicrypt_map::page
Rename the member page in struct paicrypt_map to area. This rename
creates consistent naming for both PMU drivers, pai_crypto and
pai_ext. No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:04 +01:00
Thomas Richter
abc524caa1 s390/pai_crypto: Rename variable cfm_dbg
The global variable cfm_dbg points to the s390dbf debug buffer.
Rename it to paidbg to better reflect its purpose.
No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-14 11:30:04 +01:00
Sascha Bischoff
a04fbfb8a1 arm64/sysreg: Add ICH_VMCR_EL2
Add the ICH_VMCR_EL2 register, which is required for the upcoming
GICv5 KVM support. This register has two different field encodings,
based on whether it is used for GICv3 or GICv5-based VMs. The
GICv5-specific field encodings are generated with a FEAT_GCIE prefix.

This register is already described in the GICv3 KVM code
directly. This will be ported across to use the generated encodings as
part of an upcoming change.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:09:46 +00:00
Sascha Bischoff
a0b130eedd arm64/sysreg: Move generation of RES0/RES1/UNKN to function
The RESx and UNKN define generation happens in two places
(EndSysreg and EndSysregFields), and was using nearly identical
code. Split this out into a function, and call that instead, rather
than keeping the duplicated code.

There are no changes to the generated sysregs as part of this change.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:09:46 +00:00
Sascha Bischoff
fe2ef46995 arm64/sysreg: Support feature-specific fields with 'Prefix' descriptor
Some system register field encodings change based on, for example, the
in-use architecture features or the context in which they are
accessed. In order to support these different field encodings,
introduce the Prefix descriptor (Prefix, EndPrefix) for describing
such sysregs.

The Prefix descriptor can be used in the following way:

        Sysreg  EXAMPLE 0    1    2    3    4
        Prefix  FEAT_A
        Field   63:0    Foo
        EndPrefix
        Prefix  FEAT_B
        Field   63:1    Bar
        Res0    0
        EndPrefix
        Field   63:0    Baz
        EndSysreg

This will generate a single set of system register encodings (REG_,
SYS_, ...), and then generate three sets of field definitions for the
system register called EXAMPLE. The first set is prefixed by FEAT_A,
e.g. FEAT_A_EXAMPLE_Foo. The second set is prefixed by FEAT_B, e.g.,
FEAT_B_EXAMPLE_Bar. The third set is not given a prefix at all,
e.g. EXAMPLE_BAZ. For each set, a corresponding set of defines for
Res0, Res1, and Unkn is generated.
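
For illustration only (the exact macro shapes follow the generator's existing
output conventions and are not reproduced verbatim here), the per-prefix field
definitions for the example above would look roughly like:

#define FEAT_A_EXAMPLE_Foo	GENMASK(63, 0)
#define FEAT_B_EXAMPLE_Bar	GENMASK(63, 1)
#define FEAT_B_EXAMPLE_RES0	(UL(1) << 0)
#define EXAMPLE_Baz		GENMASK(63, 0)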

The intent for the final prefix-less fields is to describe default or
legacy field encodings. This ensure that prefixed encodings can be
added to already-present sysregs without affecting existing legacy
code. Prefixed fields must be defined before those without a prefix,
and this is checked by the generator. This ensures consisnt ordering
within the sysregs definitions.

The Prefix descriptor can be used within Sysreg or SysregFields
blocks. Field, Res0, Res1, Unkn, Rax, SignedEnum, Enum can all be used
within a Prefix block. Fields and Mapping can not. Fields that vary
with features must be described as part of a SysregFields block,
instead. Mappings, which are just a code comment, make little sense in
this context, and have hence not been included.

There are no changes to the generated system register definitions as
part of this change.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:09:46 +00:00
Sascha Bischoff
0aab5772a5 arm64/sysreg: Fix checks for incomplete sysreg definitions
The checks for incomplete sysreg definitions were checking if the
next_bit was greater than 0, which is incorrect and missed occasions
where bit 0 hasn't been defined for a sysreg. The reason is that
next_bit is -1 when all bits have been processed (LSB - 1).

Change the checks to use >= 0, instead. Also, set next_bit in Mapping
to -1 instead of 0 to match these new checks.

There are no changes to the generated sysreg definitions as part of
this change, and conveniently no sysregs lack a definition for bit 0.

Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:09:46 +00:00
Yang Shi
37cb0aab90 arm64: mm: make linear mapping permission update more robust for patial range
The commit fcf8dda8cc ("arm64: pageattr: Explicitly bail out when changing
permissions for vmalloc_huge mappings") made the permission update for a
partial range more robust. But the linear mapping permission update still
assumes updating the whole range, by iterating from the first page all the
way to the last page of the area.

Make it more robust by updating the linear mapping permission starting from
the page mapped by the start address, and adjust numpages accordingly.

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 18:07:21 +00:00
Dev Jain
c320dbb7c8 arm64/mm: Elide TLB flush in certain pte protection transitions
Currently arm64 does an unconditional TLB flush in mprotect(). This is not
required for some cases, for example, when changing from PROT_NONE to
PROT_READ | PROT_WRITE (a real use case - glibc malloc does this to emulate
growing into the non-main heaps), and unsetting uffd-wp in a range.

Therefore, implement pte_needs_flush() for arm64, which is already
implemented by some other arches as well.

Running a userspace program changing permissions back and forth between
PROT_NONE and PROT_READ | PROT_WRITE, and measuring the average time taken
for the none->rw transition, I get a reduction from 3.2 microseconds to
2.85 microseconds, giving a 12.3% improvement.

Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 17:50:25 +00:00
Linu Cherian
1b214452b6 arm64/mm: Rename try_pgd_pgtable_alloc_init_mm
With the BUG_ON in pgd_pgtable_alloc_init_mm moved up to a higher layer,
the gfp flags are the only difference between try_pgd_pgtable_alloc_init_mm
and pgd_pgtable_alloc_init_mm. Hence rename the "try" version
to pgd_pgtable_alloc_init_mm_gfp.

Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Linu Cherian <linu.cherian@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 16:00:19 +00:00
Chaitanya S Prakash
bfc184cb1b arm64/mm: Allow __create_pgd_mapping() to propagate pgtable_alloc() errors
arch_add_memory() is used to hotplug memory into a system, but as part
of its implementation it calls __create_pgd_mapping(), which uses
pgtable_alloc() in order to build intermediate page tables. As this path
was initially only used during early boot, pgtable_alloc() is designed to
BUG_ON() on failure. However, in the event that memory hotplug is
attempted when the system's memory is extremely tight and the allocation
were to fail, it would lead to panicking the system, which is not
desirable. Hence update __create_pgd_mapping() and all its callers to be
non-void and propagate -ENOMEM on allocation failure to allow the system
to fail gracefully.

But during early boot, if there is an allocation failure, we want the
system to panic. Hence create a wrapper around __create_pgd_mapping()
called early_create_pgd_mapping() which is designed to panic if the
return value is non-zero. All the init calls are updated to use this
wrapper rather than the modified __create_pgd_mapping() to restore the
original behavior.
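
A sketch of the wrapper described above, assuming it mirrors the argument
list of __create_pgd_mapping() (the exact prototype is not reproduced here):

static void __init early_create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
					    unsigned long virt, phys_addr_t size,
					    pgprot_t prot,
					    phys_addr_t (*pgtable_alloc)(int),
					    int flags)
{
	int ret;

	ret = __create_pgd_mapping(pgdir, phys, virt, size, prot,
				   pgtable_alloc, flags);
	if (ret)
		panic("Failed to create page table mapping: %d\n", ret);
}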

Fixes: 4ab2150615 ("arm64: Add memory hotplug support")
Reviewed-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Chaitanya S Prakash <chaitanyas.prakash@arm.com>
Signed-off-by: Linu Cherian <linu.cherian@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 16:00:19 +00:00
Anshuman Khandual
b0a3f0e894 arm64/sysreg: Replace TCR_EL1 field macros
This just replaces all used TCR_EL1 field macros with tools sysreg variant
based fields and subsequently drops them from the header (pgtable-hwdef.h),
although while retaining the ones used for KVM (represented via the sysreg
tools format).

Cc: Will Deacon <will@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-13 15:58:30 +00:00
Xianwei Zhao
fc584d871c irqchip/meson-gpio: Add support for Amlogic S6 S7 and S7D SoCs
The Amlogic S6/S7/S7D SoCs support GPIO interrupt lines:

    S6 IRQ Number:
    - 99:98    2 pins on bank CC
    - 97       1 pin  on bank TESTN
    - 96:81   16 pins on bank A
    - 80:65   16 pins on bank Z
    - 64:45   20 pins on bank X
    - 44:37    8 pins on bank H offs H1
    - 36:32    5 pins on bank F
    - 31:25    7 pins on bank D
    - 24:22    3 pins on bank E
    - 21:14    8 pins on bank C
    - 13:0    14 pins on bank B

    S7 IRQ Number:
    - 83:82    2 pins on bank CC
    - 81       1 pin  on bank TESTN
    - 80:68   13 pins on bank Z
    - 67:48   20 pins on bank X
    - 47:36   12 pins on bank H
    - 35:24   12 pins on bank D
    - 23:22    2 pins on bank E
    - 21:14    8 pins on bank C
    - 13:0    14 pins on bank B

    S7D IRQ Number:
    - 83:82    2 pins on bank CC
    - 81:75    7 pins on bank DV
    - 74       1 pin  on bank TESTN
    - 73:61   13 pins on bank Z
    - 60:41   20 pins on bank X
    - 40:29   12 pins on bank H
    - 28:24    5 pins on bank D
    - 23:22    2 pins on bank E
    - 21:14    8 pins on bank C
    - 13:0    14 pins on bank B

Add the required compatibles and interrupt count initializers.

Signed-off-by: Xianwei Zhao <xianwei.zhao@amlogic.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://patch.msgid.link/20251105-irqchip-gpio-s6-s7-s7d-v1-2-b4d1fe4781c1@amlogic.com
2025-11-13 14:04:16 +01:00
Xianwei Zhao
e4ca152008 dt-bindings: interrupt-controller: Add support for Amlogic S6 S7 and S7D SoCs
Update the device tree binding document for GPIO interrupt controller of
Amlogic S6 S7 and S7D SoCs.

Signed-off-by: Xianwei Zhao <xianwei.zhao@amlogic.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20251105-irqchip-gpio-s6-s7-s7d-v1-1-b4d1fe4781c1@amlogic.com
2025-11-13 14:04:16 +01:00
Sean Christopherson
6276c67f2b x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possible
Extend KVM's export macro framework to provide EXPORT_SYMBOL_FOR_KVM(),
and use the helper macro to export symbols for KVM throughout x86 if and
only if KVM will build one or more modules, and only for those modules.

To avoid unnecessary exports when CONFIG_KVM=m but kvm.ko will not be
built (because no vendor modules are selected), let arch code #define
EXPORT_SYMBOL_FOR_KVM to suppress/override the exports.
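
A rough sketch of the shape of such an override (the config guard, module
list, and use of EXPORT_SYMBOL_GPL_FOR_MODULES() here are assumptions, not
the actual definitions):

#if IS_MODULE(CONFIG_KVM)
/* Export only to the KVM modules (hypothetical module list). */
#define EXPORT_SYMBOL_FOR_KVM(sym) \
	EXPORT_SYMBOL_GPL_FOR_MODULES(sym, "kvm,kvm-intel,kvm-amd")
#else
/* kvm.ko will not be built: let the export compile away entirely. */
#define EXPORT_SYMBOL_FOR_KVM(sym)
#endif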

Note, the set of symbols to restrict to KVM was generated by manual search
and audit; any "misses" are due to human error, not some grand plan.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251112173944.1380633-5-seanjc%40google.com
2025-11-12 15:29:38 -08:00
Sean Christopherson
e6f2d5866c x86/mm: Drop unnecessary export of "ptdump_walk_pgd_level_debugfs"
Don't export "ptdump_walk_pgd_level_debugfs" as its sole user is
arch/x86/mm/debug_pagetables.c, which can't be built as a module.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251112173944.1380633-4-seanjc%40google.com
2025-11-12 15:24:42 -08:00
Sean Christopherson
9c26c91e10 x86/mtrr: Drop unnecessary export of "mtrr_state"
Don't export "mtrr_state" as usage is limited to arch/x86/kernel/cpu/mtrr
(and nothing outside of that directory even includes the local mtrr.h).

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251112173944.1380633-3-seanjc%40google.com
2025-11-12 15:24:42 -08:00
Sean Christopherson
ed02882460 x86/bugs: Drop unnecessary export of "x86_spec_ctrl_base"
Don't export x86_spec_ctrl_base as it's used only in bugs.c and process.c,
neither of which can be built into a module.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/20251112173944.1380633-2-seanjc%40google.com
2025-11-12 15:24:42 -08:00
Bo Liu
337f7e3a4b arm64: Fix double word in comments
Remove the repeated word "the" in comments.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-12 17:07:59 +00:00
mrigendrachaubey
96ac403ea2 arm64: Fix typos and spelling errors in comments
This patch corrects several minor typographical and spelling errors
in comments across multiple arm64 source files.

No functional changes.

Signed-off-by: mrigendrachaubey <mrigendra.chaubey@gmail.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-12 17:06:21 +00:00
Ryan Chen
7083e14225 dt-bindings: interrupt-controller: aspeed,ast2700: Correct #interrupt-cells and interrupts count
Update the AST2700 interrupt controller binding to match the actual
hardware and the irq-aspeed-intc driver behavior.

 - Interrupts:

    First-level INTC banks request multiple interrupt lines to the root
    GIC, with a maximum of 10 per bank. Second-level INTC banks request
    only one interrupt line to their parent INTC-IC. Therefore, set the
    interrupts property to allow a minimum of 1 and a maximum of 10
    entries.

 - #interrupt-cells:

    Set '#interrupt-cells' to <1> since the aspeed intc driver does not
    support specifying a trigger type; only the interrupt index is used.

Signed-off-by: Ryan Chen <ryan_chen@aspeedtech.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20251030060155.2342604-2-ryan_chen@aspeedtech.com
2025-11-11 22:20:45 +01:00
Junhui Liu
47a4ebbf91 irqchip/aclint-sswi: Add Nuclei UX900 support
Reuse the generic ACLINT SSWI probe for Nuclei UX900 since it is
compliant with the ACLINT specification.

Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251021-dr1v90-basic-dt-v3-9-5478db4f664a@pigmoral.tech
2025-11-11 22:17:22 +01:00
Junhui Liu
a1c3a7d7ee dt-bindings: interrupt-controller: Add Anlogic DR1V90 ACLINT SSWI
Add SSWI support for Anlogic DR1V90 SoC, which uses Nuclei UX900 with a
TIMER unit compliant with the ACLINT specification.

Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20251021-dr1v90-basic-dt-v3-6-5478db4f664a@pigmoral.tech
2025-11-11 22:17:21 +01:00
Junhui Liu
579951da64 dt-bindings: interrupt-controller: Add Anlogic DR1V90 ACLINT MSWI
Add MSWI support for Anlogic DR1V90 SoC, which uses Nuclei UX900 with a
TIMER unit compliant with the ACLINT specification.

Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20251021-dr1v90-basic-dt-v3-5-5478db4f664a@pigmoral.tech
2025-11-11 22:17:21 +01:00
Junhui Liu
b90ac5fe32 dt-bindings: interrupt-controller: Add Anlogic DR1V90 PLIC
Add PLIC support for Anlogic DR1V90.

Signed-off-by: Junhui Liu <junhui.liu@pigmoral.tech>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20251021-dr1v90-basic-dt-v3-4-5478db4f664a@pigmoral.tech
2025-11-11 22:17:21 +01:00
Krzysztof Kozlowski
45cc441de7 irqchip/irq-bcm7038-l1: Remove unused reg_mask_status()
reg_mask_status() is not referenced anywhere, leading to a W=1 warning:

  irq-bcm7038-l1.c:85:28: error: unused function 'reg_mask_status' [-Werror,-Wunused-function]

Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20251106155200.337399-2-krzysztof.kozlowski@linaro.org
2025-11-11 22:11:17 +01:00
Charles Mirabile
a045359e72 irqchip/sifive-plic: Fix call to __plic_toggle() in M-Mode code path
The code path for M-Mode Linux that disables interrupts for other contexts
was missed when refactoring __plic_toggle().

Since the new version caches updates to the state for the primary context,
its use in this code path is no longer desirable even if it could be made
correct.

Replace the calls to __plic_toggle() with a loop that simply disables all
of the interrupts in groups of 32 with a direct MMIO write.
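
A sketch of such a loop (the function, its arguments, and the register
layout below are assumptions for illustration):

#include <linux/io.h>

static void plic_disable_all_sketch(void __iomem *enable_base,
				    unsigned int nr_irqs)
{
	unsigned int hwirq;

	/* Each 32-bit enable word covers 32 interrupt sources. */
	for (hwirq = 0; hwirq < nr_irqs; hwirq += 32)
		writel(0, enable_base + (hwirq / 32) * sizeof(u32));
}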

Fixes: 14ff9e54dd ("irqchip/sifive-plic: Cache the interrupt enable state")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Charles Mirabile <cmirabil@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251103161813.2437427-1-cmirabil@redhat.com
Closes: https://lore.kernel.org/oe-kbuild-all/202510271316.AQM7gCCy-lkp@intel.com/
2025-11-11 22:11:16 +01:00
Li Qiang
df717b9564 arm64: add unlikely hint to MTE async fault check in el0_svc_common
Add unlikely() hint to the _TIF_MTE_ASYNC_FAULT flag check in
el0_svc_common() since asynchronous MTE faults are expected to be
rare occurrences during normal system call execution.

This optimization helps the compiler to improve instruction caching
and branch prediction for the common case where no asynchronous
MTE faults are pending, while maintaining correct behavior for
the exceptional case where such faults need to be handled prior
to system call execution.

Signed-off-by: Li Qiang <liqiang01@kylinos.cn>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 19:49:19 +00:00
Thomas Huth
287d163322 arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-uapi headers
While the GCC and Clang compilers already define __ASSEMBLER__
automatically when compiling assembly code, __ASSEMBLY__ is a
macro that only gets defined by the Makefiles in the kernel.
This can be very confusing when switching between userspace
and kernelspace coding, or when dealing with uapi headers that
rather should use __ASSEMBLER__ instead. So let's standardize now
on the __ASSEMBLER__ macro that is provided by the compilers.

This is a mostly mechanical patch (done with a simple "sed -i"
statement), except for the following files where comments with
mis-spelled macros were tweaked manually:

 arch/arm64/include/asm/stacktrace/frame.h
 arch/arm64/include/asm/kvm_ptrauth.h
 arch/arm64/include/asm/debug-monitors.h
 arch/arm64/include/asm/esr.h
 arch/arm64/include/asm/scs.h
 arch/arm64/include/asm/memory.h

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 19:35:59 +00:00
Thomas Huth
639f08fc20 arm64: Replace __ASSEMBLY__ with __ASSEMBLER__ in uapi headers
__ASSEMBLY__ is only defined by the Makefile of the kernel, so
this is not really useful for uapi headers (unless the userspace
Makefile defines it, too). Let's switch to __ASSEMBLER__ which
gets set automatically by the compiler when compiling assembly
code.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 19:35:58 +00:00
Osama Abdelkader
420cab0155 arm64: acpi: add newline to deferred APEI warning
Add the missing newline to the pr_warn_ratelimited() message in apei_claim_sea().

Signed-off-by: Osama Abdelkader <osama.abdelkader@gmail.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 19:28:37 +00:00
Linus Walleij
555827a064 arm64: entry: Clean out some indirection
The conversion to generic IRQ entry left some functions
in the EL1 (kernel) IRQ entry path very shallow, so drop
the __inner_functions() where appropriate, saving some
time and stack.

This is not a fix but an optimization.

Drop stale comments about irqentry_enter/exit() while we
are at it.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 19:28:05 +00:00
Anshuman Khandual
e2e21a9757 arm64/mm: Ensure PGD_SIZE is aligned to 64 bytes when PA_BITS = 52
Although the comment clearly states the PGD table's alignment requirement
(when PA_BITS = 52), the subsequent BUILD_BUG_ON() compares the size to 64
bytes instead. Change it into an actual alignment test.
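
A sketch of what such an alignment test could look like (not the exact hunk):

	/* PGD table must be 64-byte aligned when PA_BITS = 52. */
	BUILD_BUG_ON(!IS_ALIGNED(PGD_SIZE, 64));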

Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 19:13:03 +00:00
Ard Biesheuvel
a5baf582f4 arm64/efi: Call EFI runtime services without disabling preemption
The only remaining reason why EFI runtime services are invoked with
preemption disabled is the fact that the mm is swapped out behind the
back of the context switching code.

The kernel no longer disables preemption in kernel_neon_begin().
Furthermore, the EFI spec is being clarified to explicitly state that
only baseline FP/SIMD is permitted in EFI runtime service
implementations, and so the existing kernel mode NEON context switching
code is sufficient to preserve and restore the execution context of an
in-progress EFI runtime service call.

Most EFI calls are made from the efi_rts_wq, which is serviced by a
kthread. As kthreads never return to user space, they usually don't have
an mm, and so we can use the existing infrastructure to swap in the
efi_mm while the EFI call is in progress. This is visible to the
scheduler, which will therefore reactivate the selected mm when
switching out the kthread and back in again.

Given that the EFI spec explicitly permits runtime services to be called
with interrupts enabled, firmware code is already required to tolerate
interruptions. So rather than disable preemption, disable only migration
so that EFI runtime services are less likely to cause scheduling delays.
To avoid potential issues where runtime services are interrupted while
polling the secure firmware for async completions, keep migration
disabled so that a runtime service invocation does not resume on a
different CPU from the one it was started on.
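
A minimal sketch of the pattern (the wrapper and callback are made up;
migrate_disable()/migrate_enable() are the generic primitives referred to
above):

#include <linux/efi.h>
#include <linux/preempt.h>

static efi_status_t efi_call_pinned(efi_status_t (*fn)(void))
{
	efi_status_t status;

	migrate_disable();	/* stay on this CPU; preemption remains enabled */
	status = fn();
	migrate_enable();

	return status;
}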

Note, though, that the firmware executes at the same privilege level as
the kernel, and is therefore able to disable interrupts altogether.

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 18:59:22 +00:00
Ard Biesheuvel
6b9c98e657 arm64/efi: Move uaccess en/disable out of efi_set_pgd()
efi_set_pgd() will no longer be called when invoking EFI runtime
services via the efi_rts_wq work queue, but the uaccess en/disable are
still needed when using PAN emulation using TTBR0 switching. So move
these into the callers.

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 18:59:22 +00:00
Ard Biesheuvel
1068cb52e8 arm64/efi: Drop efi_rt_lock spinlock from EFI arch wrapper
Since commit

  5894cf571e ("acpi/prmt: Use EFI runtime sandbox to invoke PRM handlers")

all EFI runtime calls on arm64 are routed via the EFI runtime wrappers,
which are serialized using the efi_runtime_lock semaphore.

This means the efi_rt_lock spinlock in the arm64 arch wrapper code has
become redundant, and can be dropped. For robustness, replace it with an
assert that the EFI runtime lock is in fact held by 'current'.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 18:59:22 +00:00
Ard Biesheuvel
7137a203b2 arm64/fpsimd: Permit kernel mode NEON with IRQs off
Currently, may_use_simd() will return false when called from a context
where IRQs are disabled. One notable case where this happens is when
calling the ResetSystem() EFI runtime service from the reboot/poweroff
code path. For this case alone, there is a substantial amount of FP/SIMD
support code to handle the corner case where a EFI runtime service is
invoked with IRQs disabled.

The only reason kernel mode SIMD is not allowed when IRQs are disabled
is that re-enabling softirqs in this case produces a noisy diagnostic
when lockdep is enabled. The warning is valid, in the sense that
delivering pending softirqs over the back of the call to
local_bh_enable() is problematic when IRQs are disabled.

While the API lacks a facility to simply mask and unmask softirqs
without triggering their delivery, disabling softirqs is not needed to
begin with when IRQs are disabled, given that softirqs are only ever
taken asynchronously over the back of a hard IRQ.

So dis/enable softirq processing conditionally, based on whether IRQs
are enabled, and relax the check in may_use_simd().
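
A sketch of the conditional handling described above (the wrapper is
hypothetical; the primitives are the usual kernel ones):

#include <linux/bottom_half.h>
#include <linux/irqflags.h>
#include <linux/types.h>

static void kernel_mode_simd_section_sketch(void (*work)(void))
{
	bool irqs_on = !irqs_disabled();

	if (irqs_on)
		local_bh_disable();	/* only mask softirqs when they can fire */
	work();
	if (irqs_on)
		local_bh_enable();
}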

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 18:59:22 +00:00
Ard Biesheuvel
1d038e8018 arm64/fpsimd: Don't warn when EFI execution context is preemptible
Kernel mode FP/SIMD no longer requires preemption to be disabled, so
only warn on uses of FP/SIMD from preemptible context if the fallback
path is taken for cases where kernel mode NEON would not be allowed
otherwise.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 18:59:22 +00:00
Ard Biesheuvel
a286050120 efi/runtime-wrappers: Keep track of the efi_runtime_lock owner
The EFI runtime wrappers use a file local semaphore to serialize access
to the EFI runtime services. This means that any calls to the arch
wrappers around the runtime services will also be serialized, removing
the need for redundant locking.

For robustness, add a facility that allows those arch wrappers to assert
that the semaphore was taken by the current task.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 18:59:22 +00:00
Ard Biesheuvel
40374d308e efi: Add missing static initializer for efi_mm::cpus_allowed_lock
Initialize the cpus_allowed_lock struct member of efi_mm.

Cc: stable@vger.kernel.org
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-11 18:59:22 +00:00
Borislav Petkov (AMD)
b2c1dd6c6f x86/coco/sev: Convert has_cpuflag() to use cpu_feature_enabled()
Drop one redundant definition, while at it.

There should be no functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251031122122.GKaQSpwhLvkinKKbjG@fat_crate.local
2025-11-11 16:42:31 +01:00
Ma Ke
f18e71cd6c EDAC/ie31200: Fix error handling in ie31200_register_mci
ie31200_register_mci() calls device_initialize() for priv->dev
unconditionally. However, in the error path, put_device() is not
called, leading to an imbalance. Similarly, in the unload path,
put_device() is missing.

Although edac_mc_free() eventually frees the memory, it does not
release the device initialized by device_initialize(). For code
readability and proper pairing of device_initialize()/put_device(),
add put_device() calls in both error and unload paths.

Found by code review.

Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://patch.msgid.link/20251106084735.35017-1-make24@iscas.ac.cn
2025-11-10 17:06:10 -08:00
Uros Bizjak
fd4e025526 x86/percpu: Use BIT_WORD() and BIT_MASK() macros
Use BIT_WORD() and BIT_MASK() macros from <linux/bits.h>
in <arch/x86/include/asm/percpu.h> instead of open-coding them.

No functional change intended.
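
For reference, the two helpers from <linux/bits.h> expand as shown in the
comments below, which is why the open-coded variants can go:

#include <linux/bits.h>

/* BIT_WORD(nr): index of the unsigned long that holds bit nr.    */
/* BIT_MASK(nr): mask for that bit within its unsigned long word. */
unsigned long word = BIT_WORD(70);	/* 70 / BITS_PER_LONG == 1 on 64-bit */
unsigned long mask = BIT_MASK(70);	/* 1UL << (70 % BITS_PER_LONG)       */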

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20250907184915.78041-1-ubizjak@gmail.com
2025-11-10 11:55:54 +01:00
Anshuman Khandual
fc1abd4093 arm64/mm: Drop cpu_set_[default|idmap]_tcr_t0sz()
These TCR_EL1 helpers don't have any other callers. Drop these redundant
indirections completely thus making this code more compact and readable.
No functional change.

Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-07 20:01:34 +00:00
Mark Rutland
a7717cad61 kselftest/arm64: Align zt-test register dumps
The zt-test output is awkward to read, as the 'Expected' value isn't
dumped on its own line and isn't aligned with the 'Got' value beneath.
For example:

  Mismatch: PID=5281, iteration=3270249   Expected [00a1146901a1146902a1146903a1146904a1146905a1146906a1146907a1146908a1146909a114690aa114690ba114690ca114690da114690ea114690fa11469]
          Got      [00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]
          SVCR: 2

Add a newline, matching the other FPSIMD/SVE/SME tests, so that we get
output that can be read more easily:

  Mismatch: PID=5281, iteration=3270249
          Expected [00a1146901a1146902a1146903a1146904a1146905a1146906a1146907a1146908a1146909a114690aa114690ba114690ca114690da114690ea114690fa11469]
          Got      [00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000]
          SVCR: 2

Admittedly this isn't all that important when the 'Got' value is all
zeroes, but otherwise this would be a major help for identifying which
portion of the 'Got' value is not as expected.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kselftest@vger.kernel.org
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-07 20:00:49 +00:00
Omar Sandoval
bf6b3fed18 arm64: remove unused ARCH_PFN_OFFSET
This is only relevant to the FLATMEM memory model, which isn't an option
since commit 782276b4d0 ("arm64: Force SPARSEMEM_VMEMMAP as the only
memory management model").

Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-07 19:59:37 +00:00
Ryo Takakura
d3b570eba7 arm64: use SOFTIRQ_ON_OWN_STACK for enabling softirq stack
Architectures with HAVE_SOFTIRQ_ON_OWN_STACK use their dedicated
softirq stack when !PREEMPT_RT. This condition is expressed by
SOFTIRQ_ON_OWN_STACK.

Let arm64 use SOFTIRQ_ON_OWN_STACK as well to select its
usage of the stack.

Signed-off-by: Ryo Takakura <ryotkkr98@gmail.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-07 19:55:52 +00:00
Dawei Li
4002068508 arm64: Remove assertion on CONFIG_VMAP_STACK
CONFIG_VMAP_STACK is selected unconditionally by the arm64 architecture
since commit ef6861b8e6 ("arm64: Mandate VMAP_STACK").

Remove the redundant assertion and headers.

Signed-off-by: Dawei Li <dawei.li@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2025-11-07 19:46:39 +00:00
Yicong Yang
7ab06ea41a arch_topology: Provide a stub topology_core_has_smt() for !CONFIG_GENERIC_ARCH_TOPOLOGY
The arm_pmu driver uses topology_core_has_smt() to retrieve the SMT
implementation, which depends on CONFIG_GENERIC_ARCH_TOPOLOGY.
The config is optional on arm platforms, so provide a
!CONFIG_GENERIC_ARCH_TOPOLOGY stub for topology_core_has_smt().

Fixes: c3d78c34ad ("perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511041757.vuCGOmFc-lkp@intel.com/
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Yicong Yang <yangyccccc@gmail.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-07 13:45:02 +00:00
Robin Murphy
2d7a824807 perf/arm-ni: Fix and optimise register offset calculation
LKP points out an operator precedence oversight in the new NoC S3
support that, annoyingly, my local W=1 build didn't flag. In fixing
that, we can also take the similarly-missed opportunity to cache the
version check itself at event_init time.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511041749.ok8zDP6u-lkp@intel.com/
Fixes: 8fa08f8835 ("perf/arm-ni: Add NoC S3 support")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-07 13:42:46 +00:00
Marco Crivellari
24e3848a2e RAS/CEC: Replace use of system_wq with system_percpu_wq
Switch to using system_percpu_wq because system_wq is going away as part of
a workqueue restructuring.

Currently if a user enqueues a work item using schedule_delayed_work() the
used workqueue is "system_wq" (per-cpu workqueue) while queue_delayed_work()
uses WORK_CPU_UNBOUND (used when a CPU is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use of
WORK_CPU_UNBOUND again.

This lack of consistency cannot be addressed without refactoring the API.
For more details see those commits and the Link tag below.

  128ea9f6cc ("workqueue: Add system_percpu_wq and system_dfl_wq")
  930c2ea566 ("workqueue: Add new WQ_PERCPU flag")

  [ bp: Massage commit message. ]

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de
2025-11-07 13:48:28 +01:00
Sumanth Korikkar
c1287d67c3 s390/sclp_mem: Consider global memory_hotplug.memmap_on_memory setting
When the global kernel command line parameter
memory_hotplug.memmap_on_memory is set to false, the per-memory-block
memmap_on_memory setting can still be set to true. However, when
configuring a memory block, add_memory_resource() would configure it
without memmap_on_memory.

i.e.
Even if the MHP_MEMMAP_ON_MEMORY flag is set,
mhp_supports_memmap_on_memory() returns false unless the kernel command
line parameter "memory_hotplug.memmap_on_memory" is enabled. When both
the flag and the cmdline parameter are set, the memory block can be
configured with or without memmap_on_memory support.

To ensure consistent behavior, permit configuring per-memory-block
memmap_on_memory only when the memory_hotplug.memmap_on_memory kernel
command line parameter is enabled.

This is similar to commit 73954d379e ("dax: add a sysfs knob to
control memmap_on_memory behavior")

Fixes: ff18dcb19a ("s390/sclp: Add support for dynamic (de)configuration of memory")
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:18:23 +01:00
Mete Durlu
8840cc4520 s390/hiperdispatch: Decrease steal time threshold
Higher steal time thresholds favor low utilization scenarios, which is not
the common case for s390. Set steal time threshold to a lower value to
prioritize vertical high and medium CPUs sooner and allow high utilization
scenarios to benefit from it.

Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Mete Durlu <meted@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:17:28 +01:00
Thorsten Blum
eb3a9b405b s390/smp: Mark pcpu_delegate() and smp_call_ipl_cpu() as __noreturn
pcpu_delegate() never returns to its caller. If the target CPU is the
current CPU, it calls __pcpu_delegate(), whose delegate function is not
supposed to return. In any case, even if __pcpu_delegate() unexpectedly
returns, pcpu_delegate() sends SIGP_STOP to the current CPU and waits
in an infinite loop. Annotate pcpu_delegate() with the __noreturn
attribute to improve compiler optimizations.

Also annotate smp_call_ipl_cpu() accordingly since it always calls
pcpu_delegate().
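
As a generic illustration of the attribute (not the s390 functions
themselves), a helper that cannot return is annotated so the compiler may
drop the dead return path at every call site:

        #include <linux/compiler_attributes.h>

        static void __noreturn spin_forever(void)
        {
                for (;;)
                        ;       /* never returns, callers need no cleanup path */
        }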

[hca: Merge two patches from Thorsten Blum]

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:17:28 +01:00
Thorsten Blum
f07ebfa5e4 s390/nmi: Annotate s390_handle_damage() with __noreturn
s390_handle_damage() ends by calling the non-returning function
disabled_wait() and therefore also never returns. Annotate it with the
__noreturn compiler attribute to improve compiler optimizations.

Remove the unreachable infinite while loop.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:17:28 +01:00
Bo Liu
858063c1ae s390: Fix double word in comments
Remove the repeated word "the" in comments.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:17:27 +01:00
Heiko Carstens
547e9feb0e Merge branch 'dat-enhancement-1'
Heiko Carstens says:

====================

Add the Dat-Enhancement facility 1 to the list of facilities which are
required to start the kernel. The facility provides the CSPG and IDTE
instructions. In particular the CSPG instruction can be used to replace a
valid page table entry with a different page table entry, which also
differs in the page frame real address.

Without the CSPG instruction it is possible to use the CSP instruction to
change valid page table entries; however, it only allows changing the lower
or higher 32 bits of such entries, which means it cannot be used to change
the page frame real address of valid page table entries.

Given that there is code around (e.g. HugeTLB vmemmap optimization) which
requires to change valid page table entries of the kernel mapping, without
the detour over an invalid page table entry, make the CSPG instruction
unconditionally available.

The Dat-Enhancement facility 1 is available since z990, which is older than
the currently supported minimum architecture (z10). Therefore adding this to
the architecture level set shouldn't cause any problems.

====================

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:14:00 +01:00
Heiko Carstens
68807a894f s390/mm: Replace the CSP instruction with CSPG
The CSPG instruction is part of the Dat-Enhancement facility 1, which
is always available. Given that it can be used everywhere the CSP
instruction can be used, replace CSP with CSPG everywhere.

This allows the csp() inline assembly to be removed. Also remove the
unused gmap_pmdp_csp() function.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:12:31 +01:00
Heiko Carstens
220d8e10d6 s390/mm: Remove cpu_has_idte()
Remove cpu_has_idte(). The IDTE instruction is part of the
Dat-Enhancement facility 1, which is always available.
Therefore remove the helper and now superfluous code.

Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:12:31 +01:00
Heiko Carstens
73c4b5d728 s390: Add Dat-Enhancement facility 1 to architecture level set
Add the Dat-Enhancement facility 1 to the list of facilities which are
required to start the kernel. The facility provides the CSPG and IDTE
instructions. In particular the CSPG instruction can be used to replace a
valid page table entry with a different page table entry, which also
differs in the page frame real address.

Without the CSPG instruction it is possible to use the CSP instruction to
change valid page table entries; however, it only allows changing the lower
or higher 32 bits of such entries, which means it cannot be used to change
the page frame real address of valid page table entries.

Given that there is code around (e.g. HugeTLB vmemmap optimization) which
requires to change valid page table entries of the kernel mapping, without
the detour over an invalid page table entry, make the CSPG instruction
unconditionally available.

The Dat-Enhancement facility 1 is available since z990, which is older than
the currently supported minimum architecture (z10). Therefore adding this
to the architecture level set shouldn't cause any problems.

Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-11-06 14:12:30 +01:00
Avadhut Naik
8616025ae6 EDAC: Remove the legacy EDAC sysfs interface
Commit

  1997471069 ("edac: add a new per-dimm API and make the old per-virtual-rank API obsolete")

introduced a new per-DIMM sysfs interface for EDAC making the old
per-virtual-rank sysfs interface obsolete.

Since this new sysfs interface was introduced more than a decade ago, remove
the obsolete legacy interface.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
2025-11-06 13:21:29 +01:00
Avadhut Naik
6a85796915 EDAC/amd64: Remove NUM_CONTROLLERS macro
Currently, the NUM_CONTROLLERS macro is used to limit the number of memory
controllers (UMCs) available per node. The number of UMCs available per node,
however, is already cached by the max_mcs variable of struct amd64_pvt.

Allocate the relevant data structures dynamically using the variable instead
of static allocation through the macro.

The max_mcs variable is used for legacy systems too. These systems have a max
of 2 controllers. Since the default value of max_mcs, set in per_family_init(),
is 2, these legacy systems are also covered.
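
A sketch of the dynamic allocation this describes, assuming pvt->max_mcs and
pvt->umc as named in the text (not the literal patch):

        /* One UMC descriptor per controller instead of a fixed-size array. */
        pvt->umc = kcalloc(pvt->max_mcs, sizeof(*pvt->umc), GFP_KERNEL);
        if (!pvt->umc)
                return -ENOMEM;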

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
2025-11-06 12:51:33 +01:00
Avadhut Naik
e9abd990ae EDAC/amd64: Generate ctl_name string at runtime
Currently, the ctl_name string is statically assigned based on the family and
model of the SOC when the amd64_edac module is loaded.

This is not strictly needed, however, as the string can be generated and
assigned at runtime through scnprintf().
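
For illustration, a runtime-generated name could be built along these lines;
the helper, buffer and format string are assumptions, not the exact patch:

        static void build_ctl_name(char *buf, size_t len, struct cpuinfo_x86 *c)
        {
                /* e.g. "F19h_M10h", derived from family/model at probe time */
                scnprintf(buf, len, "F%xh_M%02xh", c->x86, c->x86_model);
        }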

Remove all static assignments and generate the string at runtime. Also,
clean up the switch cases that became defunct and consolidate identical cases.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251106015727.1987246-1-avadhut.naik@amd.com
2025-11-06 12:35:59 +01:00
Yazen Ghannam
56f17be67a x86/mce/amd: Define threshold restart function for banks
Prepare for CMCI storm support by moving the common bank/block iterator code
to a helper function.

Include a parameter to switch the interrupt enable. This will be used by the
CMCI storm handling function.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05 22:38:31 +01:00
Yazen Ghannam
3206b41604 x86/mce/amd: Remove redundant reset_block()
Many of the checks in reset_block() are done again in the block reset
function. So drop the redundant checks.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05 22:34:53 +01:00
Yazen Ghannam
4efaec6e16 x86/mce/amd: Support SMCA Corrected Error Interrupt
AMD systems optionally support MCA thresholding which provides the ability for
hardware to send an interrupt when a set error threshold is reached. This
feature counts errors of all severities, but it is commonly used to report
correctable errors with an interrupt rather than polling.

Scalable MCA systems allow the platform to take control of this feature. In
this case, the OS will not see the feature configuration and control bits in
the MCA_MISC* registers. The OS will not receive the MCA thresholding
interrupt, and it will need to poll for correctable errors.

A "corrected error interrupt" will be available on Scalable MCA systems. This
will be used in the same configuration where the platform controls MCA
thresholding. However, the platform will now be able to send the MCA
thresholding interrupt to the OS.

Check for, and enable, this feature during per-CPU SMCA init.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05 22:10:23 +01:00
Usama Arif
8436112341 efi/libstub: Fix page table access in 5-level to 4-level paging transition
When transitioning from 5-level to 4-level paging, the existing code
incorrectly accesses page table entries by directly dereferencing CR3 and
applying PAGE_MASK. This approach has several issues:

- __native_read_cr3() returns the raw CR3 register value, which on x86_64
  includes not just the physical address but also flags Bits above the
  physical address width of the system (i.e. above __PHYSICAL_MASK_SHIFT) are
  also not masked.

- The pgd value is masked by PAGE_SIZE which doesn't take into account the
  higher bits such as _PAGE_BIT_NOPTISHADOW.

Replace this with proper accessor functions:

- native_read_cr3_pa(): Uses CR3_ADDR_MASK to additionally mask metadata out
  of CR3 (like SME or LAM bits). All remaining bits are real address bits or
  reserved and must be 0.

- mask pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting for flags
  above bit 51 (_PAGE_BIT_NOPTISHADOW in particular). Bits below 51, but above
  the max physical address are reserved and must be 0.
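
The shape of the fix as a sketch, under the identity-mapped boot environment
assumed above (variable names illustrative):

        /* Broken pattern (what the old code effectively did):
         *   pgd_t *pgd = (pgd_t *)(__native_read_cr3() & PAGE_MASK);
         *
         * Fixed pattern: strip CR3 metadata, mask the entry to its PFN bits.
         */
        pgd_t *pgdp = (pgd_t *)native_read_cr3_pa();
        unsigned long next_table = pgd_val(*pgdp) & PTE_PFN_MASK;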

Fixes: cb1c9e02b0 ("x86/efistub: Perform 4/5 level paging switch from the stub")
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Reported-by: Tobias Fleig <tfleig@meta.com>
Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://patch.msgid.link/20251103141002.2280812-3-usamaarif642@gmail.com
2025-11-05 17:31:32 +01:00
Usama Arif
eb22663125 x86/boot: Fix page table access in 5-level to 4-level paging transition
When transitioning from 5-level to 4-level paging, the existing code
incorrectly accesses page table entries by directly dereferencing CR3 and
applying PAGE_MASK. This approach has several issues:

- __native_read_cr3() returns the raw CR3 register value, which on x86_64
  includes not just the physical address but also flags. Bits above the
  physical address width of the system (i.e. above __PHYSICAL_MASK_SHIFT) are
  also not masked.

- The PGD entry is masked by PAGE_SIZE which doesn't take into account the
  higher bits such as _PAGE_BIT_NOPTISHADOW.

Replace this with proper accessor functions:

- native_read_cr3_pa(): Uses CR3_ADDR_MASK to additionally mask metadata out
  of CR3 (like SME or LAM bits). All remaining bits are real address bits or
  reserved and must be 0.

- mask pgd value with PTE_PFN_MASK instead of PAGE_MASK, accounting for flags
  above bit 51 (_PAGE_BIT_NOPTISHADOW in particular). Bits below 51, but above
  the max physical address are reserved and must be 0.

Fixes: e9d0e6330e ("x86/boot/compressed/64: Prepare new top-level page table for trampoline")
Reported-by: Michael van der Westhuizen <rmikey@meta.com>
Reported-by: Tobias Fleig <tfleig@meta.com>
Co-developed-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/a482fd68-ce54-472d-8df1-33d6ac9f6bb5@intel.com
2025-11-05 17:19:11 +01:00
Yazen Ghannam
134b1eabe6 x86/mce/amd: Enable interrupt vectors once per-CPU on SMCA systems
Scalable MCA systems have a per-CPU register that gives the APIC LVT offset
for the thresholding and deferred error interrupts.

Currently, this register is read once to set up the deferred error interrupt
and then read again for each thresholding block. Furthermore, the APIC LVT
registers are configured each time, but they only need to be configured once
per-CPU.

Move the APIC LVT setup to the early part of CPU init, so that the registers
are set up once. Also, this ensures that the kernel is ready to service the
interrupts before the individual error sources (each MCA bank) are enabled.

Apply this change only to SMCA systems to avoid breaking any legacy behavior.
The deferred error interrupt is technically advertised by the SUCCOR feature.
However, this was first made available on SMCA systems.  Therefore, only set
up the deferred error interrupt on SMCA systems and simplify the code.

Guidance from hardware designers is that the LVT offsets provided from the
platform should be used. The kernel should not try to enforce specific values.
However, the kernel should check that an LVT offset is not reused for multiple
sources.

Therefore, remove the extra checking and value enforcement from the MCE code.
The "reuse/conflict" case is already handled in setup_APIC_eilvt().

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05 16:51:27 +01:00
Yazen Ghannam
7cb735d7c0 x86/mce: Unify AMD DFR handler with MCA Polling
AMD systems optionally support a deferred error interrupt. The interrupt
should be used as another signal to trigger MCA polling. This is similar to
how other MCA interrupts are handled.

Deferred errors do not require any special handling related to the interrupt,
e.g. resetting or rearming the interrupt, etc.

However, Scalable MCA systems include a pair of registers, MCA_DESTAT and
MCA_DEADDR, that should be checked for valid errors. This check should be done
whenever MCA registers are polled. Currently, the deferred error interrupt
does this check, but the MCA polling function does not.

Call the MCA polling function when handling the deferred error interrupt. This
keeps all "polling" cases in a common function.

Add an SMCA status check helper. This will do the same status check and
register clearing that the interrupt handler has done. And it extends the
common polling flow to find AMD deferred errors.

Clear the MCA_DESTAT register at the end of the handler rather than the
beginning. This maintains the procedure that the 'status' register must be
cleared as the final step.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05 16:41:32 +01:00
Yazen Ghannam
34da4a5d68 x86/mce: Unify AMD THR handler with MCA Polling
AMD systems optionally support an MCA thresholding interrupt. The interrupt
should be used as another signal to trigger MCA polling. This is similar to
how the Intel Corrected Machine Check interrupt (CMCI) is handled.

AMD MCA thresholding is managed using the MCA_MISC registers within an MCA
bank. The OS will need to modify the hardware error count field in order to
reset the threshold limit and rearm the interrupt. Management of the MCA_MISC
register should be done as a follow up to the basic MCA polling flow. It
should not be the main focus of the interrupt handler.

Furthermore, future systems will have the ability to send an MCA thresholding
interrupt to the OS even when the OS does not manage the feature, i.e.
MCA_MISC registers are Read-as-Zero/Locked.

Call the common MCA polling function when handling the MCA thresholding
interrupt. This will allow the OS to find any valid errors whether or not the
MCA thresholding feature is OS-managed. Also, this allows the common MCA
polling options and kernel parameters to apply to AMD systems.

Add a callback to the MCA polling function to check and reset any threshold
blocks that have reached their threshold limit.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
2025-11-05 13:41:18 +01:00
Marc Herbert
41f4767000 x86/msr: Add CPU_OUT_OF_SPEC taint name to "unrecognized" pr_warn(msg)
While restricting access,

  a7e1f67ed2 ("x86/msr: Filter MSR writes")

also added a warning and started tainting the kernel.

But the warning message never mentioned tainting. Moreover, this uses the
"CPU_OUT_OF_SPEC" flag which is not clearly related to MSRs: that flag is
overloaded by several, fairly different situations, including some much
scarier ones.

So, without an expert around (thank you Dave Hansen), it would have been
practically impossible to root cause the tainting from just the log file at
hand. It would therefore be prudent to explicitly mention in the logs when
the tainting happens so that debugging crashes can be made easier.

Fix this by simply appending the CPU_OUT_OF_SPEC flag to the warning message.
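
A sketch of what the amended warning could look like; only the taint flag
name comes from the commit above, the message text and variable names are
illustrative:

        pr_warn_ratelimited("msr: unrecognized MSR write by %s (pid: %d), tainting kernel (TAINT_CPU_OUT_OF_SPEC)\n",
                            current->comm, current->pid);
        add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);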

This readability issue happened when staring at logs involving the Intel
Memory Latency Checker (among many other things going on in that log). The MLC
disables hardware prefetch.

  [ bp: Massage and extend commit message. ]

Signed-off-by: Marc Herbert <marc.herbert@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20251101-tainted-msr-v1-1-e00658ba04d4@linux.intel.com
2025-11-05 13:14:42 +01:00
Borislav Petkov (AMD)
47955b58cf x86/cpufeatures: Correct LKGS feature flag description
Quotation marks in cpufeatures.h comments are special and when the
comment begins with a quoted string, that string lands in /proc/cpuinfo,
turning it into a user-visible one.

The LKGS comment doesn't begin with a quoted string but just in case
drop the quoted "kernel" in there to avoid confusion. And while at it,
simply change the description into what the LKGS instruction does for
more clarity.

No functional changes.

Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20251015103548.10194-1-bp@kernel.org
2025-11-04 23:09:34 +01:00
Peter Zijlstra
1fe4002cf7 x86/ptrace: Always inline trivial accessors
A KASAN build bloats these single load/store helpers such that
the compiler fails to inline them:

  vmlinux.o: error: objtool: irqentry_exit+0x5e8: call to instruction_pointer_set() with UACCESS enabled

Make sure the compiler isn't allowed to do stupid.
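
A sketch mirroring the accessor named in the objtool error; on x86 the real
helper just stores into regs->ip, which is what this shows:

        static __always_inline void instruction_pointer_set(struct pt_regs *regs,
                                                            unsigned long val)
        {
                /* Forced inline: KASAN bloat must not turn this into an
                 * out-of-line call inside a UACCESS-enabled region. */
                regs->ip = val;
        }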

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251031105435.GU4068168@noisy.programming.kicks-ass.net
2025-11-04 08:36:20 +01:00
Peter Zijlstra
323d93f043 cleanup: Always inline everything
KASAN bloat caused cleanup helper functions to not get inlined:

  vmlinux.o: error: objtool: irqentry_exit+0x323: call to class_user_rw_access_destructor() with UACCESS enabled

Force inline all the cleanup helpers like they already are on normal
builds.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251031105435.GU4068168@noisy.programming.kicks-ass.net
2025-11-04 08:35:58 +01:00
Thomas Gleixner
32034df66b rseq: Switch to TIF_RSEQ if supported
TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
with the RSEQ fast path depending on it, but not really handling it.

Define a separate TIF_RSEQ in the generic TIF space and enable the full
separation of fast and slow path for architectures which utilize that.

That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the re-evaluation that
is therefore required at the end of vcpu_run() a NOOP on architectures which
utilize the generic TIF space and have a separate TIF_RSEQ.

The hypervisor TIF handling does not include the separate TIF_RSEQ as there
is no point in doing so. The guest neither knows nor cares about the VMM
host application's RSEQ state. That state is only relevant when the ioctl()
returns to user space.

The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
handling, but this only happens within exit_to_user_mode_loop(), so
arguably the hypervisor ioctl() code is long done when this happens.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.903622031@linutronix.de
2025-11-04 08:35:37 +01:00
Thomas Gleixner
7a5201ea19 rseq: Split up rseq_exit_to_user_mode()
Separate the interrupt and syscall exit handling. Syscall exit does not
require clearing the user_irq bit as it can't be set. On interrupt exit it
can be set when the interrupt did not result in a scheduling event and
therefore the return path did not invoke the TIF work handling, which would
have cleared it.

The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode is largely aiding user
space by enabling a larger amount of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.

On kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.842785700@linutronix.de
2025-11-04 08:35:30 +01:00
Thomas Gleixner
70fe25a3bc entry: Split up exit_to_user_mode_prepare()
exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
there is extra rseq work, which is only required in the interrupt exit
case.

Split up the function and provide wrappers for syscalls and interrupts,
which allows to separate the rseq exit work in the next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.782234789@linutronix.de
2025-11-04 08:35:17 +01:00
Thomas Gleixner
3db6b38dfe rseq: Switch to fast path processing on exit to user
Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.

This only works for architectures which use the generic entry code.
Architectures which still have their own incomplete hacks are not supported
and won't be.

This results in the following improvements:

  Kernel build          Before            After             Reduction

  exit to user         80692981           80514451
  signal checks:          32581                121                99%
  slowpath runs:        1201408  1.49%         198  0.00%        100%
  fastpath runs:                          675941  0.84%           N/A
  id updates:           1233989  1.53%       50541  0.06%         96%
  cs checks:            1125366  1.39%           0  0.00%        100%
    cs cleared:         1125366   100%            0              100%
    cs fixup:                 0     0%            0

  RSEQ selftests        Before            After             Reduction

  exit to user:       386281778          387373750
  signal checks:       35661203                  0              100%
  slowpath runs:      140542396 36.38%         100  0.00%       100%
  fastpath runs:                         9509789  2.51%          N/A
  id updates:         176203599 45.62%     9087994  2.35%        95%
  cs checks:          175587856 45.46%     4728394  1.22%        98%
    cs cleared:       172359544 98.16%     1319307 27.90%        99%
    cs fixup:           3228312  1.84%     3409087 72.10%

The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to
user invocations, they are relative to the actual 'cs check' invocations.

While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.

The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads, which do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate for actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.701201365@linutronix.de
2025-11-04 08:34:39 +01:00
Thomas Gleixner
05b44aef70 rseq: Implement fast path for exit to user
Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually returning
to user space.

This is the right point to do that because at this point the CPU and the MM
CID are stable and cannot longer change due to yet another reschedule.
That happens when the task is handling it via TIF_NOTIFY_RESUME in
resume_user_mode_work(), which is invoked from the exit to user mode work
loop.

The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults and in case the access to the user space
memory faults, it:

  - notes the fail in the event struct
  - raises TIF_NOTIFY_RESUME
  - returns false to the caller

The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW.

If the user memory inspection finds invalid data, the function returns
false as well and sets the fatal flag in the event struct along with
TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
and terminate the task with SIGSEGV as documented.

The initial decision to invoke any of this is based on one flag in the
event struct: @sched_switch. The decision is in pseudo ASM:

      load	tsk::event::sched_switch
      jnz	inspect_user_space
      mov	$0, tsk::event::events
      ...
      leave

So for the common case where the task was not scheduled out, this really
boils down to three instructions before going out if the compiler is not
completely stupid (and yes, some of them are).

If the condition is true, then it checks whether the CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair.  If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon. If either zero or the entry came via syscall
the critical section analysis is skipped.

If the comparison is false then the critical section has to be analyzed
because the event flag is then only true when entry from user was by
interrupt.

This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.

Note: As with quite some other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.638929615@linutronix.de
2025-11-04 08:34:18 +01:00
Thomas Gleixner
39a167560a rseq: Optimize event setting
After removing the various condition bits earlier it turns out that one
extra piece of information is needed to avoid setting event::sched_switch and
TIF_NOTIFY_RESUME unconditionally on every context switch.

The update of the RSEQ user space memory is only required, when either

  the task was interrupted in user space and schedules

or

  the CPU or MM CID changes in schedule() independent of the entry mode

Right now only the interrupt from user information is available.

Add an event flag, which is set when the CPU or MM CID or both change.

Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set.

It's an extra conditional in context_switch(), but the downside of
unconditionally handling RSEQ after a context switch to user is way more
significant. The utilized boolean logic minimizes this to a single
conditional branch.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de
2025-11-04 08:34:03 +01:00
Thomas Gleixner
e2d4f42271 rseq: Rework the TIF_NOTIFY handler
Replace the whole logic with a new implementation, which is shared with
signal delivery and the upcoming exit fast path.

Contrary to the original implementation, this ignores invocations from
KVM/IO-uring, which invoke resume_user_mode_work() with the @regs argument
set to NULL.

The original implementation updated the CPU/Node/MM CID fields, but that
was just a side effect, which was addressing the problem that this
invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update
on return to user space to be lost.

This problem has been addressed differently, so that it's no longer
required to do that update before entering the guest.

That might be considered a user visible change when the host's thread TLS
memory is mapped into the guest, but as this was never intentionally
supported, this abuse of kernel internal implementation details is not
considered an ABI break.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.517640811@linutronix.de
2025-11-04 08:33:54 +01:00
Thomas Gleixner
9f6ffd4ceb rseq: Separate the signal delivery path
Completely separate the signal delivery path from the notify handler as
they have different semantics versus the event handling.

The signal delivery only needs to ensure that the interrupted user context
was not in a critical section or the section is aborted before it switches
to the signal frame context. The signal frame context does not have the
original instruction pointer anymore, so that can't be handled on exit to
user space.

No point in updating the CPU/CID ids as they might change again before the
task returns to user space for real.

The fast path optimization, which checks for the 'entry from user via
interrupt' condition is only available for architectures which use the
generic entry code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.455429038@linutronix.de
2025-11-04 08:33:47 +01:00
Thomas Gleixner
0f085b4188 rseq: Provide and use rseq_set_ids()
Provide a new and straightforward implementation to set the IDs (CPU ID,
Node ID and MM CID), which can be later inlined into the fast path.

It does all operations in one scoped_user_rw_access() section and also
retrieves the critical section member (rseq::cs_rseq) from user space to avoid
another user..begin/end() pair. This is in preparation for optimizing the
fast path to avoid extra work when not required.

On rseq registration set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and
node and MM CID to zero. That's the same as the kernel internal reset
values. That makes the debug validation in the exit code work correctly on
the first exit to user space.

Use it to replace the whole related zoo in rseq.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.393972266@linutronix.de
2025-11-04 08:33:33 +01:00
Thomas Gleixner
eaa9088d56 rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
Make the syscall exit debug mechanism available via the static branch on
architectures which utilize the generic entry code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.333440475@linutronix.de
2025-11-04 08:33:27 +01:00
Thomas Gleixner
c1cbad8f99 rseq: Make exit debugging static branch based
Disconnect it from the config switch and use the static debug branch. This
is a temporary measure for validating the rework. At the end this check
needs to be hidden behind lockdep as it has nothing to do with the other
debug infrastructure, which mainly aids user space debugging by enabling a
zoo of checks which terminate misbehaving tasks instead of letting them
keep the hard to diagnose pieces.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.272660745@linutronix.de
2025-11-04 08:33:20 +01:00
Thomas Gleixner
f7ee1964ac rseq: Replace the original debug implementation
Just utilize the new infrastructure and put the original one to rest.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.212510692@linutronix.de
2025-11-04 08:33:12 +01:00
Thomas Gleixner
abc850e761 rseq: Provide and use rseq_update_user_cs()
Provide a straightforward implementation to check for and, if necessary,
clear/fixup critical sections in user space.

The non-debug version does only the minimal sanity checks and aims for
efficiency.

There are two attack vectors, which are checked for:

  1) An abort IP which is in the kernel address space. That would cause at
     least x86 to return to kernel space via IRET.

  2) A rogue critical section descriptor with an abort IP pointing to some
     arbitrary address, which is not preceded by the RSEQ signature.

If the section descriptors are invalid then the resulting misbehaviour of
the user space application is not the kernel's problem.

The kernel provides a run-time switchable debug slow path, which implements
the full zoo of checks including termination of the task when one of the
gazillion conditions is not met.

Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME
handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will
be replaced and removed in a subsequent step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.151465632@linutronix.de
2025-11-04 08:32:57 +01:00
Thomas Gleixner
9c37cb6e80 rseq: Provide static branch for runtime debugging
Config based debug is rarely turned on and is not available easily when
things go wrong.

Provide a static branch to allow permanent integration of debug mechanisms
along with the usual toggles in Kconfig, command line and debugfs.
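
A minimal sketch of the static-branch pattern; the key and helper names are
assumptions:

        #include <linux/jump_label.h>

        static DEFINE_STATIC_KEY_FALSE(rseq_debug_enabled);

        static __always_inline bool rseq_debug_active(void)
        {
                /* Compiles down to a patched NOP unless enabled at runtime. */
                return static_branch_unlikely(&rseq_debug_enabled);
        }

        /* Kconfig default, command line or debugfs would flip it via
         * static_branch_enable(&rseq_debug_enabled). */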

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.089270547@linutronix.de
2025-11-04 08:32:49 +01:00
Thomas Gleixner
5412910487 rseq: Expose lightweight statistics in debugfs
Analyzing the call frequency without actually using tracing is helpful for
analysis of this infrastructure. The overhead is minimal as it just
increments a per CPU counter associated to each operation.

The debugfs readout provides a racy sum of all counters.
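
The pattern boils down to something like this; the counter name is
illustrative:

        #include <linux/percpu.h>

        static DEFINE_PER_CPU(unsigned long, rseq_stat_fastpath);

        static inline void rseq_stat_inc_fastpath(void)
        {
                raw_cpu_inc(rseq_stat_fastpath);        /* lockless, CPU local */
        }

        /* The debugfs read side simply adds up the per-CPU values without
         * synchronization, hence the "racy sum". */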

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.027916598@linutronix.de
2025-11-04 08:32:41 +01:00
Thomas Gleixner
dab344753e rseq: Provide tracepoint wrappers for inline code
Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline
fast path, so that the header can be safely included by code which defines
actual trace points.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.967114316@linutronix.de
2025-11-04 08:32:35 +01:00
Thomas Gleixner
2fc0e4b412 rseq: Record interrupt from user space
For RSEQ the only relevant reason to inspect and eventually fixup (abort)
user space critical sections is when user space was interrupted and the
task was scheduled out.

If the user to kernel entry was from a syscall no fixup is required. If
user space invokes a syscall from a critical section it can keep the
pieces as documented.

This is only supported on architectures which utilize the generic entry
code. If your architecture does not use it, bad luck.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.905067101@linutronix.de
2025-11-04 08:32:23 +01:00
Thomas Gleixner
4b7de6df20 rseq: Cache CPU ID and MM CID values
In preparation for rewriting RSEQ exit to user space handling provide
storage to cache the CPU ID and MM CID values which were written to user
space. That prepares for a quick check, which avoids the update when
nothing changed.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.841964081@linutronix.de
2025-11-04 08:32:14 +01:00
Thomas Gleixner
4fc9225d19 sched: Move MM CID related functions to sched.h
There is nothing mm specific in that and including mm.h can cause header
recursion hell.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.778457951@linutronix.de
2025-11-04 08:32:04 +01:00
Thomas Gleixner
7702a9c285 entry: Inline irqentry_enter/exit_from/to_user_mode()
There is no point in having this as a function which just inlines
enter_from_user_mode(). The function call overhead is larger than the
function itself.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.715309918@linutronix.de
2025-11-04 08:31:47 +01:00
Thomas Gleixner
54a5ab5624 entry: Remove syscall_enter_from_user_mode_prepare()
Open code the only user in the x86 syscall code and reduce the zoo of
functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.652839989@linutronix.de
2025-11-04 08:31:37 +01:00
Thomas Gleixner
5204be1679 entry: Clean up header
Clean up the include ordering, kernel-doc and other trivialities before
making further changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.590338411@linutronix.de
2025-11-04 08:31:14 +01:00
Thomas Gleixner
faba9d250e rseq: Introduce struct rseq_data
In preparation for a major rewrite of this code, provide a data structure
for rseq management.

Put all the rseq related data into it (except for the debug part), which
allows to simplify fork/execve by using memset() and memcpy() instead of
adding new fields to initialize over and over.

Create a storage struct for event management as well and put the
sched_switch event and an indicator for RSEQ on a task into it as a
start. That uses a union, which allows to mask and clear the whole lot
efficiently.

The indicators are explicitly not a bit field. Bit fields generate abysmal
code.

The boolean members are defined as u8 as that actually guarantees that it
fits. There seem to be strange architecture ABIs which need more than 8
bits for a boolean.

The has_rseq member is redundant vs. task::rseq, but it turns out that
boolean operations and quick checks on the union generate better code than
fiddling with separate entities and data types.

This struct will be extended over time to carry more information.
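
Roughly the shape described above, as a sketch; member names not mentioned in
the text are assumptions:

        struct rseq_event {
                union {
                        u32     all;            /* clear/mask everything at once */
                        struct {
                                u8      sched_switch;   /* scheduled out since last exit */
                                u8      has_rseq;       /* task has rseq registered */
                        };
                };
        };

        struct rseq_data {
                /* user pointer, length, signature, cached IDs, ... */
                struct rseq_event       event;
        };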

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.527086690@linutronix.de
2025-11-04 08:30:50 +01:00
Thomas Gleixner
566d8015f7 rseq: Avoid CPU/MM CID updates when no event pending
There is no need to update these values unconditionally if there is no
event pending.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.462964916@linutronix.de
2025-11-04 08:30:43 +01:00
Thomas Gleixner
83409986f4 rseq, virt: Retrigger RSEQ after vcpu_run()
Hypervisors invoke resume_user_mode_work() before entering the guest, which
clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that a possibly pending rseq event is not lost on the
way to user space.

This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.

It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
vcpu_run() loop before returning from the ioctl().

This ensures that a pending RSEQ update is not lost and the IDs are updated
before returning to user space.

Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
a NOOP.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de
2025-11-04 08:30:23 +01:00
Thomas Gleixner
d923739e2e rseq: Simplify the event notification
Since commit 0190e4198e ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.

Aside from that, the only relevant point where an event has to be raised is
context switch. Neither the CPU nor MM CID can change without going through
a context switch.

Collapse them all into a single boolean which simplifies the code a lot and
remove the pointless invocations which have been sprinkled all over the
place for no value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.336978188@linutronix.de
2025-11-04 08:30:09 +01:00
Thomas Gleixner
067b3b41b4 rseq: Simplify registration
There is no point to read the critical section element in the newly
registered user space RSEQ struct first in order to clear it.

Just clear it and be done with it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.274661227@linutronix.de
2025-11-04 08:30:05 +01:00
Thomas Gleixner
41b43a6ba3 rseq: Remove the ksig argument from rseq_handle_notify_resume()
There is no point in this being visible in the resume_to_user_mode()
handling.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.211520245@linutronix.de
2025-11-04 08:30:01 +01:00
Thomas Gleixner
77f19e4d4f rseq: Move algorithm comment to top
Move the comment which documents the RSEQ algorithm to the top of the file,
so it does not create horrible diffs later when the actual implementation
is fed into the mincer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.149519580@linutronix.de
2025-11-04 08:29:52 +01:00
Thomas Gleixner
fdc0f39d28 rseq: Condense the inline stubs
Scrolling over tons of pointless

	{
	}

lines to find the actual code is annoying at best.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.085971048@linutronix.de
2025-11-04 08:29:08 +01:00
Thomas Gleixner
3ca59da7aa rseq: Avoid pointless evaluation in __rseq_notify_resume()
The RSEQ critical section mechanism only clears the event mask when a
critical section is registered, otherwise it is stale and collects
bits.

That means once a critical section is installed the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.

This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite some infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.

Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.

Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled, which
ensures that the read and clear operation is CPU local atomic versus
scheduling and the membarrier IPI.

This is correct as after re-enabling interrupts/preemption any relevant
event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
user space exit code take another round of TIF bit clearing.

If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.

Add an exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.

While at it, reword the convoluted comment about why the pt_regs pointer can be
NULL under certain circumstances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.022571576@linutronix.de
2025-11-04 08:28:38 +01:00
Thomas Gleixner
3ce17e6909 select: Convert to scoped user access
Replace the open coded implementation with the scoped user access guard.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027083745.862419776@linutronix.de
2025-11-04 08:28:34 +01:00
Thomas Gleixner
e02718c986 x86/futex: Convert to scoped user access
Replace the open coded implementation with the scoped user access
guards.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251027083745.799714344@linutronix.de
2025-11-04 08:28:29 +01:00
Thomas Gleixner
e4e28fd698 futex: Convert to get/put_user_inline()
Replace the open coded implementation with the new get/put_user_inline()
helpers. This might be replaced by a regular get/put_user(), but that needs
a proper performance evaluation.

No functional change intended.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://patch.msgid.link/20251027083745.736737934@linutronix.de
2025-11-04 08:28:23 +01:00
Thomas Gleixner
b2cfc0cd68 uaccess: Provide put/get_user_inline()
Provide convenience wrappers around scoped user access similar to
put/get_user(), which reduce the usage sites to:

       if (!get_user_inline(val, ptr))
       		return -EFAULT;

Should only be used if there is a demonstrable performance benefit.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027083745.609031602@linutronix.de
2025-11-04 08:28:15 +01:00
Thomas Gleixner
e497310b4f uaccess: Provide scoped user access regions
User space access regions are tedious and require similar code patterns all
over the place:

     	if (!user_read_access_begin(from, sizeof(*from)))
		return -EFAULT;
	unsafe_get_user(val, from, Efault);
	user_read_access_end();
	return 0;
Efault:
	user_read_access_end();
	return -EFAULT;

This got worse with the recent addition of masked user access, which
optimizes the speculation prevention:

	if (can_do_masked_user_access())
		from = masked_user_read_access_begin((from));
	else if (!user_read_access_begin(from, sizeof(*from)))
		return -EFAULT;
	unsafe_get_user(val, from, Efault);
	user_read_access_end();
	return 0;
Efault:
	user_read_access_end();
	return -EFAULT;

There have been issues with using the wrong user_*_access_end() variant in
the error path and other typical Copy&Pasta problems, e.g. using the wrong
fault label in the user accessor which ends up using the wrong access end
variant.

These patterns beg for scopes with automatic cleanup. The resulting outcome
is:
    	scoped_user_read_access(from, Efault)
		unsafe_get_user(val, from, Efault);
	return 0;
  Efault:
	return -EFAULT;

The scope guarantees the proper cleanup for the access mode is invoked both
in the success and the failure (fault) path.

The scoped_user_$MODE_access() macros are implemented as self terminating
nested for() loops. Thanks to Andrew Cooper for pointing me at them. The
scope can therefore be left with 'break', 'goto' and 'return'.  Even
'continue' "works" due to the self termination mechanism. Both GCC and
clang optimize all the convoluted macro maze out and the above results with
clang in:

 b80:	f3 0f 1e fa          	       endbr64
 b84:	48 b8 ef cd ab 89 67 45 23 01  movabs $0x123456789abcdef,%rax
 b8e:	48 39 c7    	               cmp    %rax,%rdi
 b91:	48 0f 47 f8          	       cmova  %rax,%rdi
 b95:	90                   	       nop
 b96:	90                   	       nop
 b97:	90                   	       nop
 b98:	31 c9                	       xor    %ecx,%ecx
 b9a:	8b 07                	       mov    (%rdi),%eax
 b9c:	89 06                	       mov    %eax,(%rsi)
 b9e:	85 c9                	       test   %ecx,%ecx
 ba0:	0f 94 c0             	       sete   %al
 ba3:	90                   	       nop
 ba4:	90                   	       nop
 ba5:	90                   	       nop
 ba6:	c3                   	       ret

Which looks as compact as it gets. The NOPs are placeholder for STAC/CLAC.
GCC emits the fault path separately:

 bf0:	f3 0f 1e fa          	       endbr64
 bf4:	48 b8 ef cd ab 89 67 45 23 01  movabs $0x123456789abcdef,%rax
 bfe:	48 39 c7             	       cmp    %rax,%rdi
 c01:	48 0f 47 f8          	       cmova  %rax,%rdi
 c05:	90                   	       nop
 c06:	90                   	       nop
 c07:	90                   	       nop
 c08:	31 d2                	       xor    %edx,%edx
 c0a:	8b 07                	       mov    (%rdi),%eax
 c0c:	89 06                	       mov    %eax,(%rsi)
 c0e:	85 d2                	       test   %edx,%edx
 c10:	75 09                	       jne    c1b <afoo+0x2b>
 c12:	90                   	       nop
 c13:	90                   	       nop
 c14:	90                   	       nop
 c15:	b8 01 00 00 00       	       mov    $0x1,%eax
 c1a:	c3                   	       ret
 c1b:	90                   	       nop
 c1c:	90                   	       nop
 c1d:	90                   	       nop
 c1e:	31 c0                	       xor    %eax,%eax
 c20:	c3                   	       ret

The fault labels for the scoped*() macros and the fault labels for the
actual user space accessors can be shared and must be placed outside of the
scope.

If masked user access is enabled on an architecture, then the pointer
handed in to scoped_user_$MODE_access() can be modified to point to a
guaranteed faulting user address. This modification is only scope local as
the pointer is aliased inside the scope. When the scope is left, the alias
is no longer in effect. IOW, the original pointer value is preserved so it
can be used e.g. for fixup or diagnostic purposes in the fault path.
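For illustration, the underlying pattern can be demonstrated in plain C: a
self-terminating for() loop whose scope variable carries a cleanup
attribute, so that leaving the scope via 'break', 'goto' or 'return' still
runs the cleanup. This is a minimal sketch of the idea, not the in-tree
macro:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the real access-end cleanup; runs whenever the scope
 * variable goes out of scope, including via goto or return.
 */
static void scope_cleanup(bool *unused)
{
	(void)unused;
	puts("cleanup runs on every exit path");
}

#define scoped_demo()							\
	for (bool __scope __attribute__((cleanup(scope_cleanup))) = true; \
	     __scope; __scope = false)

static int demo(int fail)
{
	scoped_demo() {
		if (fail)
			goto efault;	/* cleanup still runs */
		/* guarded work goes here */
	}
	return 0;
efault:
	return -1;
}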

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027083745.546420421@linutronix.de
2025-11-04 08:27:52 +01:00
Thomas Gleixner
2db48d8bf8 arm64: uaccess: Use unsafe wrappers for ASM GOTO
Clang propagates a provided label that is outside of a cleanup scope to
ASM GOTO, despite the fact that __raw_get_mem() has a local label for that
purpose:

  "error: cannot jump from this asm goto statement to one of its possible targets"

Using the unsafe wrapper with the extra local label indirection cures that.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-11-04 08:27:20 +01:00
Thomas Gleixner
43cc54d8db s390/uaccess: Use unsafe wrappers for ASM GOTO
ASM GOTO is miscompiled by GCC when it is used inside an auto cleanup scope:

bool foo(u32 __user *p, u32 val)
{
	scoped_guard(pagefault)
		unsafe_put_user(val, p, efault);
	return true;
efault:
	return false;
}

It ends up leaking the pagefault disable counter in the fault path. Clang
at least fails the build.

S390 is not affected for unsafe_*_user() as it uses its own local label
already, but __get/put_kernel_nofault() lack that.

Rename them to arch_*_kernel_nofault() which makes the generic uaccess
header wrap it with a local label that makes both compilers emit correct
code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Link: https://patch.msgid.link/20251027083745.483079889@linutronix.de
2025-11-03 15:26:10 +01:00
Thomas Gleixner
0988ea18c6 riscv/uaccess: Use unsafe wrappers for ASM GOTO
ASM GOTO is miscompiled by GCC when it is used inside an auto cleanup scope:

bool foo(u32 __user *p, u32 val)
{
	scoped_guard(pagefault)
		unsafe_put_user(val, p, efault);
	return true;
efault:
	return false;
}

It ends up leaking the pagefault disable counter in the fault path. Clang
at least fails the build.

Rename unsafe_*_user() to arch_unsafe_*_user() which makes the generic
uaccess header wrap it with a local label that makes both compilers emit
correct code. Same for the kernel_nofault() variants.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027083745.419351819@linutronix.de
2025-11-03 15:26:10 +01:00
Thomas Gleixner
5002dd5314 powerpc/uaccess: Use unsafe wrappers for ASM GOTO
ASM GOTO is miscompiled by GCC when it is used inside an auto cleanup scope:

bool foo(u32 __user *p, u32 val)
{
	scoped_guard(pagefault)
		unsafe_put_user(val, p, efault);
	return true;
efault:
	return false;
}

It ends up leaking the pagefault disable counter in the fault path. Clang
at least fails the build.

Rename unsafe_*_user() to arch_unsafe_*_user() which makes the generic
uaccess header wrap it with a local label that makes both compilers emit
correct code. Same for the kernel_nofault() variants.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027083745.356628509@linutronix.de
2025-11-03 15:26:09 +01:00
Thomas Gleixner
14219398e3 x86/uaccess: Use unsafe wrappers for ASM GOTO
ASM GOTO is miscompiled by GCC when it is used inside an auto cleanup scope:

bool foo(u32 __user *p, u32 val)
{
	scoped_guard(pagefault)
		unsafe_put_user(val, p, efault);
	return true;
efault:
	return false;
}

It ends up leaking the pagefault disable counter in the fault path. Clang
at least fails the build.

Rename unsafe_*_user() to arch_unsafe_*_user() which makes the generic
uaccess header wrap it with a local label that makes both compilers emit
correct code. Same for the kernel_nofault() variants.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027083745.294359925@linutronix.de
2025-11-03 15:26:09 +01:00
Thomas Gleixner
3eb6660f26 uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()
ASM GOTO is miscompiled by GCC when it is used inside an auto cleanup scope:

bool foo(u32 __user *p, u32 val)
{
	scoped_guard(pagefault)
		unsafe_put_user(val, p, efault);
	return true;
efault:
	return false;
}

 e80:	e8 00 00 00 00       	call   e85 <foo+0x5>
 e85:	65 48 8b 05 00 00 00 00 mov    %gs:0x0(%rip),%rax
 e8d:	83 80 04 14 00 00 01 	addl   $0x1,0x1404(%rax)   // pf_disable++
 e94:	89 37                	mov    %esi,(%rdi)
 e96:	83 a8 04 14 00 00 01 	subl   $0x1,0x1404(%rax)   // pf_disable--
 e9d:	b8 01 00 00 00       	mov    $0x1,%eax           // success
 ea2:	e9 00 00 00 00       	jmp    ea7 <foo+0x27>      // ret
 ea7:	31 c0                	xor    %eax,%eax           // fail
 ea9:	e9 00 00 00 00       	jmp    eae <foo+0x2e>      // ret

which is broken as it leaks the pagefault disable counter on failure.

Clang at least fails the build.

Linus suggested adding a local label inside the macro scope and letting it
jump to the actual caller-supplied error label.

	__label__ local_label;                                  \
	arch_unsafe_get_user(x, ptr, local_label);              \
	if (0) {                                                \
	local_label:                                            \
		goto label;                                     \
	}                                                       \

That works for both GCC and clang.
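Spelled out as a wrapper macro, the workaround might look roughly like this
(an illustrative sketch; the in-tree definition may differ in detail):

#define unsafe_get_user(x, ptr, label)				\
do {								\
	__label__ local_label;					\
	arch_unsafe_get_user(x, ptr, local_label);		\
	if (0) {						\
	local_label:						\
		goto label;					\
	}							\
} while (0)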

clang:

 c80:	0f 1f 44 00 00       	   nopl   0x0(%rax,%rax,1)
 c85:	65 48 8b 0c 25 00 00 00 00 mov    %gs:0x0,%rcx
 c8e:	ff 81 04 14 00 00    	   incl   0x1404(%rcx)	   // pf_disable++
 c94:	31 c0                	   xor    %eax,%eax        // set retval to false
 c96:	89 37                      mov    %esi,(%rdi)      // write
 c98:	b0 01                	   mov    $0x1,%al         // set retval to true
 c9a:	ff 89 04 14 00 00    	   decl   0x1404(%rcx)     // pf_disable--
 ca0:	2e e9 00 00 00 00    	   cs jmp ca6 <foo+0x26>   // ret

The exception table entry points correctly to c9a.

GCC:

 f70:   e8 00 00 00 00          call   f75 <baz+0x5>
 f75:   65 48 8b 05 00 00 00 00 mov    %gs:0x0(%rip),%rax
 f7d:   83 80 04 14 00 00 01    addl   $0x1,0x1404(%rax)  // pf_disable++
 f84:   8b 17                   mov    (%rdi),%edx
 f86:   89 16                   mov    %edx,(%rsi)
 f88:   83 a8 04 14 00 00 01    subl   $0x1,0x1404(%rax) // pf_disable--
 f8f:   b8 01 00 00 00          mov    $0x1,%eax         // success
 f94:   e9 00 00 00 00          jmp    f99 <baz+0x29>    // ret
 f99:   83 a8 04 14 00 00 01    subl   $0x1,0x1404(%rax) // pf_disable--
 fa0:   31 c0                   xor    %eax,%eax         // fail
 fa2:   e9 00 00 00 00          jmp    fa7 <baz+0x37>    // ret

The exception table entry points correctly to f99.

So both compilers optimize out the extra goto and emit correct and
efficient code.

Provide a generic wrapper to do that, to avoid modifying all of the affected
architecture-specific implementations with that workaround.

The only change required for architectures is to rename unsafe_*_user() to
arch_unsafe_*_user(). That's done in subsequent changes.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/877bweujtn.ffs@tglx
2025-11-03 15:26:09 +01:00
Thomas Gleixner
44c5b6768e ARM: uaccess: Implement missing __get_user_asm_dword()
When CONFIG_CPU_SPECTRE=n then get_user() is missing the 8 byte ASM variant
for no real good reason. This prevents using get_user(u64) in generic code.

Implement it as a sequence of two 4-byte reads with LE/BE awareness and
make the unsigned long (or long long) type of the intermediate variable to
read into dependent on the target type.

The __long_type() macro and the idea were lifted from PowerPC. Thanks to
Christophe for pointing it out.
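A minimal sketch of the idea in plain C (illustrative only; the real
implementation uses inline assembly with exception table fixups):

#include <stdint.h>

/* Read a 64-bit value as two 32-bit halves and combine them according to
 * the byte order of the target.
 */
static inline uint64_t read_u64_as_two_u32(const uint32_t *p)
{
	uint32_t lo, hi;

#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
	hi = p[0];
	lo = p[1];
#else
	lo = p[0];
	hi = p[1];
#endif
	return ((uint64_t)hi << 32) | lo;
}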

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509120155.pFgwfeUD-lkp@intel.com/
Link: https://patch.msgid.link/20251027083745.168468637@linutronix.de
2025-11-03 15:26:09 +01:00
Rob Herring (Arm)
989b40b757 perf: arm_pmuv3: Add new Cortex and C1 CPU PMUs
Add CPU PMU compatible strings for Cortex-A320, Cortex-A520AE,
Cortex-A720AE, and C1 Nano/Premium/Pro/Ultra.

Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 14:21:55 +00:00
Ma Ke
970e1e4180 perf: arm_cspmu: fix error handling in arm_cspmu_impl_unregister()
driver_find_device() calls get_device() to increment the reference
count once a matching device is found. device_release_driver()
releases the driver, but it does not decrease the reference count that
was incremented by driver_find_device(). At the end of the loop, there
is no put_device() to balance the reference count. To avoid reference
count leakage, add put_device() to decrease the reference count.

Found by code review.

Cc: stable@vger.kernel.org
Fixes: bfc653aa89 ("perf: arm_cspmu: Separate Arm and vendor module")
Signed-off-by: Ma Ke <make24@iscas.ac.cn>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 14:20:03 +00:00
Robin Murphy
8fa08f8835 perf/arm-ni: Add NoC S3 support
NoC S3 and its SI L1 sibling look largely similar to their predecessors,
but add the notion of subfeatures to the discovery process, which we now
use to find the event muxes for each device node. Plus, as ever, more
mildly annoying shuffling around of some of the PMU registers (this time
it's the counters...)

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 14:19:17 +00:00
Besar Wicaksono
decc3684c2 perf/arm_cspmu: nvidia: Add pmevfiltr2 support
Support NVIDIA PMU that utilizes the optional event filter2 register.

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 13:35:07 +00:00
Besar Wicaksono
82dfd72bfb perf/arm_cspmu: nvidia: Add revision id matching
Distinguish NVIDIA devices by revision and variant bits
in PMIIDR register in addition to product id.

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 13:35:07 +00:00
Besar Wicaksono
04330be8dc perf/arm_cspmu: Add pmpidr support
The PMIIDR value is composed of the values in the PMPIDR registers.
The PMPIDR registers can be used as an alternative for device
identification on systems that do not implement PMIIDR.

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 13:35:07 +00:00
Besar Wicaksono
a2573bc790 perf/arm_cspmu: Add callback to reset filter config
An implementer may need to reset a filter config when
stopping a counter, so add a callback for this.

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 13:35:07 +00:00
Yicong Yang
c3d78c34ad perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
CPU_CYCLES is expected to count the logical CPU (PE) clock. Currently it's
preferred to use PMCCNTR_EL0 for counting CPU_CYCLES, but it'll count
processor clock rather than the PE clock (ARM DDI0487 L.b D13.1.3) if
one of the SMT siblings is not idle on a multi-threaded implementation.
So don't use it on SMT cores.

Introduce topology_core_has_smt() to detect an SMT implementation and
cache it in arm_pmu::has_smt during allocation.

When counting cycles on SMT CPUs 2-3 while CPU 3 is idle, without this
patch we'll get:
[root@client1 tmp]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
--taskset 2 --timeout 1
[...]
 Performance counter stats for 'CPU(s) 2-3':

CPU2           2880457316      cycles
CPU3           2880459810      cycles
       1.254688470 seconds time elapsed

With this patch the idle state of CPU3 is observed as expected:
[root@client1 ~]#  perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
--taskset 2 --timeout 1
[...]
 Performance counter stats for 'CPU(s) 2-3':

CPU2           2558580492      cycles
CPU3               305749      cycles
       1.113626410 seconds time elapsed

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03 13:28:48 +00:00
Lukas Wunner
51d0656959 genirq/manage: Reduce priority of forced secondary interrupt handler
Crystal reports that the PCIe Advanced Error Reporting driver gets stuck
in an infinite loop on PREEMPT_RT:

Both the primary interrupt handler aer_irq() and the secondary handler
aer_isr() are forced into threads with identical priority.
Crystal writes that on the ARM system in question, the primary handler
has to clear an error in the Root Error Status register...

   "before the next error happens, or else the hardware will set the
    Multiple ERR_COR Received bit.  If that bit is set, then aer_isr()
    can't rely on the Error Source Identification register, so it scans
    through all devices looking for errors -- and for some reason, on
    this system, accessing the AER registers (or any Config Space above
    0x400, even though there are capabilities located there) generates
    an Unsupported Request Error (but returns valid data).  Since this
    happens more than once, without aer_irq() preempting, it causes
    another multi error and we get stuck in a loop."

The issue does not show on non-PREEMPT_RT because the primary handler
runs in hardirq context and thus can preempt the threaded secondary
handler, clear the Root Error Status register and prevent the secondary
handler from getting stuck.

Emulate the same behavior on PREEMPT_RT by assigning a lower default
priority to the secondary handler if the primary handler is forced into
a thread.

Reported-by: Crystal Wood <crwood@redhat.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Crystal Wood <crwood@redhat.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://patch.msgid.link/f6dcdb41be2694886b8dbf4fe7b3ab89e9d5114c.1761569303.git.lukas@wunner.de
Closes: https://lore.kernel.org/r/20250902224441.368483-1-crwood@redhat.com/
2025-11-01 21:30:02 +01:00
Frederic Weisbecker
ba14500e4b timers/migration: Remove dead code handling idle CPU checking for remote timers
Idle migrators don't walk the whole tree to find out if there are timers
to migrate because they record the next deadline, which is then verified
with a single check in tmigr_requires_handle_remote().

Remove the related dead code and data.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-7-frederic@kernel.org
2025-11-01 20:38:25 +01:00
Frederic Weisbecker
93643b90d6 timers/migration: Remove unused "cpu" parameter from tmigr_get_group()
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-6-frederic@kernel.org
2025-11-01 20:38:25 +01:00
Frederic Weisbecker
3c8eb36e2a timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy
The CPU doing the prepare work for a remote target must be online from
the tree point of view and its hierarchy must be active, otherwise
propagating its active state up to the new root branch would be either
incorrect or racy.

Assert those conditions with more sanity checks.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-5-frederic@kernel.org
2025-11-01 20:38:25 +01:00
Frederic Weisbecker
5eb579dfd4 timers/migration: Fix imbalanced NUMA trees
When a CPU from a new node boots, the old root may happen to be
connected to the new root even if their node mismatch, as depicted in
the following scenario:

1) CPU 0 boots and creates the first group for node 0.

   [GRP0:0]
    node 0
      |
    CPU 0

2) CPU 1 from node 1 boots and creates a new top that corresponds to
   node 1, but it also connects the old root from node 0 to the new root
   from node 1 by mistake.

             [GRP1:0]
              node 1
            /        \
           /          \
   [GRP0:0]             [GRP0:1]
    node 0               node 1
      |                    |
    CPU 0                CPU 1

3) This eventually leads to an imbalanced tree where some node 0 CPUs
   migrate node 1 timers (and vice versa) way before reaching the
   crossnode groups, resulting in more frequent remote memory accesses
   than expected.

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:1]
              node 1               node 0
            /        \                |
           /          \             [...]
   [GRP0:0]             [GRP0:1]
    node 0               node 1
      |                    |
    CPU 0...              CPU 1...

A balanced tree should only contain groups having children that belong
to the same node:

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:1]
              node 0               node 1
            /        \             /      \
           /          \           /        \
   [GRP0:0]          [...]      [...]    [GRP0:1]
    node 0                                node 1
      |                                     |
    CPU 0...                              CPU 1...

In order to fix this, the hierarchy must be unfolded up to the crossnode
level as soon as a node mismatch is detected. For example the stage 2
above should lead to this layout:

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:1]
              node 0               node 1
              /                         \
             /                           \
        [GRP0:0]                        [GRP0:1]
        node 0                           node 1
          |                                |
       CPU 0                             CPU 1

This means that not only GRP1:0 must be created but also GRP1:1 and
GRP2:0 in order to prepare a balanced tree for next CPUs to boot.

Fixes: 7ee9887703 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-4-frederic@kernel.org
2025-11-01 20:38:25 +01:00
Frederic Weisbecker
fa9620355d timers/migration: Remove locking on group connection
Initializing the tmc's group, the group's number of children and the
group's parent can all be done without locking because:

  1) Reading the group's parent and its group mask is done locklessly.

  2) The connections prepared for a given CPU hierarchy are visible to the
     target CPU once online, thanks to the CPU hotplug enforced memory
     ordering.

  3) In case of a newly created upper level, the new root and its
     connections and initialization are made visible by the CPU which made
     the connections. When that CPU goes idle in the future, the new link
     is published by tmigr_inactive_up() through the atomic RmW on
     ->migr_state.

  4) If CPUs were still walking up the active hierarchy, they could observe
     the new root earlier. In this case the ordering is enforced by an
     early initialization of the group mask and by barriers that maintain
     address dependency as explained in:

     b729cc1ec2 ("timers/migration: Fix another race between hotplug and idle entry/exit")
     de3ced72a7 ("timers/migration: Enforce group initialization visibility to tree walkers")

  5) Timers are propagated by a chain of group locking from the bottom to
     the top. And while doing so, the tree also propagates group links
     and initialization. Therefore remote expiration, which also relies
     on group locking, will observe those links and initialization while
     holding the root lock before walking the tree remotely and update
     remote timers. This is especially important for migrators in the
     active hierarchy that may observe the new root early.

Therefore the locking is unnecessary at initialization. If anything, it
just brings confusion. Remove it.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-3-frederic@kernel.org
2025-11-01 20:38:25 +01:00
Frederic Weisbecker
6c181b5667 timers/migration: Convert "while" loops to use "for"
Both the "do while" and "while" loops in tmigr_setup_groups() eventually
mimic the behaviour of "for" loops.

Simplify accordingly.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-2-frederic@kernel.org
2025-11-01 20:38:24 +01:00
Steve Wahl
4138787408 tick/sched: Limit non-timekeeper CPUs calling jiffies update
On large NUMA systems, while running a test program that saturates the
inter-processor and inter-NUMA links, acquiring the jiffies_lock can be
very expensive.

If the CPU designated to do jiffies updates (tick_do_timer_cpu) gets
delayed and other CPUs decide to do the jiffies update themselves, a large
number of them decide to do so at the same time.

The inexpensive check against tick_next_period is far quicker than actually
acquiring the lock, so most of these get in line to obtain the lock.  If
obtaining the lock is slow enough, this spirals into the vast majority of
CPUs continuously being stuck waiting for this lock, just to obtain it and
find out that time has already been updated by another CPU. For example, on
one random entry to kdb by manually-injected NMI, 2912 of 3840 CPUs were
observed to be stuck there.

To avoid this, allow only one non-timekeeper CPU to call
tick_do_update_jiffies64() at any given time, resetting ts->stalled_jiffies
only if the jiffies update function is actually called.

With this change, when manually interrupting the test, at most two CPUs are
observed to invoke tick_do_update_jiffies64() - the timekeeper and one
other.
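A conceptual sketch of the gating idea with hypothetical names (the actual
patch works on the tick/sched internals rather than a standalone helper):

static atomic_t jiffies_update_claimed = ATOMIC_INIT(0);

/* Hypothetical illustration: only one non-timekeeper CPU attempts the
 * update at a time; the rest skip instead of queueing on jiffies_lock
 * just to find the update already done.
 */
static bool try_jiffies_update(ktime_t now)
{
	if (atomic_cmpxchg(&jiffies_update_claimed, 0, 1) != 0)
		return false;	/* another CPU is already on it */

	tick_do_update_jiffies64(now);
	atomic_set(&jiffies_update_claimed, 0);
	return true;
}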

Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251027183456.343407-1-steve.wahl@hpe.com
2025-11-01 20:25:53 +01:00
Muchun Song
9ea2b810d5 genirq/proc: Fix race in show_irq_affinity()
Reading /proc/irq/N/smp_affinity* races with irq_set_affinity() and
irq_move_masked_irq(), leading to old or torn output for users.

After a user writes a new CPU mask to /proc/irq/N/affinity*, the syscall
returns success, yet a subsequent read of the same file immediately returns
a value different from what was just written.

That's due to a race between show_irq_affinity() and irq_move_masked_irq()
which lets the read observe a transient, inconsistent affinity mask.

Cure it by guarding the read with irq_desc::lock.
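A conceptual sketch of the fix (illustrative only; the actual patch adjusts
show_irq_affinity() itself):

/* Take the descriptor lock so the printed mask is consistent with
 * concurrent irq_set_affinity() / irq_move_masked_irq() updates.
 */
static void show_affinity_locked(struct seq_file *m, struct irq_desc *desc)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&desc->lock, flags);
	seq_printf(m, "%*pb\n", cpumask_pr_args(desc->irq_common_data.affinity));
	raw_spin_unlock_irqrestore(&desc->lock, flags);
}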

[ tglx: Massaged change log ]

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251028090408.76331-1-songmuchun@bytedance.com
2025-10-31 22:30:05 +01:00
Marc Zyngier
68c4c159a0 genirq: Fix percpu_devid irq affinity documentation
Stephen points out that some of the percpu_devid irq affinity
documentation is either missing or does not match the data structures.

Address all the issues in one go.

Fixes: 87b0031f7f ("irqdomain: Add firmware info reporting interface")
Fixes: 258e7d28a3 ("genirq: Add affinity to percpu_devid interrupt requests")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251030143032.2035987-1-maz@kernel.org
2025-10-31 22:25:34 +01:00
John Allen
92ad6505a4 x86/sev: Include XSS value in GHCB CPUID request
When a guest issues a CPUID instruction for Fn0000000D_x01, the hypervisor may
be intercepting the CPUID instruction and may need to access the guest XSS value.
For SEV-ES, the XSS value is encrypted and needs to be included in the GHCB to
be visible to the hypervisor.

Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://patch.msgid.link/all/20250924200852.4452-3-john.allen@amd.com/
2025-10-30 17:47:49 +01:00
John Allen
9249bcdea0 x86/boot: Move boot_*msr helpers to asm/shared/msr.h
The boot_{rdmsr,wrmsr}() helpers are *just* the barebones MSR access
functionality, without any tracing or exception handling glue as is done in
the kernel proper.

Move these helpers to asm/shared/msr.h and rename to raw_{rdmsr,wrmsr}() to
indicate what they are.

  [ bp: Correct the reason why those helpers exist. I should've caught that in
    the original patch that added them:
      176db62257 ("x86/boot: Introduce helpers for MSR reads/writes"
    but oh well...
    - fixup include path delimiters to <> ]

Signed-off-by: John Allen <john.allen@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://patch.msgid.link/all/20250924200852.4452-2-john.allen@amd.com
2025-10-30 16:29:53 +01:00
Yu Peng
ca8313fd83 x86/microcode: Mark early_parse_cmdline() as __init
Fix section mismatch warning reported by modpost:

  .text:early_parse_cmdline() -> .init.data:boot_command_line

The function early_parse_cmdline() is only called during init and accesses
init data, so mark it __init to match its usage.

  [ bp: This happens only when the toolchain fails to inline the function and
    I haven't been able to reproduce it with any toolchain I'm using. Patch is
    obviously correct regardless. ]

Signed-off-by: Yu Peng <pengyu@kylinos.cn>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/all/20251030123757.1410904-1-pengyu@kylinos.cn
2025-10-30 14:33:31 +01:00
Borislav Petkov (AMD)
8d17104506 x86/microcode/AMD: Select which microcode patch to load
All microcode patches up to the proper BIOS Entrysign fix are loaded
only after the sha256 signature carried in the driver has been verified.

Microcode patches after the Entrysign fix has been applied do not need
that signature verification anymore.

In order to not abandon machines which haven't received the BIOS update
yet, add the capability to select which microcode patch to load.

The corresponding microcode container supplied through firmware-linux
has been modified to carry two patches per CPU type
(family/model/stepping) so that the proper one gets selected.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Waiman Long <longman@redhat.com>
Link: https://patch.msgid.link/20251027133818.4363-1-bp@kernel.org
2025-10-30 14:29:54 +01:00
Yazen Ghannam
187d1b27a1 RAS/AMD/ATL: Require PRM support for future systems
Currently, the AMD Address Translation Library will fail to load for new,
unrecognized systems (based on Data Fabric revision). The intention is to
prevent the code from executing on new systems and returning incorrect
results.

Recent AMD systems, however, may provide UEFI PRM handlers for address
translation. This is code provided by the platform through BIOS tables.  These
are the preferred method for translation, and the Linux native code can be
used as a fallback.

Future AMD systems are expected to provide PRM handlers by default, and Linux
native code will not be used.

Adjust the ATL init code so that new, unrecognized systems will default to
using PRM handlers only.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: "Mario Limonciello (AMD)" <superm1@kernel.org>
Link: https://patch.msgid.link/all/20251017-wip-atl-prm-v2-2-7ab1df4a5fbc@amd.com
2025-10-27 19:56:41 +01:00
Marc Zyngier
fa9d277738 perf: arm_pmu: Kill last use of per-CPU cpu_armpmu pointer
Having removed the use of the cpu_armpmu per-CPU variable from the
interrupt handling, the only user left is the BRBE scheduler hook.

It is easy to drop the use of this variable by following the pointer to the
generic PMU structure, and get the arm_pmu structure from there.

Perform the conversion and kill cpu_armpmu altogether.

Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-27-maz@kernel.org
2025-10-27 17:16:37 +01:00
Marc Zyngier
ebac4649fc irqdomain: Kill of_node_to_fwnode() helper
There are no in-tree users of this helper since b13b41cc3d ("misc:
ti_fpc202: Switch to of_fwnode_handle()"); it has been replaced with
of_fwnode_handle().

Get rid of it.

Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-26-maz@kernel.org
2025-10-27 17:16:37 +01:00
Marc Zyngier
ee2d50a9f5 genirq: Kill irq_{g,s}et_percpu_devid_partition()
These two helpers do not have any user anymore, and can be removed,
together with the affinity field kept in the irqdesc structure.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-25-maz@kernel.org
2025-10-27 17:16:37 +01:00
Marc Zyngier
c620438ef2 irqchip: Kill irq-partition-percpu
This code is now completely unused, and nobody will ever miss it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-24-maz@kernel.org
2025-10-27 17:16:36 +01:00
Marc Zyngier
7443813f10 irqchip/apple-aic: Drop support for custom PMU irq partitions
Similarly to what has been done for GICv3, drop the irq partitioning
support from the AIC driver, effectively merging the two per-cpu interrupts
for the PMU.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Reviewed-by: Sven Peter <sven@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-23-maz@kernel.org
2025-10-27 17:16:36 +01:00
Marc Zyngier
64b9738eaa irqchip/gic-v3: Drop support for custom PPI partitions
The only thing getting in the way of correctly handling PPIs the way they
were intended is the GICv3 hack that deals with PPI partitions.

Remove that code, allowing the common code to kick in.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-22-maz@kernel.org
2025-10-27 17:16:36 +01:00
Marc Zyngier
4cdf4813f5 coresight: trbe: Request specific affinities for per CPU interrupts
Let the TRBE driver request interrupts with an affinity mask matching the
TRBE implementation affinity.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://patch.msgid.link/20251020122944.3074811-21-maz@kernel.org
2025-10-27 17:16:36 +01:00
Marc Zyngier
f8112d29ba perf: arm_spe_pmu: Request specific affinities for per CPU interrupts
Let the SPE driver request interrupts with an affinity mask matching the SPE
implementation affinity.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-20-maz@kernel.org
2025-10-27 17:16:36 +01:00
Will Deacon
54b350fa8e perf: arm_pmu: Request specific affinities for per CPU NMIs/interrupts
Let the PMU driver request both NMIs and normal interrupts with an affinity mask
matching the PMU affinity.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-19-maz@kernel.org
2025-10-27 17:16:35 +01:00
Marc Zyngier
c734af3b2b genirq: Add request_percpu_irq_affinity() helper
While it would be nice to simply make request_percpu_irq() take an affinity
mask, the churn is likely to be on the irritating side given that most
drivers do not give a damn about affinities.

So take the more innocuous path to provide a helper that parallels
request_percpu_irq(), with an affinity as a bonus argument.

Yes, request_percpu_irq_affinity() is a bit of a mouthful.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-18-maz@kernel.org
2025-10-27 17:16:35 +01:00
Marc Zyngier
bdf4e2ac29 genirq: Allow per-cpu interrupt sharing for non-overlapping affinities
Interrupt sharing for percpu-devid interrupts is forbidden, and for good
reasons. These are interrupts generated *from* a CPU and handled by itself
(timer, for example). Nobody in their right mind would put two devices on
the same pin (and if they have, they get to keep the pieces...).

But this also prevents more benign cases, where devices are connected
to groups of CPUs, and for which the affinities are not overlapping.
Effectively, the only thing they share is the interrupt number, and
nothing else.

Tweak the definition of IRQF_SHARED applied to percpu_devid interrupts to
allow this particular use case. This results in extra validation at the
point of the interrupt being setup and freed, as well as a tiny bit of
extra complexity for interrupts at handling time (to pick the correct
irqaction).

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-17-maz@kernel.org
2025-10-27 17:16:35 +01:00
Marc Zyngier
b9c6aa9efc genirq: Update request_percpu_nmi() to take an affinity
Continue spreading the notion of affinity to the per CPU interrupt request
code by updating the call sites that use request_percpu_nmi() (all two of
them) to take an affinity pointer. This pointer is firmly NULL for now.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-16-maz@kernel.org
2025-10-27 17:16:35 +01:00
Marc Zyngier
258e7d28a3 genirq: Add affinity to percpu_devid interrupt requests
Add an affinity field to both the irqaction structure and the interrupt
request primitives. Nothing is making use of it yet, and the only value
used is NULL, which serves as a shorthand for cpu_possible_mask.

This will shortly get used with actual affinities.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-15-maz@kernel.org
2025-10-27 17:16:34 +01:00
Marc Zyngier
9047a39daa genirq: Factor-in percpu irqaction creation
Move the code creating a per-cpu irqaction into its own helper, so that
future changes to this code can be kept localised.

At the same time, fix the documentation which appears to say the wrong
thing when it comes to interrupts being automatically enabled
(percpu_devid interrupts never are).

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251020122944.3074811-14-maz@kernel.org
2025-10-27 17:16:34 +01:00
Marc Zyngier
5c2b2cc472 genirq: Merge irqaction::{dev_id,percpu_dev_id}
When irqaction::percpu_dev_id was introduced, it was hoped that it could be
part of an anonymous union with dev_id, as the two fields are mutually
exclusive.

However, toolchains used at the time were often showing terrible support
for anonymous unions, breaking the build on a number of architectures. It
was therefore decided to keep the two fields separate and address this down
the line.

14 years later, the compiler dark age is over, and there is universal
support for anonymous unions. Get a whole pointer back that can immediately
be spent on something else.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-13-maz@kernel.org
2025-10-27 17:16:34 +01:00
Marc Zyngier
5ff78c8de9 genirq: Kill handle_percpu_devid_fasteoi_nmi()
There is no in-tree user of this flow handler anymore, so simply remove it.

Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-12-maz@kernel.org
2025-10-27 17:16:34 +01:00
Marc Zyngier
21bbbc50f3 irqchip/gic-v3: Switch high priority PPIs over to handle_percpu_devid_irq()
It so appears that handle_percpu_devid_irq() is extremely similar to
handle_percpu_devid_fasteoi_nmi(), and that the differences do not justify
the horrid machinery in the GICv3 driver to handle the flow handler switch.

Stick with the standard flow handler, even for NMIs.

Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-11-maz@kernel.org
2025-10-27 17:16:34 +01:00
Marc Zyngier
f6c8aced7c perf: arm_spe_pmu: Convert to new interrupt affinity retrieval API
Now that the relevant interrupt controllers are equipped with a callback
returning the affinity of per-CPU interrupts, switch the ARM SPE driver
over to this new method.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20251020122944.3074811-10-maz@kernel.org
2025-10-27 17:16:33 +01:00
Marc Zyngier
663783e001 perf: arm_pmu: Convert to the new interrupt affinity retrieval API
Now that the relevant interrupt controllers are equipped with a callback
returning the affinity of per-CPU interrupts, switch the OF side of the ARM
PMU driver over to this new method.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20251020122944.3074811-9-maz@kernel.org
2025-10-27 17:16:33 +01:00
Marc Zyngier
541454dd20 coresight: trbe: Convert to the new interrupt affinity retrieval API
Now that the relevant interrupt controllers are equipped with a callback
returning the affinity of per-CPU interrupts, switch the TRBE driver over
to this new method.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Acked-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://patch.msgid.link/20251020122944.3074811-8-maz@kernel.org
2025-10-27 17:16:33 +01:00
Marc Zyngier
de575de83c irqchip/apple-aic: Add FW info retrieval support
Plug the new .get_fwspec_info() callback into the Apple AIC driver, using
some of the existing FIQ affinity handling infrastructure.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Acked-by: Sven Peter <sven@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-7-maz@kernel.org
2025-10-27 17:16:33 +01:00
Marc Zyngier
68905ea65c irqchip/gic-v3: Add FW info retrieval support
Plug the new .get_fwspec_info() callback into the GICv3 core driver, using
some of the existing PPI affinity handling infrastructure.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-6-maz@kernel.org
2025-10-27 17:16:33 +01:00
Marc Zyngier
0d5daa938c platform: Add firmware-agnostic irq and affinity retrieval interface
Expand platform_get_irq_optional() to also return an affinity if available,
renaming it to platform_get_irq_affinity() in the process.

platform_get_irq_optional() is preserved with its current semantics by
calling into the new helper with a NULL affinity pointer.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251020122944.3074811-5-maz@kernel.org
2025-10-27 17:16:32 +01:00
Marc Zyngier
5404f5c06d of/irq: Add interrupt affinity reporting interface
Plug the irq_populate_fwspec_info() helper into the OF layer to offer an
interrupt affinity reporting function.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251020122944.3074811-4-maz@kernel.org
2025-10-27 17:16:32 +01:00
Marc Zyngier
5324fe21ba ACPI: irq: Add interrupt affinity reporting interface
Plug the irq_populate_fwspec_info() helper into the ACPI layer to offer an
interrupt affinity reporting function. This is currently only supported for
the CONFIG_ACPI_GENERIC_GSI configurations, but could later be extended to
legacy architectures if necessary.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Link: https://patch.msgid.link/20251020122944.3074811-3-maz@kernel.org
2025-10-27 17:16:32 +01:00
Marc Zyngier
87b0031f7f irqdomain: Add firmware info reporting interface
Add an irqdomain callback to report firmware-provided information that is
otherwise not available in a generic way. This is reported using a new data
structure (struct irq_fwspec_info).

This callback is optional and the only information that can be reported
currently is the affinity of an interrupt. However, the containing
structure is designed to be extensible, allowing other potentially relevant
information to be reported in the future.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Will Deacon <will@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20251020122944.3074811-2-maz@kernel.org
2025-10-27 17:16:32 +01:00
Yazen Ghannam
83be4bee57 ACPI: PRM: Add acpi_prm_handler_available()
Add a helper function to check if a PRM handler/module is present.

This can be used during init time by code that depends on a particular
handler. If the handler is not present, then the code does not need to
be loaded.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: "Mario Limonciello (AMD)" <superm1@kernel.org>
Acked-by: "Rafael J. Wysocki (Intel)" <rafael@kernel.org>
Link: https://patch.msgid.link/all/20251017-wip-atl-prm-v2-1-7ab1df4a5fbc@amd.com
2025-10-27 15:45:22 +01:00
Borislav Petkov (AMD)
4058386498 Merge tag 'x86_urgent_for_v6.18_rc3' into x86/microcode
Pick up the below urgent upstream change in order to base more work
ontop:

- Correct the last Zen1 microcode revision for which Entrysign sha256 check is
  needed

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-10-27 14:06:38 +01:00
Charles Mirabile
539d147ef6 irqchip/sifive-plic: Add support for UltraRISC DP1000 PLIC
Add a new compatible for the PLIC found in the UltraRISC DP1000 with a quirk
to work around a known hardware bug with IRQ claiming in the UR-CP100 cores.

When claiming an interrupt on UR-CP100 cores, all other interrupts must be
disabled before the claim register is accessed to prevent incorrect
handling of the interrupt. This is a hardware bug in the CP100 core
implementation, not specific to the DP1000 SoC.

When the PLIC_QUIRK_CP100_CLAIM_REGISTER_ERRATUM flag is present, a
specialized handler (plic_handle_irq_cp100) disables all interrupts except
for the first pending one before reading the claim register, and then
restores the interrupts before further processing of the claimed interrupt
continues.

This implementation leverages the enable_save optimization, which maintains
the current interrupt enable state in memory, avoiding additional register
reads during the workaround.

The driver matches on "ultrarisc,cp100-plic" to apply the quirk to all
SoCs using UR-CP100 cores, regardless of the specific SoC implementation.
This has no impact on other platforms.

[ tglx: Condensed the code a bit, massaged change log and comments ]

Co-developed-by: Zhang Xincheng <zhangxincheng@ultrarisc.com>
Signed-off-by: Zhang Xincheng <zhangxincheng@ultrarisc.com>
Signed-off-by: Charles Mirabile <cmirabil@redhat.com>
Signed-off-by: Lucas Zampieri <lzampier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://patch.msgid.link/20251024083647.475239-5-lzampier@redhat.com
2025-10-27 12:11:56 +01:00
Brendan Jackman
5385dec724 x86/mm: Unify __phys_addr_symbol()
There are two implementations on 64-bit, depending on CONFIG_DEBUG_VIRTUAL,
but they differ only regarding the presence of VIRTUAL_BUG_ON, which is
already ifdef'd on CONFIG_DEBUG_VIRTUAL.

To avoid adding a function call on non-LTO non-DEBUG_VIRTUAL builds, move the
function into the header. (Note the function is already only used on 64-bit).

Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://patch.msgid.link/all/20250813-phys-addr-cleanup-v1-1-19e334b1c466@google.com/
2025-10-24 22:13:00 +02:00
Matthew Wilcox (Oracle)
70e0a80a1f treewide: Remove in_irq()
This old alias for in_hardirq() has been marked as deprecated since
2020; remove the stragglers.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024180654.1691095-1-willy@infradead.org
2025-10-24 21:39:27 +02:00
Charles Mirabile
14ff9e54dd irqchip/sifive-plic: Cache the interrupt enable state
Optimize the PLIC driver by maintaining the interrupt enable state in the
handler's enable_save array during normal operation rather than only during
suspend/resume. This eliminates the need to read enable registers during
suspend and makes the enable state immediately available for other
purposes.

Let __plic_toggle() update both the hardware registers and the cached
enable_save state atomically within the existing enable_lock protection.

That allows removing the suspend-time enable register reading since
handler::enable_save now always reflects the current state.

[ tglx: Massaged change log ]

Signed-off-by: Charles Mirabile <cmirabil@redhat.com>
Signed-off-by: Lucas Zampieri <lzampier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024083647.475239-4-lzampier@redhat.com
2025-10-24 21:34:32 +02:00
Charles Mirabile
9dfb295a93 dt-bindings: interrupt-controller: Add UltraRISC DP1000 PLIC
Add compatible strings for the PLIC found in UltraRISC DP1000 SoC.

The PLIC is part of the UR-CP100 core and has a hardware bug requiring
a workaround.

Signed-off-by: Charles Mirabile <cmirabil@redhat.com>
Signed-off-by: Lucas Zampieri <lzampier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://patch.msgid.link/20251024083647.475239-3-lzampier@redhat.com
2025-10-24 21:34:32 +02:00
Lucas Zampieri
e95f66dd0e dt-bindings: vendor-prefixes: Add UltraRISC
Add vendor prefix for UltraRISC Technology Co., Ltd.

Signed-off-by: Lucas Zampieri <lzampier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20251024083647.475239-2-lzampier@redhat.com
2025-10-24 21:34:31 +02:00
Petr Tesarik
e9cc99142a x86/tsx: Get the tsx= command line parameter with early_param()
Use early_param() to get the value of the tsx= command line parameter. It is
an early parameter, because it must be parsed before tsx_init(), which is
called long before kernel_init(), where normal parameters are parsed.

Although cmdline_find_option() from tsx_init() works fine, the option is later
reported as unknown and passed to user space. The latter is not a real issue,
but the former is confusing and makes people wonder if the tsx= parameter had
any effect and double-check for typos unnecessarily.

The behavior changes slightly if "tsx" is given without any argument (which is
invalid syntax). Until now, the kernel logged an error message and disabled
TSX. Now, the kernel still issues a warning (Malformed early option 'tsx'),
but TSX state is unchanged. The new behavior is consistent with other
parameters, e.g. "tsx_async_abort".
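As a hedged sketch, the early_param() shape looks roughly like this (the
handler body and the value mapping below are illustrative, not the actual
patch):

/* Illustrative sketch only. */
static int __init tsx_parse_cmdline(char *str)
{
	if (!str)
		return -EINVAL;

	/* illustrative mapping only; the real parser handles more cases */
	if (!strcmp(str, "on"))
		tsx_ctrl_state = TSX_CTRL_ENABLE;
	else if (!strcmp(str, "off"))
		tsx_ctrl_state = TSX_CTRL_DISABLE;

	return 0;
}
early_param("tsx", tsx_parse_cmdline);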

   [ bp: Fixup minor formatting request during review. ]

Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/cover.1758906115.git.ptesarik@suse.com
2025-10-24 18:35:17 +02:00
Petr Tesarik
f018fca8f9 x86/tsx: Make tsx_ctrl_state static
Move all definitions related to tsx_ctrl_state to tsx.c. They are
never referenced outside this file.

No functional change.

Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/all/cover.1758906115.git.ptesarik@suse.com
2025-10-24 18:24:42 +02:00
Heiko Carstens
020d5dc578 s390/ap: Don't leak debug feature files if AP instructions are not available
If no AP instructions are available, the AP bus module leaks registered
debug feature files. Change the function call order to fix this.

Fixes: cccd85bfb7 ("s390/zcrypt: Rework debug feature invocations.")
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-24 15:25:56 +02:00
Jens Remus
f25d952ab6 s390/ptrace: Explicitly include <linux/typecheck.h>
The psw_bits() macro makes use of typecheck() without typecheck.h being
included. Add the missing include to avoid potential future compile
problems.

[hca@linux.ibm.com: change commit message]

Signed-off-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-24 15:25:56 +02:00
Harald Freudenberger
51d921a613 s390/ap: Expose ap_bindings_complete_count counter via sysfs
The AP bus udev event BINDINGS=complete is sent out the first
time all devices detected by the AP bus scan have been bound to
device drivers. This is the ideal time to, for example, change
the AP bus masks apmask and aqmask to re-establish a persistent
decision about which cards/domains should be available for the
host and which should go into the pool for KVM guests.

However, if exactly this initial udev event is sent out early
in the boot process, a udev rule may not have been established
yet and thus this event will never be recognized. To have some
indication of whether the AP bus binding complete event has
already happened, the internal ap_bindings_complete_count
counter is exposed via sysfs with this patch.

Suggested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-23 16:11:38 +02:00
Heiko Carstens
07a75d08cf s390/smp: Fix fallback CPU detection
In case SCLP CPU detection does not work, a fallback mechanism using SIGP is
in place. Since a cleanup this does not work correctly anymore: new CPUs
are only considered if their type matches the boot CPU.

Before the cleanup, the information about whether a CPU type should be
considered was also part of a structure generated by the fallback mechanism
and indicated that a CPU type should not be considered when adding CPUs.

Since the rework, a global SCLP state is used instead. If the global SCLP
state indicates that the CPU type should be considered and the fallback
mechanism is used, there may be a mismatch with CPU types if CPUs are
added. This can lead to a system with only a single CPU even though there
are many more CPUs.

Address this by simply copying the boot CPU type into the generated data
structure from the fallback mechanism.

Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Fixes: d08d94306e ("s390/smp: cleanup core vs. cpu in the SCLP interface")
Reviewed-by: Mete Durlu <meted@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-23 16:11:38 +02:00
Gerd Bayer
564ebcae6a s390/pci: Highlight failure to enable PCI function
Emit an error log when a PCI function cannot be enabled for use, despite
being reported as configured to the system.

This brings to attention situations where functions might go missing
without notice. Going unnoticed is less likely when functions are added
to the system through hotplug, but will produce the same error log.

Reviewed-by: Niklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-23 16:11:37 +02:00
Heiko Carstens
5c02c74dd4 Merge branch 'ap-bus-trace-events'
Harald Freudenberger says:

====================

Investigations related to the runtime of crypto requests have revealed a
lack of performance or runtime information with crypto requests. There are
the two zcrypt ioctl trace events covering the entry and exit of an ioctl
with crypto requests, giving the overall runtime within the kernel. However,
there is no way to figure out the time during which a request is enqueued
into the AP bus queue but not yet pushed into the firmware queue. Then there
is no information about the runtime of a request during processing in the
firmware. And finally, some info about pulling the reply from the firmware
and delivering it into user space is missing.

This series is aiming to provide a way to collect measurements which can be
used to cover these runtime information for each crypto request/reply.

====================

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 11:10:55 +02:00
Harald Freudenberger
9c11918040 s390/ap: Introduce new AP nqap and dqap trace events
Introduce two new AP bus related tracepoint events:
- The s390_ap_nqap tracepoint event fires immediately after a request
  has been pushed into the AP firmware queue with the NQAP AP command.
- The s390_ap_dqap tracepoint event fires immediately after a
  reply has been pulled out of the AP firmware queue via the DQAP AP
  command.
Both events are triggered unconditionally and may need filtering.
Filtering can be done based on the status value which is part of
the nqap and dqap trace. For example,
  echo "!(status & 0x00ff0000)" >.../s390_ap_dqap/filter
filters out all trace events which have a response_code != 0,
leaving just the successful nqap and dqap invocations.

The idea behind these two trace events is performance: to measure
the runtime of a crypto request/reply as close to the firmware level
as possible. In combination with the two zcrypt tracepoints (see
the zcrypt.h trace event definition file) this gives measurement data
about the runtime of a request/reply within the zcrypt and AP bus
layers. However, with the status of these AP commands at hand, other
uses may be possible as well.

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 11:09:21 +02:00
Harald Freudenberger
7f124d78d4 s390/ap: Extend struct ap_queue_status with some convenience fields
Sometimes a different view of the AP status word is needed.
So here is a slight rework of struct ap_queue_status to open up
different ways of accessing the AP status bits and fields.

The new struct ap_queue_status

struct ap_queue_status {
	union {
		unsigned int value			: 32;
		struct {
			unsigned int status_bits	: 8;
			unsigned int rc			: 8;
			unsigned int			: 16;
		};
		struct {
			unsigned int queue_empty	: 1;
			unsigned int replies_waiting	: 1;
			unsigned int queue_full		: 1;
			unsigned int			: 3;
			unsigned int async		: 1;
			unsigned int irq_enabled	: 1;
			unsigned int response_code	: 8;
			unsigned int			: 16;
		};
	};
};

comprises the old struct ap_queue_status but extends it
so that the status is also accessible as an unsigned int, which is
required, for example, for a simple print or trace of the whole value.

Note that this rework is fully backward compatible with the
existing code using struct ap_queue_status.
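
A minimal usage sketch of the new union (raw_status and the error
handling are illustrative, not taken from the actual driver code):

  struct ap_queue_status status = { .value = raw_status };

  /* whole 32-bit word, e.g. for a simple debug print or trace */
  pr_debug("AP status word: 0x%08x\n", status.value);

  /* individual fields via the bitfield view */
  if (status.response_code)
          return -EIO;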

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Anthony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 11:09:21 +02:00
Harald Freudenberger
507cff242a s390/zcrypt: Rework zcrypt request and reply trace event definition
This is a slight rework of the s390_zcrypt_req and s390_zcrypt_rep
trace events:
- the psmid has been added to the s390_zcrypt_rep
- "dev" renamed to "card"
- "domain" renamed to "dom"
The motivation for these changes is to align these traces more
closely with the new upcoming AP bus related trace events.
Additionally, the psmid is needed to match the reply (and thus,
indirectly, the request) to the AP bus related trace events, where
the psmid is the only unique identifier of AP messages.

Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Anthony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Holger Dengler <dengler@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 11:09:21 +02:00
Josephine Pfeiffer
215231deea s390/ptdump: Use seq_puts() in pt_dump_seq_puts() macro
The pt_dump_seq_puts() macro incorrectly uses seq_printf() instead of
seq_puts(). This is both a performance issue and conceptually wrong,
as the macro name suggests plain string output (puts) but the
implementation uses formatted output (printf).

The macro is used in dump_pagetables.c:67-68 and 131 to output
constant strings. Using seq_printf() adds unnecessary overhead for
format string parsing.
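
Conceptually the change inside the macro body is just the following
(a sketch, not the verbatim s390 source):

  /* before */
  seq_printf(m, fmt);
  /* after */
  seq_puts(m, fmt);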

Signed-off-by: Josephine Pfeiffer <hi@josie.lol>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:29:50 +02:00
Heiko Carstens
5e09c0a03e Merge branch 'tape-block-sizes'
Jan Höppner says:

====================

The tape device driver is limited to a block size of 65535 bytes since a
single CCW can only transfer up to 64K-1 bytes (the count field is a
16-bit value). This series introduces data chaining for all read/write
functions to support block sizes larger than 65535 bytes.

The tape device types 3490 (emulated) and 3590/3592 can handle up to
256K. [1]

[1] https://www.ibm.com/docs/en/zos/3.1.0?topic=blksize-system-determined-block-size

====================

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:26:26 +02:00
Jan Höppner
319d3d6653 s390/tape: Add support for bigger block sizes
The tape device type 3590/3592 and emulated 3490 VTS can handle a block
size of up to 256K bytes. Currently the tape device driver is limited to
a block size of 65535 bytes (64K-1). This limitation stems from the
maximum of 65535 bytes of data that can be transferred with one
Channel-Command Word (CCW).

To work around this limitation, data chaining is used, which uses several
CCWs to transfer an entire 256K block of data. A single CCW holds a
maximum of 65535 bytes of data.

Set MAX_BLOCKSIZE to 262144 (= 256K) to allow for data transfers with
larger block sizes. The read_block() and write_block() discipline
functions calculate the number of CCWs required based on the IDAL buffer
array size that was created for a given block size. If there is more
than one CCW required for the data transfer, the new helper function
tape_ccw_dc_idal() is used to build the data chain accordingly.

The Interruption-Response Block (irb) is added to the tape_request
struct so that the tapechar_read/write() functions can analyze what data
was read or written accordingly.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:25:55 +02:00
Jan Höppner
574817d6c0 s390/tape: Introduce idal buffer array
The tape device driver uses a single idal_buffer for I/O. While the
buffer itself can be arbitrarily big, the limit for data transfer with a
single Channel-Command Word is 65535 bytes (64K-1) since the count
field specifying the amount of data designated by the CCW is a 16-bit
unsigned value.

Provide functionality that allocates an array of multiple IDAL buffers
with the limitation mentioned above in mind.
A call to idal_buffer_array_alloc() allocates an array with a number
of IDAL buffers that is determined based on the total size
@size. Each individual buffer is limited to a size of CCW_MAX_BYTE_COUNT
(65535 bytes).

Add helper functions that determine the size (# of elements) and the
total data size covered by the array as well.
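
As a rough sketch of the sizing rule implied above (CCW_MAX_BYTE_COUNT is
taken from the text, DIV_ROUND_UP is the generic kernel helper, the
variable names are illustrative):

  /* number of IDAL buffers needed to cover a block of @size bytes */
  unsigned int nr_buffers = DIV_ROUND_UP(size, CCW_MAX_BYTE_COUNT);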

Current users of the single IDAL buffer are adapted to use the new
functions, allocating an array with just one buffer.

The single IDAL buffer is removed from the tape_char_data struct.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:25:55 +02:00
Jan Höppner
a5e2ca22c1 s390/tape: Move idal allocation to core functions
Currently tapechar_check_idalbuffer() is part of tape_char.c and is used
to ensure that the idal buffer is big enough for the requested I/O,
reallocating a new one if required. The same is done in tape_std.c when a
fixed block size is set using the mtsetblk command. This is essentially
duplicate code.

The allocation of the buffer that is required for I/O can be considered
core functionality. Move the idal buffer allocation to tape_core.c,
make it generally available, and reduce code duplication.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:25:55 +02:00
Jan Höppner
e039400f75 s390/tape: Fix return value of ccw helper functions
In contrast to all other helper functions used to build CCW chains,
tape_ccw_cc_idal() and tape_ccw_end_idal() return their values using
post-increments, which results in returning the same (current) CCW pointer.

However, the intent of the CCW helper functions is to return the _next_
CCW in the chain, which can then be processed.

There is currently no actual issue, as tape_ccw_cc_idal() is not used
yet and tape_ccw_end_idal() is only used at the end of a chain.

Change both functions' return statements to ccw + 1 and bring them in line
with the other helper functions.
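
The change itself is a one-line adjustment per function (sketch):

  /* before: post-increment hands back the current CCW */
  return ccw++;

  /* after: hand back the next CCW in the chain */
  return ccw + 1;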

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:25:55 +02:00
Jan Höppner
83cff1b124 s390/tape: Remove extra CCW allocation for error recovery
The Read Opposite error recovery code required two extra CCWs to be
allocated in order to transform the request. As this error recovery code
for both 34xx and 3590 was removed, the additional allocation isn't
required anymore. Reduce the allocation to two CCWs accordingly.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:25:54 +02:00
Jan Höppner
a984d71277 s390/tape: Remove 3590 Read Opposite error recovery
On old native type 3590 tape devices a Read Opposite error recovery
procedure on Error Recovery Action Code (ERA) 26 was issued if a Read
Forward command failed. This recovery procedure was implemented with the
Read Backward command. This is no longer supported.

Remove 3590 ERA 26 and Read Backward related recovery code.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:25:54 +02:00
Jan Höppner
1b9df1a28f s390/tape: Remove 34xx Read Opposite error recovery
On old native type 3490 tape devices a Read Opposite error recovery
procedure on Error Recovery Action Code (ERA) 26 was issued if a Read
Forward command failed. This recovery procedure was implemented with the
Read Backward command.

As a preparation for a subsequent commit, that adds support for bigger
block sizes, remove the 34xx ERA 26 related recovery code. The recovery
code would need to be adapted to the bigger block sizes, without any
possibility to be tested, as modern Virtual Tape Servers (VTS) do
neither report ERA 26 on a Read Forward command failure nor support the
error recovery procedure anymore.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:25:54 +02:00
Jan Höppner
39376c77a5 s390/tape: Remove count parameter from read/write_block functions
The count parameter of the read/write_block discipline functions was
never used. Remove it.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:25:54 +02:00
Heiko Carstens
ec9b3b85ea Merge branch 'memory-hotplug'
Sumanth Korikkar says:

====================

Provide a new interface for dynamic configuration and
deconfiguration of hotplug memory on s390, with or without
memmap_on_memory support. It is a follow-up on the discussion with David
when introducing memmap_on_memory support for s390, about supporting
dynamic (de)configuration of memory:
https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/
https://lore.kernel.org/all/20241202082732.3959803-1-sumanthk@linux.ibm.com/

The original motivation for introducing memmap_on_memory on s390 was to
avoid using online memory to store struct pages metadata, particularly
for standby memory blocks. This became critical in cases where there was
an imbalance between standby and online memory, potentially leading to
boot failures due to insufficient memory for metadata allocation.

To address this, memmap_on_memory was utilized on s390. However, in its
current form, it adds struct pages metadata at the start of each memory
block at the time of addition (only for standby memory), and this
configuration is static: it cannot be changed at runtime (for example
when the user needs contiguous physical memory).

In order to provide more flexibility to the user and overcome the above
limitation, add an option to dynamically configure and deconfigure
hotpluggable memory blocks with or without memmap_on_memory.

With the new interface, s390 will no longer add all possible hotplug
memory in advance, as before, to make it visible in sysfs for
online/offline actions. Instead, before a memory block can be set online,
it has to be configured via a new interface in
/sys/firmware/memory/memoryX/config, which makes s390 similar to other
architectures: adding hotpluggable memory is controlled by the user
instead of being done at boot time.

The s390 kernel sysfs interface to configure/deconfigure memory with
memmap_on_memory (with upcoming lsmem changes):

* Initial memory layout:
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                 SIZE   STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff   2G  online 0-15  yes        no
0x80000000-0xffffffff   2G offline 16-31 no         yes

* Configure memory
echo 1 > /sys/firmware/memory/memory16/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE   BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff    2G  online  0-15  yes        no
0x80000000-0x87ffffff  128M offline    16  yes        yes
0x88000000-0xffffffff  1.9G offline 17-31  no         yes

* Deconfigure memory
echo 0 > /sys/firmware/memory/memory16/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                 SIZE   STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff   2G  online 0-15  yes        no
0x80000000-0xffffffff   2G offline 16-31 no         yes

* Enable memmap_on_memory and online it.
(Deconfigure first)
echo 0 > /sys/devices/system/memory/memory5/online
echo 0 > /sys/firmware/memory/memory5/config

lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE  BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff  640M  online 0-4   yes        no
0x28000000-0x2fffffff  128M offline 5     no         no
0x30000000-0x7fffffff  1.3G  online 6-15  yes        no
0x80000000-0xffffffff    2G offline 16-31 no         yes

(Enable memmap_on_memory and online it)
echo 1 > /sys/firmware/memory/memory5/memmap_on_memory
echo 1 > /sys/firmware/memory/memory5/config
echo 1 > /sys/devices/system/memory/memory5/online

lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE   BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff  640M  online  0-4   yes        no
0x28000000-0x2fffffff  128M  online  5     yes        yes
0x30000000-0x7fffffff  1.3G  online  6-15  yes        no
0x80000000-0xffffffff    2G  offline 16-31 no         yes

* Disable memmap_on_memory and online it.
(Deconfigure first)
echo 0 > /sys/devices/system/memory/memory5/online
echo 0 > /sys/firmware/memory/memory5/config

lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE  BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff  640M  online 0-4   yes        no
0x28000000-0x2fffffff  128M offline 5     no         yes
0x30000000-0x7fffffff  1.3G  online 6-15  yes        no
0x80000000-0xffffffff    2G offline 16-31 no         yes

(Disable memmap_on_memory and online it)
echo 0 > /sys/firmware/memory/memory5/memmap_on_memory
echo 1 > /sys/firmware/memory/memory5/config
echo 1 > /sys/devices/system/memory/memory5/online

lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE   BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff  2G    online  0-15  yes        no
0x80000000-0xffffffff  2G    offline 16-31 no         yes

* Userspace changes:
lsmem/chmem tool is also changed to use the new interface. I will send
it to util-linux soon.

Patch 1 adds support for removal of boot-allocated memory blocks.

Patch 2 provides option to dynamically configure and deconfigure memory
with/without memmap_on_memory.

Patch 3 removes MHP_OFFLINE_INACCESSIBLE from s390. The mhp flag was
used to mark memory as not accessible until the memory hotplug online
phase begins. However, with patch 2, it is no longer essential: memory
can be brought to an accessible state before adding it, as the memory is
now added at runtime instead of at boot time.

Patch 4 removes the MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers. They
are no longer needed: with runtime (de)configuration of memory, memory
can be brought to an accessible state before it is added.

Note: The patches apply to the linux-next branch.

v3:
Thanks David
* Avoid goto label in create_standby_sclp_mems().
* Use unsigned long instead of u64.
* Add Acked-by.

v2:
Thanks David
* Rename struct mblock/mblock_arg with struct sclp_mem/sclp_mem_arg.
* Rename all mblocks/mblock references with sclp_mems/sclp_mem -
  structures, functions.
* Rename create_online_mblock() with create_configured_sclp_mem().
* Rename config_mblock_show()/config_mblock_store() with
  config_sclp_mem_show()/config_sclp_mem_store().
* Remove contains_standby_increment() and
  sclp_mem_notifier. sclp mem state change is performed when
  adding/removing memory. sclp memory notifier - no longer needed with
  this patchset.
* Recover sclp mem state when add_memory() fails.
* Refactor and add function init_sclp_mem().
* Use unsigned long instead of unsigned long long.
* Simplify and correct kobj handling. Thanks Heiko.

====================

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:19:30 +02:00
Heiko Carstens
c97689345c s390/con3270: Use scnprintf() instead of sprintf()
Use scnprintf() instead of sprintf() for those cases where the destination
is an array and the size of the array is known at compile time.

This prevents theoretical buffer overflows, and also spares people from
having to figure out again and again whether the code is actually safe.
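
As a minimal sketch of the pattern (buffer name, size and format string
are illustrative, not taken from the con3270 code):

  char buf[16];

  /* before: nothing bounds the formatted output */
  sprintf(buf, "%u", value);

  /* after: output is limited to the size of the destination array */
  scnprintf(buf, sizeof(buf), "%u", value);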

Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:30 +02:00
Heiko Carstens
c769941de8 s390/tape: Use scnprintf() instead of sprintf()
Use scnprintf() instead of sprintf() for those cases where the destination
is an array and the size of the array is known at compile time.

This prevents theoretical buffer overflows, and also spares people from
having to figure out again and again whether the code is actually safe.

Reviewed-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:30 +02:00
Heiko Carstens
ffb5d3af5e s390/dcss: Use scnprintf() instead of sprintf()
Use scnprintf() instead of sprintf() for those cases where the destination
is an array and the size of the array is known at compile time.

This prevents theoretical buffer overflows, and also spares people from
having to figure out again and again whether the code is actually safe.

Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:30 +02:00
Heiko Carstens
ba06238bbe s390/cio: Use scnprintf() instead of sprintf()
Use scnprintf() instead of sprintf() for those cases where the destination
is an array and the size of the array is known at compile time.

This prevents theoretical buffer overflows, and also spares people from
having to figure out again and again whether the code is actually safe.

Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:30 +02:00
Heiko Carstens
6850221116 s390/early: Use scnprintf() instead of sprintf()
Use scnprintf() instead of sprintf() for those cases where the destination
is an array and the size of the array is known at compile time.

This prevents theoretical buffer overflows, and also spares people from
having to figure out again and again whether the code is actually safe.

Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:30 +02:00
Thomas Richter
4d065f3c80 s390/pai_crypto: Adjust paicrypt_copy() return statement
Adjust the return statement in paicrypt_copy() to the same statement
as in paiext_copy(). Use one common style. No functional change.

Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:29 +02:00
Josephine Pfeiffer
4738e11662 s390/sysinfo: Replace sprintf() with snprintf() for buffer safety
Replace sprintf() with snprintf() when formatting symlink target name
to prevent potential buffer overflow. The link_to buffer is only 10
bytes, and using snprintf() ensures proper bounds checking if the
topology nesting limit value is unexpectedly large.

Signed-off-by: Josephine Pfeiffer <hi@josie.lol>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:29 +02:00
Josephine Pfeiffer
5379879a76 s390/extmem: Replace sprintf() with snprintf() for buffer safety
Replace unsafe sprintf() calls with snprintf() in segment_save() to
prevent potential buffer overflows. The function builds command strings
by repeatedly appending to a fixed-size buffer, which could overflow if
segment ranges are numerous or values are large.

Signed-off-by: Josephine Pfeiffer <hi@josie.lol>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:29 +02:00
Josephine Pfeiffer
dd7d1d34ae s390/cmm: Replace sprintf() with scnprintf() for buffer safety
Replace sprintf() with scnprintf() in cmm_timeout_handler() to prevent
potential buffer overflow. The scnprintf() function ensures we don't
write beyond the buffer size and provides safer string formatting.

Signed-off-by: Josephine Pfeiffer <hi@josie.lol>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-21 10:17:20 +02:00
Christophe JAILLET
ac646f4495 genirq/msi: Slightly simplify msi_domain_alloc()
The return value of irq_find_mapping() is only tested, not used for
anything else.

Replace it with irq_resolve_mapping(), which is used internally by
irq_find_mapping() and allows a simple boolean decision.
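
A sketch of the simplification (the surrounding msi_domain_alloc()
context is abbreviated; the error value shown is illustrative):

  /* before: the returned virq is only compared, never used */
  if (irq_find_mapping(domain, hwirq) > 0)
          return -EEXIST;

  /* after: the returned pointer allows a plain boolean test */
  if (irq_resolve_mapping(domain, hwirq))
          return -EEXIST;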

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/1ce680114cdb8d40b072c54d7f015696a540e5a6.1760863194.git.christophe.jaillet@wanadoo.fr
2025-10-20 20:18:48 +02:00
Johan Hovold
a7f25e00c4 irqchip/qcom-irq-combiner: Rename driver structure
The "_probe" suffix of the driver structure name prevents modpost from
warning about section mismatches so replace it to catch any future
issues like the recently fixed probe function being incorrectly marked
as __init.

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-10-17 15:18:18 +02:00
Yazen Ghannam
6553c68bc7 RAS/AMD/ATL: Return error codes from helper functions
Pass up error codes from helper functions rather than discarding them.

Suggested-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-10-17 14:38:42 +02:00
Elena Reshetova
0f2753efc5 x86/sgx: Enable automatic SVN updates for SGX enclaves
== Background ==

ENCLS[EUPDATESVN] is a new SGX instruction [1] which allows enclave
attestation to include information about updated microcode SVN without a
reboot. Before an EUPDATESVN operation can be successful, all SGX memory
(aka. EPC) must be marked as “unused” in the SGX hardware metadata
(aka. EPCM). This requirement ensures that no compromised enclave can
survive the EUPDATESVN procedure and provides an opportunity to generate
new cryptographic assets.

== Solution ==

Attempt to execute ENCLS[EUPDATESVN] every time the first file descriptor
is obtained via sgx_(vepc_)open(). In the most common case the microcode
SVN is already up-to-date, and the operation succeeds without updating SVN.

Note: while in such cases the underlying crypto assets are regenerated, it
does not affect enclaves' visible keys obtained via EGETKEY instruction.

If it fails with any error code other than SGX_INSUFFICIENT_ENTROPY, this
is considered unexpected and the *open() returns an error. This should not
happen in practice.

In contrast, SGX_INSUFFICIENT_ENTROPY might happen due to pressure on the
system's DRNG (RDSEED), and therefore the *open() can be safely retried to
allow normal enclave operation.

[1] Runtime Microcode Updates with Intel Software Guard Extensions,
https://cdrdv2.intel.com/v1/dl/getContent/648682

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
2025-10-16 14:42:09 -07:00
Elena Reshetova
4e75697faa x86/sgx: Implement ENCLS[EUPDATESVN]
All running enclaves and cryptographic assets (such as internal SGX
encryption keys) are assumed to be compromised whenever an SGX-related
microcode update occurs. To mitigate this assumed compromise the new
supervisor SGX instruction ENCLS[EUPDATESVN] can generate fresh
cryptographic assets.

Before executing EUPDATESVN, all SGX memory must be marked as unused. This
requirement ensures that no potentially compromised enclave survives the
update and allows the system to safely regenerate cryptographic assets.

Add the method to perform ENCLS[EUPDATESVN]. However, until the follow up
patch that wires calling sgx_update_svn() from sgx_inc_usage_count(), this
code is not reachable.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
2025-10-16 14:42:09 -07:00
Elena Reshetova
7b502832ee x86/sgx: Define error codes for use by ENCLS[EUPDATESVN]
Add error codes for ENCLS[EUPDATESVN] so that the SGX CPUSVN update process
can know the execution state of EUPDATESVN and notify userspace.

EUPDATESVN will only be called when it is guaranteed that there are no
active SGX users. Only add the error codes that can legally happen. E.g.,
it could also fail due to "SGX not ready" when there are SGX users, but
that cannot happen in this implementation.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
2025-10-16 14:42:09 -07:00
Elena Reshetova
6ffdb49101 x86/cpufeatures: Add X86_FEATURE_SGX_EUPDATESVN feature flag
Add a flag indicating whether the ENCLS[EUPDATESVN] SGX instruction is
supported. This will be used by the SGX driver to perform CPU SVN updates.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
2025-10-16 14:42:08 -07:00
Elena Reshetova
483fc19e9c x86/sgx: Introduce functions to count the sgx_(vepc_)open()
Currently, when SGX is compromised and the microcode update fix is applied,
the machine needs to be rebooted to invalidate old SGX crypto-assets and
bring SGX into an updated, safe state. This is not friendly for the cloud.

To avoid having to reboot, the new ENCLS[EUPDATESVN] instruction is
introduced to update the SGX environment at runtime. This process needs to
be done when there are no SGX users, to make sure no compromised enclave
can survive the update, and to allow the system to regenerate
crypto-assets.

For now there is no counter to track the active SGX users of the host
enclave and virtual EPC. Introduce such a counter mechanism so that
EUPDATESVN can be done only when there are no SGX users.

Define placeholder functions sgx_inc/dec_usage_count() that are used to
increment and decrement such a counter. Also, wire the call sites for
these functions. Encapsulate the current sgx_(vepc_)open() into
__sgx_(vepc_)open() to make the new sgx_(vepc_)open() easy to read.

The definition of the counter itself and the actual implementation of
sgx_inc/dec_usage_count() functions come next.

Note: The EUPDATESVN, which may fail, will be done in
sgx_inc_usage_count(). Make it return 'int' to make subsequent patches
which implement EUPDATESVN easier to review. For now it always returns
success.
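
A hedged sketch of the wrapper pattern described above (error handling
simplified; only the function names from the text are used):

static int sgx_open(struct inode *inode, struct file *file)
{
	int ret;

	ret = sgx_inc_usage_count();
	if (ret)
		return ret;

	ret = __sgx_open(inode, file);
	if (ret)
		sgx_dec_usage_count();

	return ret;
}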

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Nataliia Bondarevska <bondarn@google.com>
2025-10-16 14:42:08 -07:00
Nam Cao
dce7450093 PCI/MSI: Delete pci_msi_create_irq_domain()
pci_msi_create_irq_domain() is now unused. Delete it.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
2025-10-16 21:09:52 +02:00
Samuel Holland
3a16b05384 irqchip/riscv-imsic: Inline imsic_vector_from_local_id()
This function is only called from one place, which is in the interrupt
handling hot path. Inline it to improve code generation and to take
advantage of this_cpu operations. lpriv and imsic->base_domain can never be
NULL because irq_set_chained_handler() is called after they are allocated.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-10-16 18:17:28 +02:00
Samuel Holland
79eaabc61d irqchip/riscv-imsic: Embed the vector array in lpriv
Reduce pointer chasing and the number of allocations by using a flexible
array member for the vector array instead of a separate allocation.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-10-16 18:17:28 +02:00
Samuel Holland
c475c0b713 irqchip/riscv-imsic: Remove redundant irq_data lookups
imsic_irq_set_affinity() already takes the irq_data pointer as a
parameter, so it is pointless to look it up again from the IRQ number.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-10-16 18:17:28 +02:00
Johan Hovold
dcc31768ff irqchip/ts4800: Drop unused module alias
The driver has never supported anything but OF probing so drop the
unused platform alias.

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-10-16 18:17:28 +02:00
Johan Hovold
b03127a4e7 irqchip/mvebu-pic: Drop unused module alias
The driver has never supported anything but OF probing so drop the
unused platform alias.

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
2025-10-16 18:17:28 +02:00
Johan Hovold
867c6aa283 irqchip/meson-gpio: Drop unused module alias
The driver has never supported anything but OF probing so drop the
unused platform alias that was erroneously added by commit a947aa00ed
("irqchip/meson-gpio: Make it possible to build as a module").

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-10-16 18:17:27 +02:00
Johan Hovold
1230fbb225 irqchip: Enable compile testing of Broadcom drivers
There seems to be nothing preventing the Broadcom drivers from being
compile tested so enable that for wider build coverage.

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
2025-10-16 18:17:27 +02:00
Johan Hovold
1e3e330c07 irqchip: Pass platform device to platform drivers
The IRQCHIP_PLATFORM_DRIVER macros can be used to convert OF irqchip
drivers to platform drivers but currently reuse the OF init callback
prototype that only takes OF nodes as arguments. This forces drivers to
do reverse lookups of their struct devices during probe if they need
them for things like dev_printk() and device managed resources.

Half of the drivers doing reverse lookups also currently fail to release
the additional reference taken during the lookup, while other drivers
have had the reference leak plugged in various ways (e.g. using
non-intuitive cleanup constructs which still confuse static checkers).

Switch to using a probe callback that takes a platform device as its
first argument to simplify drivers and plug the remaining (mostly
benign) reference leaks.

Fixes: 32c6c05466 ("irqchip: Add Broadcom BCM2712 MSI-X interrupt controller")
Fixes: 70afdab904 ("irqchip: Add IMX MU MSI controller driver")
Fixes: a6199bb514 ("irqchip: Add Qualcomm MPM controller driver")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Changhuang Liang <changhuang.liang@starfivetech.com>
2025-10-16 18:17:27 +02:00
Randy Dunlap
762a3d1ca2 x86/idtentry: Add missing '*' to kernel-doc lines
Fix kernel-doc warnings by adding the missing '*' to each line.

  Warning: include/asm/idtentry.h:395 bad line:    when raised from kernel mode
  Warning: include/asm/idtentry.h:405 bad line:    when raised from user mode

Since this is in a kernel-doc block, these lines need a leading
" *" on each line to prevent the warnings.

Fixes: a13644f3a5 ("x86/entry/64: Add entry code for #VC handler")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-10-16 17:45:42 +02:00
Johan Hovold
3540d99c03 irqchip: Drop leftover brackets
Drop some unnecessary brackets in platform_irqchip_probe() mistakenly
left by commit 9322d1915f ("irqchip: Plug a OF node reference leak in
platform_irqchip_probe()").

Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
2025-10-16 11:30:38 +02:00
Johan Hovold
9b685058ca irqchip/qcom-irq-combiner: Fix section mismatch
Platform drivers can be probed after their init sections have been
discarded so the probe callback must not live in init.

Fixes: f20cc9b00c ("irqchip/qcom: Add IRQ combiner driver")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-10-16 11:30:38 +02:00
Johan Hovold
f798bdb9aa irqchip/starfive-jh8100: Fix section mismatch
Platform drivers can be probed after their init sections have been
discarded so the irqchip init callback must not live in init.

Fixes: e4e5350361 ("irqchip: Add StarFive external interrupt controller")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Changhuang Liang <changhuang.liang@starfivetech.com>
2025-10-16 11:30:38 +02:00
Johan Hovold
5b338fbb2b irqchip/renesas-rzg2l: Fix section mismatch
Platform drivers can be probed after their init sections have been
discarded so the irqchip init callbacks must not live in init.

Fixes: d011c022ef ("irqchip/renesas-rzg2l: Add support for RZ/Five SoC")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
2025-10-16 11:30:38 +02:00
Johan Hovold
64acfd8e68 irqchip/imx-mu-msi: Fix section mismatch
Platform drivers can be probed after their init sections have been
discarded so the irqchip init callbacks must not live in init.

Fixes: 70afdab904 ("irqchip: Add IMX MU MSI controller driver")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2025-10-16 11:30:38 +02:00
Johan Hovold
bbe1775924 irqchip/irq-brcmstb-l2: Fix section mismatch
Platform drivers can be probed after their init sections have been
discarded so the irqchip init callbacks must not live in init.

Fixes: 51d9db5c8f ("irqchip/irq-brcmstb-l2: Switch to IRQCHIP_PLATFORM_DRIVER")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
2025-10-16 11:30:37 +02:00
Johan Hovold
bfc0c5beab irqchip/irq-bcm7120-l2: Fix section mismatch
Platform drivers can be probed after their init sections have been
discarded so the irqchip init callbacks must not live in init.

Fixes: 3ac268d5ed ("irqchip/irq-bcm7120-l2: Switch to IRQCHIP_PLATFORM_DRIVER")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
2025-10-16 11:30:37 +02:00
Johan Hovold
e9db5332ca irqchip/irq-bcm7038-l1: Fix section mismatch
Platform drivers can be probed after their init sections have been
discarded so the irqchip init callback must not live in init.

Fixes: c057c799e3 ("irqchip/irq-bcm7038-l1: Switch to IRQCHIP_PLATFORM_DRIVER")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
2025-10-16 11:30:37 +02:00
Johan Hovold
a8452d1d59 irqchip/bcm2712-mip: Fix section mismatch
Platform drivers can be probed after their init sections have been
discarded so the irqchip init callback must not live in init.

Fixes: 32c6c05466 ("irqchip: Add Broadcom BCM2712 MSI-X interrupt controller")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
2025-10-16 11:30:37 +02:00
Johan Hovold
0435bcc4e5 irqchip/bcm2712-mip: Fix OF node reference imbalance
The init callback must not decrement the reference count of the provided
irqchip OF node.

This should not cause any trouble currently, but if the driver ever
starts probe deferring it could lead to warnings about reference
underflow and saturation.

Fixes: 32c6c05466 ("irqchip: Add Broadcom BCM2712 MSI-X interrupt controller")
Signed-off-by: Johan Hovold <johan@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
2025-10-16 11:30:37 +02:00
Chang S. Bae
bffeb2fd0b x86/microcode/intel: Enable staging when available
With staging support implemented, enable it when the CPU reports the
feature.

  [ bp: Sort in the MSR properly. ]

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
2025-10-15 16:47:50 +02:00
Chang S. Bae
4ab410287b x86/microcode/intel: Support mailbox transfer
The functions for sending microcode data and retrieving the next offset
were previously placeholders, as they need to handle a specific mailbox
format.

While the kernel supports similar mailboxes, none of them are compatible
with this one. Attempts to share code led to unnecessary complexity, so
add a dedicated implementation instead.

  [ bp: Sort the include properly. ]

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
2025-10-15 16:47:43 +02:00
Chang S. Bae
afc3b50954 x86/microcode/intel: Implement staging handler
Previously, per-package staging invocations and their associated state
data were established. The next step is to implement the actual staging
handler according to the specified protocol. Below are key aspects to
note:

  (a)  Each staging process must begin by resetting the staging hardware.

  (b)  The staging hardware processes up to a page-sized chunk of the
       microcode image per iteration, requiring software to submit data
       incrementally.

  (c)  Once a data chunk is processed, the hardware responds with an
       offset in the image for the next chunk.

  (d)  The offset may indicate completion or request retransmission of an
       already transferred chunk. As long as the total transferred data
       remains within the predefined limit (twice the image size),
       retransmissions should be acceptable.

Incorporate them in the handler, while data transmission and mailbox
format handling are implemented separately.

  [ bp: Sort the headers in a reversed name-length order. ]

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
2025-10-15 16:47:37 +02:00
Chang S. Bae
079b90d4ba x86/microcode/intel: Define staging state struct
Define a staging_state struct to simplify function prototypes by consolidating
relevant data, instead of passing multiple local variables.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
2025-10-15 16:47:31 +02:00
Chang S. Bae
740144bc6b x86/microcode/intel: Establish staging control logic
When microcode staging is initiated, operations are carried out through
an MMIO interface. Each package has a unique interface specified by the
IA32_MCU_STAGING_MBOX_ADDR MSR, which maps to a set of 32-bit registers.

Prepare staging with the following steps:

  1.  Ensure the microcode image is 32-bit aligned to match the MMIO
      register size.

  2.  Identify each MMIO interface based on its per-package scope.

  3.  Invoke the staging function for each identified interface, which
      will be implemented separately.

  [ bp: Improve error logging. ]

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/all/871pznq229.ffs@tglx
2025-10-15 16:47:20 +02:00
Chang S. Bae
7cdda85ed9 x86/microcode: Introduce staging step to reduce late-loading time
As microcode patch sizes continue to grow, late-loading latency spikes can
lead to timeouts and disruptions in running workloads. This trend of
increasing patch sizes is expected to continue, so a foundational solution is
needed to address the issue.

To mitigate the problem, introduce a microcode staging feature. This option
processes most of the microcode update (excluding activation) on
a non-critical path, allowing CPUs to remain operational during the majority
of the update. By offloading work from the critical path, staging can
significantly reduce latency spikes.

Integrate staging as a preparatory step in late-loading. Introduce a new
callback for staging, which is invoked at the beginning of
load_late_stop_cpus(), before CPUs enter the rendezvous phase.

Staging follows an opportunistic model:

  *  If successful, it reduces CPU rendezvous time
  *  Even if it fails, the process falls back to the legacy path to
     finish the loading process, but with potentially higher latency.

Extend struct microcode_ops to incorporate staging properties, which will be
implemented in the vendor code separately.
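
A hedged sketch of the hook point described above (the callback member
name is hypothetical, not confirmed by this text):

  /* at the start of load_late_stop_cpus(), before the CPU rendezvous */
  if (microcode_ops->stage_microcode)
          microcode_ops->stage_microcode();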

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Tested-by: Anselm Busse <abusse@amazon.de>
Link: https://lore.kernel.org/20250320234104.8288-1-chang.seok.bae@intel.com
2025-10-15 16:46:58 +02:00
Chang S. Bae
ed44a5625f x86/cpu/topology: Make primary thread mask available with SMP=n
cpu_primary_thread_mask is only defined when CONFIG_SMP=y. However, even
in UP kernels there is always exactly one CPU, which can reasonably be
treated as the primary thread.

Historically, topology_is_primary_thread() always returned true with
CONFIG_SMP=n. A recent commit:

  4b455f5994 ("cpu/SMT: Provide a default topology_is_primary_thread()")

replaced it with a generic implementation with the note:

  "When disabling SMT, the primary thread of the SMT will remain
   enabled/active. Architectures that have a special primary thread (e.g.
   x86) need to override this function. ..."

For consistency and clarity, make the primary thread mask available
regardless of SMP, similar to cpu_possible_mask and cpu_present_mask.

Move __cpu_primary_thread_mask into common code to prevent build issues.
Let cpu_mark_primary_thread() configure the mask even for UP kernels,
alongside other masks. Then, topology_is_primary_thread() can
consistently reference it.
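
With the mask available unconditionally, the x86 override can be reduced
to a plain mask test along these lines (sketch):

static inline bool topology_is_primary_thread(unsigned int cpu)
{
	return cpumask_test_cpu(cpu, cpu_primary_thread_mask);
}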

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250320234104.8288-1-chang.seok.bae@intel.com
2025-10-15 16:46:11 +02:00
Sumanth Korikkar
300709fbef mm/memory_hotplug: Remove MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers were introduced
to prepare the transition of memory to and from a physically accessible
state. This enhancement was crucial for implementing the "memmap on memory"
feature for s390.

With the introduction of dynamic (de)configuration of hotpluggable memory,
memory can be brought to an accessible state before add_memory() and to an
inaccessible state before remove_memory(). Hence, there is no need for the
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers anymore.

This basically reverts commit
c5f1e2d189 ("mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers")
Additionally, apply minor adjustments to the function parameters of
move_pfn_range_to_zone() and mhp_supports_memmap_on_memory() to ensure
compatibility with the latest branch.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-14 14:24:53 +02:00
Sumanth Korikkar
ce2071e02d s390/sclp: Remove MHP_OFFLINE_INACCESSIBLE
The mhp flag MHP_OFFLINE_INACCESSIBLE was used to mark memory as not
accessible until the memory hotplug online phase begins.

Earlier, standby memory blocks were added upfront at boot time, and the
MHP_OFFLINE_INACCESSIBLE flag avoided page_init_poison() on the memmap
during the mhp addition phase.

However, with dynamic runtime configuration of memory, standby memory can
be brought to an accessible state before performing add_memory(). Hence,
remove MHP_OFFLINE_INACCESSIBLE.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-14 14:24:53 +02:00
Sumanth Korikkar
ff18dcb19a s390/sclp: Add support for dynamic (de)configuration of memory
Provide a new interface for dynamic configuration and deconfiguration of
hotplug memory, with or without memmap_on_memory support. It is a
follow-up on the discussion with David when introducing memmap_on_memory
support for s390, about supporting dynamic (de)configuration of memory:
https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/
https://lore.kernel.org/all/20241202082732.3959803-1-sumanthk@linux.ibm.com/

The original motivation for introducing memmap_on_memory on s390 was to
avoid using online memory to store struct pages metadata, particularly
for standby memory blocks. This became critical in cases where there was
an imbalance between standby and online memory, potentially leading to
boot failures due to insufficient memory for metadata allocation.

To address this, memmap_on_memory was utilized on s390. However, in its
current form, it adds struct pages metadata at the start of each memory
block at the time of addition, and this configuration is static: it
cannot be changed at runtime (for example when the user needs contiguous
physical memory).

In order to provide more flexibility to the user and overcome the above
limitation, add an option to dynamically configure and deconfigure
hotpluggable memory blocks with or without memmap_on_memory.

With the new interface, s390 will no longer add all possible hotplug
memory in advance, as before, to make it visible in sysfs for
online/offline actions. Instead, before a memory block can be set online,
it has to be configured via a new interface in
/sys/firmware/memory/memoryX/config, which makes s390 similar to other
architectures: adding hotpluggable memory is controlled by the user
instead of being done at boot time.

The s390 kernel sysfs interface to configure and deconfigure memory is
as follows (considering the upcoming lsmem changes):

* Initial memory layout:
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                 SIZE   STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff   2G  online 0-15  yes        no
0x80000000-0xffffffff   2G offline 16-31 no         yes

* Configure memory
sys="/sys"
echo 1 > $sys/firmware/memory/memory16/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE   BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff    2G  online  0-15  yes        no
0x80000000-0x87ffffff  128M offline    16  yes        yes
0x88000000-0xffffffff  1.9G offline 17-31  no         yes

* Deconfigure memory
echo 0 > $sys/firmware/memory/memory16/config
lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                 SIZE   STATE BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff   2G  online 0-15  yes        no
0x80000000-0xffffffff   2G offline 16-31 no         yes

* Enable memmap_on_memory and online it.
echo 0 > $sys/devices/system/memory/memory5/online
echo 0 > $sys/firmware/memory/memory5/config

lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE  BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff  640M  online 0-4   yes        no
0x28000000-0x2fffffff  128M offline 5     no         no
0x30000000-0x7fffffff  1.3G  online 6-15  yes        no
0x80000000-0xffffffff    2G offline 16-31 no         yes

echo 1 > $sys/firmware/memory/memory5/memmap_on_memory
echo 1 > $sys/firmware/memory/memory5/config
echo 1 > $sys/devices/system/memory/memory5/online

lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE   BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff  640M  online  0-4   yes        no
0x28000000-0x2fffffff  128M  online  5     yes        yes
0x30000000-0x7fffffff  1.3G  online  6-15  yes        no
0x80000000-0xffffffff    2G  offline 16-31 no         yes

* Disable memmap_on_memory and online it.
echo 0 > $sys/devices/system/memory/memory5/online
echo 0 > $sys/firmware/memory/memory5/config

lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE  BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x27ffffff  640M  online 0-4   yes        no
0x28000000-0x2fffffff  128M offline 5     no         yes
0x30000000-0x7fffffff  1.3G  online 6-15  yes        no
0x80000000-0xffffffff    2G offline 16-31 no         yes

echo 0 > $sys/firmware/memory/memory5/memmap_on_memory
echo 1 > $sys/firmware/memory/memory5/config
echo 1 > $sys/devices/system/memory/memory5/online

lsmem -o RANGE,SIZE,STATE,BLOCK,CONFIGURED,MEMMAP_ON_MEMORY
RANGE                  SIZE  STATE   BLOCK CONFIGURED MEMMAP_ON_MEMORY
0x00000000-0x7fffffff  2G    online  0-15  yes        no
0x80000000-0xffffffff  2G    offline 16-31 no         yes

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-14 14:24:53 +02:00
Sumanth Korikkar
d5e88d32de s390/mm: Support removal of boot-allocated virtual memory map
On s390, memory blocks are not currently removed via
arch_remove_memory(). With upcoming dynamic memory (de)configuration
support, runtime removal of memory blocks is possible. This internally
involves tearing down identity mapping, virtual memory mappings and
freeing the physical memory backing the struct pages metadata.

During early boot, physical memory used to back the struct pages
metadata in vmemmap is allocated through:

setup_arch()
  -> sparse_init()
    -> sparse_init_nid()
      -> __populate_section_memmap()
        -> vmemmap_alloc_block_buf()
          -> sparse_buffer_alloc()
            -> memblock_alloc()

Here, sparse_init_nid() sets up virtual-to-physical mapping for struct
pages backed by memblock_alloc(). This differs from runtime addition of
hotplug memory which uses the buddy allocator later.

To correctly free identity mappings and vmemmap mappings during
hot-remove, boot-time and runtime allocations must be distinguished using
the PageReserved bit:

* Boot-time memory, such as identity-mapped page tables allocated via
  boot_crst_alloc() and reserved via reserve_pgtables() is marked
  PageReserved in memmap_init_reserved_pages().

* Physical memory backing vmemmap (struct pages from memblock_alloc())
  is also marked PageReserved similarly.

During teardown, the PageReserved bit is checked to distinguish between
a boot-time allocation and a buddy allocation.

This is similar to commit 645d5ce2f7 ("powerpc/mm/radix: Fix PTE/PMD
fragment count for early page table mappings")

Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-10-14 14:24:53 +02:00
Andrew Cooper
4ab13be5ed x86/fred: Fix 64bit identifier in fred_ss
FRED can only be enabled in Long Mode. This is the 64-bit mode (as opposed
to compatibility mode) identifier, rather than being something hard-wired
to 1.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Acked-by: H. Peter Anvin (Intel) <hpa@zytor.com>
2025-10-13 14:05:42 -07:00
Chen Yu
a0a0999507 x86/resctrl: Support Sub-NUMA Cluster (SNC) mode on Clearwater Forest
Clearwater Forest supports SNC mode. Add it to the snc_cpu_ids[] table.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Tony Luck <tony.luck@intel.com>
2025-10-13 16:59:55 +02:00
Borislav Petkov (AMD)
ddde4abaa0 x86/cpufeatures: Make X86_FEATURE leaf 17 Linux-specific
That cpuinfo_x86.x86_capability[] element was supposed to mirror CPUID flags
from CPUID_0x80000007_EBX but that leaf has still to this day only three bits
defined in it. So move those bits to scattered.c and free the capability
element for synthetic flags.

No functional changes.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-10-13 16:21:25 +02:00
696 changed files with 16100 additions and 11418 deletions

@@ -406,24 +406,8 @@ index of the MC::
|->mc2
....
Under each ``mcX`` directory each ``csrowX`` is again represented by a
``csrowX``, where ``X`` is the csrow index::
.../mc/mc0/
|
|->csrow0
|->csrow2
|->csrow3
....
Notice that there is no csrow1, which indicates that csrow0 is composed
of a single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.
Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.
Within each of the ``mcX`` directory are several EDAC control and
attribute files.
``mcX`` directories
-------------------
@@ -569,7 +553,7 @@ this ``X`` memory module:
- Unbuffered-DDR
.. [#f5] On some systems, the memory controller doesn't have any logic
to identify the memory module. On such systems, the directory is called ``rankX`` and works on a similar way as the ``csrowX`` directories.
to identify the memory module. On such systems, the directory is called ``rankX``.
On modern Intel memory controllers, the memory controller identifies the
memory modules directly. On such systems, the directory is called ``dimmX``.
@@ -577,126 +561,6 @@ this ``X`` memory module:
symlinks inside the sysfs mapping that are automatically created by
the sysfs subsystem. Currently, they serve no purpose.
``csrowX`` directories
----------------------
When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
directories. As this API doesn't work properly for Rambus, FB-DIMMs and
modern Intel Memory Controllers, this is being deprecated in favor of
``dimmX`` directories.
In the ``csrowX`` directories are EDAC control and attribute files for
this ``X`` instance of csrow:
- ``ue_count`` - Total Uncorrectable Errors count attribute file
This attribute file displays the total count of uncorrectable
errors that have occurred on this csrow. If panic_on_ue is set
this counter will not have a chance to increment, since EDAC
will panic the system.
- ``ce_count`` - Total Correctable Errors count attribute file
This attribute file displays the total count of correctable
errors that have occurred on this csrow. This count is very
important to examine. CEs provide early indications that a
DIMM is beginning to fail. This count field should be
monitored for non-zero values and report such information
to the system administrator.
- ``size_mb`` - Total memory managed by this csrow attribute file
This attribute file displays, in count of megabytes, the memory
that this csrow contains.
- ``mem_type`` - Memory Type attribute file
This attribute file will display what type of memory is currently
on this csrow. Normally, either buffered or unbuffered memory.
Examples:
- Registered-DDR
- Unbuffered-DDR
- ``edac_mode`` - EDAC Mode of operation attribute file
This attribute file will display what type of Error detection
and correction is being utilized.
- ``dev_type`` - Device type attribute file
This attribute file will display what type of DRAM device is
being utilized on this DIMM.
Examples:
- x1
- x2
- x4
- x8
- ``ch0_ce_count`` - Channel 0 CE Count attribute file
This attribute file will display the count of CEs on this
DIMM located in channel 0.
- ``ch0_ue_count`` - Channel 0 UE Count attribute file
This attribute file will display the count of UEs on this
DIMM located in channel 0.
- ``ch0_dimm_label`` - Channel 0 DIMM Label control file
This control file allows this DIMM to have a label assigned
to it. With this label in the module, when errors occur
the output can provide the DIMM label in the system log.
This becomes vital for panic events to isolate the
cause of the UE event.
DIMM Labels must be assigned after booting, with information
that correctly identifies the physical slot with its
silk screen label. This information is currently very
motherboard specific and determination of this information
must occur in userland at this time.
- ``ch1_ce_count`` - Channel 1 CE Count attribute file
This attribute file will display the count of CEs on this
DIMM located in channel 1.
- ``ch1_ue_count`` - Channel 1 UE Count attribute file
This attribute file will display the count of UEs on this
DIMM located in channel 0.
- ``ch1_dimm_label`` - Channel 1 DIMM Label control file
This control file allows this DIMM to have a label assigned
to it. With this label in the module, when errors occur
the output can provide the DIMM label in the system log.
This becomes vital for panic events to isolate the
cause of the UE event.
DIMM Labels must be assigned after booting, with information
that correctly identifies the physical slot with its
silk screen label. This information is currently very
motherboard specific and determination of this information
must occur in userland at this time.
System Logging
--------------

View File

@@ -6207,7 +6207,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
mba, smba, bmec, abmc.
mba, smba, bmec, abmc, sdciae.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
@@ -6500,6 +6500,10 @@
Memory area to be used by remote processor image,
managed by CMA.
rseq_debug= [KNL] Enable or disable restartable sequence
debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
Format: <bool>
rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling
when CONFIG_RT_GROUP_SCHED=y. Defaults to
!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.

View File

@@ -391,13 +391,13 @@ Before jumping into the kernel, the following conditions must be met:
- SMCR_EL2.LEN must be initialised to the same value for all CPUs the
kernel will execute on.
- HWFGRTR_EL2.nTPIDR2_EL0 (bit 55) must be initialised to 0b01.
- HFGRTR_EL2.nTPIDR2_EL0 (bit 55) must be initialised to 0b01.
- HWFGWTR_EL2.nTPIDR2_EL0 (bit 55) must be initialised to 0b01.
- HFGWTR_EL2.nTPIDR2_EL0 (bit 55) must be initialised to 0b01.
- HWFGRTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
- HFGRTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
- HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
- HFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):

View File

@@ -402,6 +402,11 @@ The regset data starts with struct user_sve_header, containing:
streaming mode and any SETREGSET of NT_ARM_SSVE will enter streaming mode
if the target was not in streaming mode.
* On systems that do not support SVE it is permitted to use SETREGSET to
write SVE_PT_REGS_FPSIMD formatted data via NT_ARM_SVE, in this case the
vector length should be specified as 0. This allows streaming mode to be
disabled on systems with SME but not SVE.
* If any register data is provided along with SVE_PT_VL_ONEXEC then the
registers data will be interpreted with the current vector length, not
the vector length configured for use on exec.

View File

@@ -243,9 +243,8 @@ Examples:
Changing the size of debug areas
------------------------------------
It is possible the change the size of debug areas through piping
the number of pages to the debugfs file "pages". The resize request will
also flush the debug areas.
To resize a debug area, write the desired page count to the "pages" file.
Existing data is preserved if it fits; otherwise, oldest entries are dropped.
Example:

View File

@@ -39,6 +39,9 @@ properties:
- amlogic,a4-gpio-ao-intc
- amlogic,a5-gpio-intc
- amlogic,c3-gpio-intc
- amlogic,s6-gpio-intc
- amlogic,s7-gpio-intc
- amlogic,s7d-gpio-intc
- amlogic,t7-gpio-intc
- const: amlogic,meson-gpio-intc

View File

@@ -25,13 +25,14 @@ properties:
interrupt-controller: true
'#interrupt-cells':
const: 2
const: 1
description:
The first cell is the IRQ number, the second cell is the trigger
type as defined in interrupt.txt in this directory.
interrupts:
maxItems: 6
minItems: 1
maxItems: 10
description: |
Depend to which INTC0 or INTC1 used.
INTC0 and INTC1 are two kinds of interrupt controller with enable and raw
@@ -74,13 +75,17 @@ examples:
interrupt-controller@12101b00 {
compatible = "aspeed,ast2700-intc-ic";
reg = <0 0x12101b00 0 0x10>;
#interrupt-cells = <2>;
#interrupt-cells = <1>;
interrupt-controller;
interrupts = <GIC_SPI 192 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 193 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 194 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 195 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 196 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 197 IRQ_TYPE_LEVEL_HIGH>;
<GIC_SPI 197 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 198 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 199 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 200 IRQ_TYPE_LEVEL_HIGH>,
<GIC_SPI 201 IRQ_TYPE_LEVEL_HIGH>;
};
};

View File

@@ -58,6 +58,7 @@ properties:
- const: andestech,nceplic100
- items:
- enum:
- anlogic,dr1v90-plic
- canaan,k210-plic
- eswin,eic7700-plic
- sifive,fu540-c000-plic
@@ -75,6 +76,9 @@ properties:
- sophgo,sg2044-plic
- thead,th1520-plic
- const: thead,c900-plic
- items:
- const: ultrarisc,dp1000-plic
- const: ultrarisc,cp100-plic
- items:
- const: sifive,plic-1.0.0
- const: riscv,plic0

View File

@@ -4,18 +4,23 @@
$id: http://devicetree.org/schemas/interrupt-controller/thead,c900-aclint-mswi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Sophgo sg2042 CLINT Machine-level Software Interrupt Device
title: ACLINT Machine-level Software Interrupt Device
maintainers:
- Inochi Amaoto <inochiama@outlook.com>
properties:
compatible:
items:
- enum:
- sophgo,sg2042-aclint-mswi
- sophgo,sg2044-aclint-mswi
- const: thead,c900-aclint-mswi
oneOf:
- items:
- enum:
- sophgo,sg2042-aclint-mswi
- sophgo,sg2044-aclint-mswi
- const: thead,c900-aclint-mswi
- items:
- enum:
- anlogic,dr1v90-aclint-mswi
- const: nuclei,ux900-aclint-mswi
reg:
maxItems: 1

View File

@@ -30,6 +30,10 @@ properties:
- const: thead,c900-aclint-sswi
- items:
- const: mips,p8700-aclint-sswi
- items:
- enum:
- anlogic,dr1v90-aclint-sswi
- const: nuclei,ux900-aclint-sswi
reg:
maxItems: 1

View File

@@ -14,6 +14,7 @@ properties:
oneOf:
- enum:
- fsl,imx8-ddr-pmu
- fsl,imx8dxl-db-pmu
- fsl,imx8m-ddr-pmu
- fsl,imx8mq-ddr-pmu
- fsl,imx8mm-ddr-pmu
@@ -28,7 +29,10 @@ properties:
- fsl,imx8mp-ddr-pmu
- const: fsl,imx8m-ddr-pmu
- items:
- const: fsl,imx8dxl-ddr-pmu
- enum:
- fsl,imx8dxl-ddr-pmu
- fsl,imx8qm-ddr-pmu
- fsl,imx8qxp-ddr-pmu
- const: fsl,imx8-ddr-pmu
- items:
- enum:
@@ -43,6 +47,14 @@ properties:
interrupts:
maxItems: 1
clocks:
maxItems: 2
clock-names:
items:
- const: ipg
- const: cnt
required:
- compatible
- reg
@@ -50,6 +62,21 @@ required:
additionalProperties: false
allOf:
- if:
properties:
compatible:
contains:
const: fsl,imx8dxl-db-pmu
then:
required:
- clocks
- clock-names
else:
properties:
clocks: false
clock-names: false
examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>

View File

@@ -0,0 +1,47 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/timer/realtek,rtd1625-systimer.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Realtek System Timer
maintainers:
- Hao-Wen Ting <haowen.ting@realtek.com>
description:
The Realtek SYSTIMER (System Timer) is a 64-bit global hardware counter operating
at a fixed 1MHz frequency. Thanks to its compare match interrupt capability,
the timer natively supports oneshot mode for tick broadcast functionality.
properties:
compatible:
oneOf:
- const: realtek,rtd1625-systimer
- items:
- const: realtek,rtd1635-systimer
- const: realtek,rtd1625-systimer
reg:
maxItems: 1
interrupts:
maxItems: 1
required:
- compatible
- reg
- interrupts
additionalProperties: false
examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
timer@89420 {
compatible = "realtek,rtd1635-systimer",
"realtek,rtd1625-systimer";
reg = <0x89420 0x18>;
interrupts = <GIC_SPI 112 IRQ_TYPE_LEVEL_HIGH>;
};

View File

@@ -1705,6 +1705,8 @@ patternProperties:
description: Universal Scientific Industrial Co., Ltd.
"^usr,.*":
description: U.S. Robotics Corporation
"^ultrarisc,.*":
description: UltraRISC Technology Co., Ltd.
"^ultratronik,.*":
description: Ultratronik GmbH
"^utoo,.*":

View File

@@ -17,17 +17,18 @@ AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).
This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
flag bits:
=============================================== ================================
RDT (Resource Director Technology) Allocation "rdt_a"
CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
SMBA (Slow Memory Bandwidth Allocation) ""
BMEC (Bandwidth Monitoring Event Configuration) ""
ABMC (Assignable Bandwidth Monitoring Counters) ""
=============================================== ================================
=============================================================== ================================
RDT (Resource Director Technology) Allocation "rdt_a"
CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
SMBA (Slow Memory Bandwidth Allocation) ""
BMEC (Bandwidth Monitoring Event Configuration) ""
ABMC (Assignable Bandwidth Monitoring Counters) ""
SDCIAE (Smart Data Cache Injection Allocation Enforcement) ""
=============================================================== ================================
Historically, new features were made visible by default in /proc/cpuinfo. This
resulted in the feature flags becoming hard to parse by humans. Adding a new
@@ -72,6 +73,11 @@ The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
Most of the files in the resource's subdirectory are read-only, and
describe properties of the resource. Resources that support global
configuration options also include writable files that can be used
to modify those settings.
Each subdirectory contains the following files with respect to
allocation:
@@ -90,12 +96,19 @@ related to allocation:
must be set when writing a mask.
"shareable_bits":
Bitmask of shareable resource with other executing
entities (e.g. I/O). User can use this when
setting up exclusive cache partitions. Note that
some platforms support devices that have their
own settings for cache use which can over-ride
these bits.
Bitmask of shareable resource with other executing entities
(e.g. I/O). Applies to all instances of this resource. User
can use this when setting up exclusive cache partitions.
Note that some platforms support devices that have their
own settings for cache use which can over-ride these bits.
When "io_alloc" is enabled, a portion of each cache instance can
be configured for shared use between hardware and software.
"bit_usage" should be used to see which portions of each cache
instance is configured for hardware use via "io_alloc" feature
because every cache instance can have its "io_alloc" bitmask
configured independently via "io_alloc_cbm".
"bit_usage":
Annotated capacity bitmasks showing how all
instances of the resource are used. The legend is:
@@ -109,16 +122,16 @@ related to allocation:
"H":
Corresponding region is used by hardware only
but available for software use. If a resource
has bits set in "shareable_bits" but not all
of these bits appear in the resource groups'
schematas then the bits appearing in
"shareable_bits" but no resource group will
be marked as "H".
has bits set in "shareable_bits" or "io_alloc_cbm"
but not all of these bits appear in the resource
groups' schemata then the bits appearing in
"shareable_bits" or "io_alloc_cbm" but no
resource group will be marked as "H".
"X":
Corresponding region is available for sharing and
used by hardware and software. These are the
bits that appear in "shareable_bits" as
well as a resource group's allocation.
used by hardware and software. These are the bits
that appear in "shareable_bits" or "io_alloc_cbm"
as well as a resource group's allocation.
"S":
Corresponding region is used by software
and available for sharing.
@@ -136,6 +149,77 @@ related to allocation:
"1":
Non-contiguous 1s value in CBM is supported.
"io_alloc":
"io_alloc" enables system software to configure the portion of
the cache allocated for I/O traffic. File may only exist if the
system supports this feature on some of its cache resources.
"disabled":
Resource supports "io_alloc" but the feature is disabled.
Portions of cache used for allocation of I/O traffic cannot
be configured.
"enabled":
Portions of cache used for allocation of I/O traffic
can be configured using "io_alloc_cbm".
"not supported":
Support not available for this resource.
The feature can be modified by writing to the interface, for example:
To enable::
# echo 1 > /sys/fs/resctrl/info/L3/io_alloc
To disable::
# echo 0 > /sys/fs/resctrl/info/L3/io_alloc
The underlying implementation may reduce resources available to
general (CPU) cache allocation. See architecture specific notes
below. Depending on usage requirements the feature can be enabled
or disabled.
On AMD systems, io_alloc feature is supported by the L3 Smart
Data Cache Injection Allocation Enforcement (SDCIAE). The CLOSID for
io_alloc is the highest CLOSID supported by the resource. When
io_alloc is enabled, the highest CLOSID is dedicated to io_alloc and
no longer available for general (CPU) cache allocation. When CDP is
enabled, io_alloc routes I/O traffic using the highest CLOSID allocated
for the instruction cache (CDP_CODE), making this CLOSID no longer
available for general (CPU) cache allocation for both the CDP_CODE
and CDP_DATA resources.
"io_alloc_cbm":
Capacity bitmasks that describe the portions of cache instances to
which I/O traffic from supported I/O devices are routed when "io_alloc"
is enabled.
CBMs are displayed in the following format:
<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
Example::
# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
0=ffff;1=ffff
CBMs can be configured by writing to the interface.
Example::
# echo 1=ff > /sys/fs/resctrl/info/L3/io_alloc_cbm
# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
0=ffff;1=00ff
# echo "0=ff;1=f" > /sys/fs/resctrl/info/L3/io_alloc_cbm
# cat /sys/fs/resctrl/info/L3/io_alloc_cbm
0=00ff;1=000f
When CDP is enabled "io_alloc_cbm" associated with the CDP_DATA and CDP_CODE
resources may reflect the same values. For example, values read from and
written to /sys/fs/resctrl/info/L3DATA/io_alloc_cbm may be reflected by
/sys/fs/resctrl/info/L3CODE/io_alloc_cbm and vice versa.
Memory bandwidth(MB) subdirectory contains the following files
with respect to allocation:

View File

@@ -17470,6 +17470,16 @@ S: Maintained
F: Documentation/devicetree/bindings/leds/backlight/mps,mp3309c.yaml
F: drivers/video/backlight/mp3309c.c
MPAM DRIVER
M: James Morse <james.morse@arm.com>
M: Ben Horgan <ben.horgan@arm.com>
R: Reinette Chatre <reinette.chatre@intel.com>
R: Fenghua Yu <fenghuay@nvidia.com>
S: Maintained
F: drivers/resctrl/mpam_*
F: drivers/resctrl/test_mpam_*
F: include/linux/arm_mpam.h
MPS MP2869 DRIVER
M: Wensheng Wang <wenswang@yeah.net>
L: linux-hwmon@vger.kernel.org
@@ -21681,6 +21691,11 @@ S: Maintained
F: Documentation/devicetree/bindings/spi/realtek,rtl9301-snand.yaml
F: drivers/spi/spi-realtek-rtl-snand.c
REALTEK SYSTIMER DRIVER
M: Hao-Wen Ting <haowen.ting@realtek.com>
S: Maintained
F: drivers/clocksource/timer-realtek.c
REALTEK WIRELESS DRIVER (rtlwifi family)
M: Ping-Ke Shih <pkshih@realtek.com>
L: linux-wireless@vger.kernel.org

View File

@@ -283,10 +283,17 @@ extern int __put_user_8(void *, unsigned long long);
__gu_err; \
})
/*
* This is a type: either unsigned long, if the argument fits into
* that type, or otherwise unsigned long long.
*/
#define __long_type(x) \
__typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))
#define __get_user_err(x, ptr, err, __t) \
do { \
unsigned long __gu_addr = (unsigned long)(ptr); \
unsigned long __gu_val; \
__long_type(x) __gu_val; \
unsigned int __ua_flags; \
__chk_user_ptr(ptr); \
might_fault(); \
@@ -295,6 +302,7 @@ do { \
case 1: __get_user_asm_byte(__gu_val, __gu_addr, err, __t); break; \
case 2: __get_user_asm_half(__gu_val, __gu_addr, err, __t); break; \
case 4: __get_user_asm_word(__gu_val, __gu_addr, err, __t); break; \
case 8: __get_user_asm_dword(__gu_val, __gu_addr, err, __t); break; \
default: (__gu_val) = __get_user_bad(); \
} \
uaccess_restore(__ua_flags); \
@@ -353,6 +361,22 @@ do { \
#define __get_user_asm_word(x, addr, err, __t) \
__get_user_asm(x, addr, err, "ldr" __t)
#ifdef __ARMEB__
#define __WORD0_OFFS 4
#define __WORD1_OFFS 0
#else
#define __WORD0_OFFS 0
#define __WORD1_OFFS 4
#endif
#define __get_user_asm_dword(x, addr, err, __t) \
({ \
unsigned long __w0, __w1; \
__get_user_asm(__w0, addr + __WORD0_OFFS, err, "ldr" __t); \
__get_user_asm(__w1, addr + __WORD1_OFFS, err, "ldr" __t); \
(x) = ((u64)__w1 << 32) | (u64) __w0; \
})
#define __put_user_switch(x, ptr, __err, __fn) \
do { \
const __typeof__(*(ptr)) __user *__pu_ptr = (ptr); \

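For context, a hedged usage sketch (the function and its purpose are
invented): with the 8-byte case wired into __get_user_err(), 32-bit ARM
code can fetch a 64-bit value from user space with a single __get_user()
after the usual access_ok() check, instead of falling back to
copy_from_user():

#include <linux/uaccess.h>

/* Invented example: read a user-supplied 64-bit sequence number. */
static int read_user_seqno(const u64 __user *uptr, u64 *out)
{
	u64 val;

	if (!access_ok(uptr, sizeof(*uptr)))
		return -EFAULT;

	if (__get_user(val, uptr))	/* 8-byte load added by the hunk above */
		return -EFAULT;

	*out = val;
	return 0;
}
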
View File

@@ -47,7 +47,6 @@ config ARM64
select ARCH_HAS_SETUP_DMA_OPS
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_MEM_ENCRYPT
select ARCH_HAS_FORCE_DMA_UNENCRYPTED
select ARCH_STACKWALK
select ARCH_HAS_STRICT_KERNEL_RWX
@@ -2023,6 +2022,31 @@ config ARM64_TLB_RANGE
ARMv8.4-TLBI provides TLBI invalidation instruction that apply to a
range of input addresses.
config ARM64_MPAM
bool "Enable support for MPAM"
select ARM64_MPAM_DRIVER if EXPERT # does nothing yet
select ACPI_MPAM if ACPI
help
Memory System Resource Partitioning and Monitoring (MPAM) is an
optional extension to the Arm architecture that allows each
transaction issued to the memory system to be labelled with a
Partition identifier (PARTID) and Performance Monitoring Group
identifier (PMG).
Memory system components, such as the caches, can be configured with
policies to control how much of various physical resources (such as
memory bandwidth or cache memory) the transactions labelled with each
PARTID can consume. Depending on the capabilities of the hardware,
the PARTID and PMG can also be used as filtering criteria to measure
the memory system resource consumption of different parts of a
workload.
Use of this extension requires CPU support, support in the
Memory System Components (MSC), and a description from firmware
of where the MSCs are in the address space.
MPAM is exposed to user-space via the resctrl pseudo filesystem.
endmenu # "ARMv8.4 architectural features"
menu "ARMv8.5 architectural features"

View File

@@ -19,7 +19,7 @@
#error "cpucaps have overflown ARM64_CB_BIT"
#endif
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/stringify.h>
@@ -207,7 +207,7 @@ alternative_endif
#define _ALTERNATIVE_CFG(insn1, insn2, cap, cfg, ...) \
alternative_insn insn1, insn2, cap, IS_ENABLED(cfg)
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
/*
* Usage: asm(ALTERNATIVE(oldinstr, newinstr, cpucap));
@@ -219,7 +219,7 @@ alternative_endif
#define ALTERNATIVE(oldinstr, newinstr, ...) \
_ALTERNATIVE_CFG(oldinstr, newinstr, __VA_ARGS__, 1)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/types.h>
@@ -263,6 +263,6 @@ l_yes:
return true;
}
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_ALTERNATIVE_MACROS_H */

View File

@@ -4,7 +4,7 @@
#include <asm/alternative-macros.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/init.h>
#include <linux/types.h>
@@ -37,5 +37,5 @@ static inline int apply_alternatives_module(void *start, size_t length)
void alt_cb_patch_nops(struct alt_instr *alt, __le32 *origptr,
__le32 *updptr, int nr_inst);
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_ALTERNATIVE_H */

View File

@@ -9,7 +9,7 @@
#include <asm/sysreg.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/irqchip/arm-gic-common.h>
#include <linux/stringify.h>
@@ -188,5 +188,5 @@ static inline bool gic_has_relaxed_pmr_sync(void)
return cpus_have_cap(ARM64_HAS_GIC_PRIO_RELAXED_SYNC);
}
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_ARCH_GICV3_H */

View File

@@ -27,7 +27,7 @@
/* Data fields for EX_TYPE_UACCESS_CPY */
#define EX_DATA_UACCESS_WRITE BIT(0)
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
#define __ASM_EXTABLE_RAW(insn, fixup, type, data) \
.pushsection __ex_table, "a"; \
@@ -77,7 +77,7 @@
__ASM_EXTABLE_RAW(\insn, \fixup, EX_TYPE_UACCESS_CPY, \uaccess_is_write)
.endm
#else /* __ASSEMBLY__ */
#else /* __ASSEMBLER__ */
#include <linux/stringify.h>
@@ -132,6 +132,6 @@
EX_DATA_REG(ADDR, addr) \
")")
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_ASM_EXTABLE_H */

View File

@@ -5,7 +5,7 @@
* Copyright (C) 1996-2000 Russell King
* Copyright (C) 2012 ARM Ltd.
*/
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#error "Only include this from assembly code"
#endif
@@ -325,14 +325,14 @@ alternative_cb_end
* tcr_set_t0sz - update TCR.T0SZ so that we can load the ID map
*/
.macro tcr_set_t0sz, valreg, t0sz
bfi \valreg, \t0sz, #TCR_T0SZ_OFFSET, #TCR_TxSZ_WIDTH
bfi \valreg, \t0sz, #TCR_EL1_T0SZ_SHIFT, #TCR_EL1_T0SZ_WIDTH
.endm
/*
* tcr_set_t1sz - update TCR.T1SZ
*/
.macro tcr_set_t1sz, valreg, t1sz
bfi \valreg, \t1sz, #TCR_T1SZ_OFFSET, #TCR_TxSZ_WIDTH
bfi \valreg, \t1sz, #TCR_EL1_T1SZ_SHIFT, #TCR_EL1_T1SZ_WIDTH
.endm
/*
@@ -371,7 +371,7 @@ alternative_endif
* [start, end) with dcache line size explicitly provided.
*
* op: operation passed to dc instruction
* domain: domain used in dsb instruciton
* domain: domain used in dsb instruction
* start: starting virtual address of the region
* end: end virtual address of the region
* linesz: dcache line size
@@ -412,7 +412,7 @@ alternative_endif
* [start, end)
*
* op: operation passed to dc instruction
* domain: domain used in dsb instruciton
* domain: domain used in dsb instruction
* start: starting virtual address of the region
* end: end virtual address of the region
* fixup: optional label to branch to on user fault
@@ -589,7 +589,7 @@ alternative_endif
.macro offset_ttbr1, ttbr, tmp
#if defined(CONFIG_ARM64_VA_BITS_52) && !defined(CONFIG_ARM64_LPA2)
mrs \tmp, tcr_el1
and \tmp, \tmp, #TCR_T1SZ_MASK
and \tmp, \tmp, #TCR_EL1_T1SZ_MASK
cmp \tmp, #TCR_T1SZ(VA_BITS_MIN)
orr \tmp, \ttbr, #TTBR1_BADDR_4852_OFFSET
csel \ttbr, \tmp, \ttbr, eq

View File

@@ -103,17 +103,17 @@ static __always_inline void __lse_atomic_and(int i, atomic_t *v)
return __lse_atomic_andnot(~i, v);
}
#define ATOMIC_FETCH_OP_AND(name, mb, cl...) \
#define ATOMIC_FETCH_OP_AND(name) \
static __always_inline int \
__lse_atomic_fetch_and##name(int i, atomic_t *v) \
{ \
return __lse_atomic_fetch_andnot##name(~i, v); \
}
ATOMIC_FETCH_OP_AND(_relaxed, )
ATOMIC_FETCH_OP_AND(_acquire, a, "memory")
ATOMIC_FETCH_OP_AND(_release, l, "memory")
ATOMIC_FETCH_OP_AND( , al, "memory")
ATOMIC_FETCH_OP_AND(_relaxed)
ATOMIC_FETCH_OP_AND(_acquire)
ATOMIC_FETCH_OP_AND(_release)
ATOMIC_FETCH_OP_AND( )
#undef ATOMIC_FETCH_OP_AND
@@ -210,17 +210,17 @@ static __always_inline void __lse_atomic64_and(s64 i, atomic64_t *v)
return __lse_atomic64_andnot(~i, v);
}
#define ATOMIC64_FETCH_OP_AND(name, mb, cl...) \
#define ATOMIC64_FETCH_OP_AND(name) \
static __always_inline long \
__lse_atomic64_fetch_and##name(s64 i, atomic64_t *v) \
{ \
return __lse_atomic64_fetch_andnot##name(~i, v); \
}
ATOMIC64_FETCH_OP_AND(_relaxed, )
ATOMIC64_FETCH_OP_AND(_acquire, a, "memory")
ATOMIC64_FETCH_OP_AND(_release, l, "memory")
ATOMIC64_FETCH_OP_AND( , al, "memory")
ATOMIC64_FETCH_OP_AND(_relaxed)
ATOMIC64_FETCH_OP_AND(_acquire)
ATOMIC64_FETCH_OP_AND(_release)
ATOMIC64_FETCH_OP_AND( )
#undef ATOMIC64_FETCH_OP_AND

View File

@@ -7,7 +7,7 @@
#ifndef __ASM_BARRIER_H
#define __ASM_BARRIER_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/kasan-checks.h>
@@ -221,6 +221,6 @@ do { \
#include <asm-generic/barrier.h>
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_BARRIER_H */

View File

@@ -35,7 +35,7 @@
#define ARCH_DMA_MINALIGN (128)
#define ARCH_KMALLOC_MINALIGN (8)
#if !defined(__ASSEMBLY__) && !defined(BUILD_VDSO)
#if !defined(__ASSEMBLER__) && !defined(BUILD_VDSO)
#include <linux/bitops.h>
#include <linux/kasan-enabled.h>
@@ -135,6 +135,6 @@ static inline u32 __attribute_const__ read_cpuid_effective_cachetype(void)
return ctr;
}
#endif /* !defined(__ASSEMBLY__) && !defined(BUILD_VDSO) */
#endif /* !defined(__ASSEMBLER__) && !defined(BUILD_VDSO) */
#endif

View File

@@ -5,7 +5,7 @@
#include <asm/cpucap-defs.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/types.h>
/*
* Check whether a cpucap is possible at compiletime.
@@ -77,6 +77,6 @@ cpucap_is_possible(const unsigned int cap)
return true;
}
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_CPUCAPS_H */

View File

@@ -19,7 +19,7 @@
#define ARM64_SW_FEATURE_OVERRIDE_HVHE 4
#define ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF 8
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/bug.h>
#include <linux/jump_label.h>
@@ -199,7 +199,7 @@ extern struct arm64_ftr_reg arm64_ftr_reg_ctrel0;
* registers (e.g, SCTLR, TCR etc.) or patching the kernel via
* alternatives. The kernel patching is batched and performed at later
* point. The actions are always initiated only after the capability
* is finalised. This is usally denoted by "enabling" the capability.
* is finalised. This is usually denoted by "enabling" the capability.
* The actions are initiated as follows :
* a) Action is triggered on all online CPUs, after the capability is
* finalised, invoked within the stop_machine() context from
@@ -251,7 +251,7 @@ extern struct arm64_ftr_reg arm64_ftr_reg_ctrel0;
#define ARM64_CPUCAP_SCOPE_LOCAL_CPU ((u16)BIT(0))
#define ARM64_CPUCAP_SCOPE_SYSTEM ((u16)BIT(1))
/*
* The capabilitiy is detected on the Boot CPU and is used by kernel
* The capability is detected on the Boot CPU and is used by kernel
* during early boot. i.e, the capability should be "detected" and
* "enabled" as early as possibly on all booting CPUs.
*/
@@ -1078,6 +1078,6 @@ static inline bool cpu_has_lpa2(void)
#endif
}
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif

View File

@@ -247,9 +247,9 @@
/* Fujitsu Erratum 010001 affects A64FX 1.0 and 1.1, (v0r0 and v1r0) */
#define MIDR_FUJITSU_ERRATUM_010001 MIDR_FUJITSU_A64FX
#define MIDR_FUJITSU_ERRATUM_010001_MASK (~MIDR_CPU_VAR_REV(1, 0))
#define TCR_CLEAR_FUJITSU_ERRATUM_010001 (TCR_NFD1 | TCR_NFD0)
#define TCR_CLEAR_FUJITSU_ERRATUM_010001 (TCR_EL1_NFD1 | TCR_EL1_NFD0)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/sysreg.h>
@@ -328,6 +328,6 @@ static inline u32 __attribute_const__ read_cpuid_cachetype(void)
{
return read_cpuid(CTR_EL0);
}
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif

View File

@@ -4,7 +4,7 @@
#include <linux/compiler.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
struct task_struct;
@@ -23,7 +23,7 @@ static __always_inline struct task_struct *get_current(void)
#define current get_current()
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_CURRENT_H */

View File

@@ -48,7 +48,7 @@
#define AARCH32_BREAK_THUMB2_LO 0xf7f0
#define AARCH32_BREAK_THUMB2_HI 0xa000
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
struct task_struct;
#define DBG_ARCH_ID_RESERVED 0 /* In case of ptrace ABI updates. */
@@ -88,5 +88,5 @@ static inline bool try_step_suspended_breakpoints(struct pt_regs *regs)
bool try_handle_aarch32_break(struct pt_regs *regs);
#endif /* __ASSEMBLY */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_DEBUG_MONITORS_H */

View File

@@ -126,21 +126,14 @@ static inline void efi_set_pgd(struct mm_struct *mm)
if (mm != current->active_mm) {
/*
* Update the current thread's saved ttbr0 since it is
* restored as part of a return from exception. Enable
* access to the valid TTBR0_EL1 and invoke the errata
* workaround directly since there is no return from
* exception when invoking the EFI run-time services.
* restored as part of a return from exception.
*/
update_saved_ttbr0(current, mm);
uaccess_ttbr0_enable();
post_ttbr_update_workaround();
} else {
/*
* Defer the switch to the current thread's TTBR0_EL1
* until uaccess_enable(). Restore the current
* thread's saved ttbr0 corresponding to its active_mm
* Restore the current thread's saved ttbr0
* corresponding to its active_mm
*/
uaccess_ttbr0_disable();
update_saved_ttbr0(current, current->active_mm);
}
}

View File

@@ -7,7 +7,7 @@
#ifndef __ARM_KVM_INIT_H__
#define __ARM_KVM_INIT_H__
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#error Assembly-only header
#endif
@@ -24,7 +24,7 @@
* ID_AA64MMFR4_EL1.E2H0 < 0. On such CPUs HCR_EL2.E2H is RES1, but it
* can reset into an UNKNOWN state and might not read as 1 until it has
* been initialized explicitly.
* Initalize HCR_EL2.E2H so that later code can rely upon HCR_EL2.E2H
* Initialize HCR_EL2.E2H so that later code can rely upon HCR_EL2.E2H
* indicating whether the CPU is running in E2H mode.
*/
mrs_s x1, SYS_ID_AA64MMFR4_EL1

View File

@@ -133,7 +133,7 @@
#define ELF_ET_DYN_BASE (2 * DEFAULT_MAP_WINDOW_64 / 3)
#endif /* CONFIG_ARM64_FORCE_52BIT */
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <uapi/linux/elf.h>
#include <linux/bug.h>
@@ -293,6 +293,6 @@ static inline int arch_check_elf(void *ehdr, bool has_interp,
return 0;
}
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif

View File

@@ -431,7 +431,7 @@
#define ESR_ELx_IT_GCSPOPCX 6
#define ESR_ELx_IT_GCSPOPX 7
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/types.h>
static inline unsigned long esr_brk_comment(unsigned long esr)
@@ -534,6 +534,6 @@ static inline bool esr_iss_is_eretab(unsigned long esr)
}
const char *esr_get_class_string(unsigned long esr);
#endif /* __ASSEMBLY */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_ESR_H */

View File

@@ -15,7 +15,7 @@
#ifndef _ASM_ARM64_FIXMAP_H
#define _ASM_ARM64_FIXMAP_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/kernel.h>
#include <linux/math.h>
#include <linux/sizes.h>
@@ -117,5 +117,5 @@ extern void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t pr
#include <asm-generic/fixmap.h>
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif /* _ASM_ARM64_FIXMAP_H */

View File

@@ -12,7 +12,7 @@
#include <asm/sigcontext.h>
#include <asm/sysreg.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/bitmap.h>
#include <linux/build_bug.h>

View File

@@ -37,7 +37,7 @@
*/
#define ARCH_FTRACE_SHIFT_STACK_TRACER 1
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/compat.h>
extern void _mcount(unsigned long);
@@ -217,9 +217,9 @@ static inline bool arch_syscall_match_sym_name(const char *sym,
*/
return !strcmp(sym + 8, name);
}
#endif /* ifndef __ASSEMBLY__ */
#endif /* ifndef __ASSEMBLER__ */
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
void prepare_ftrace_return(unsigned long self_addr, unsigned long *parent,

View File

@@ -2,7 +2,7 @@
#ifndef __ASM_GPR_NUM_H
#define __ASM_GPR_NUM_H
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
.irp num,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
.equ .L__gpr_num_x\num, \num
@@ -11,7 +11,7 @@
.equ .L__gpr_num_xzr, 31
.equ .L__gpr_num_wzr, 31
#else /* __ASSEMBLY__ */
#else /* __ASSEMBLER__ */
#define __DEFINE_ASM_GPR_NUMS \
" .irp num,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30\n" \
@@ -21,6 +21,6 @@
" .equ .L__gpr_num_xzr, 31\n" \
" .equ .L__gpr_num_wzr, 31\n"
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_GPR_NUM_H */

View File

@@ -46,7 +46,7 @@
#define COMPAT_HWCAP2_SB (1 << 5)
#define COMPAT_HWCAP2_SSBS (1 << 6)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/log2.h>
/*

View File

@@ -20,7 +20,7 @@
#define ARM64_IMAGE_FLAG_PAGE_SIZE_64K 3
#define ARM64_IMAGE_FLAG_PHYS_BASE 1
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#define arm64_image_flag_field(flags, field) \
(((flags) >> field##_SHIFT) & field##_MASK)
@@ -54,6 +54,6 @@ struct arm64_image_header {
__le32 res5;
};
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_IMAGE_H */

View File

@@ -12,7 +12,7 @@
#include <asm/insn-def.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
enum aarch64_insn_hint_cr_op {
AARCH64_INSN_HINT_NOP = 0x0 << 5,
@@ -730,6 +730,6 @@ u32 aarch32_insn_mcr_extract_crm(u32 insn);
typedef bool (pstate_check_t)(unsigned long);
extern pstate_check_t * const aarch32_opcode_cond_checks[16];
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_INSN_H */

View File

@@ -8,7 +8,7 @@
#ifndef __ASM_JUMP_LABEL_H
#define __ASM_JUMP_LABEL_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/types.h>
#include <asm/insn.h>
@@ -58,5 +58,5 @@ l_yes:
return true;
}
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_JUMP_LABEL_H */

View File

@@ -2,7 +2,7 @@
#ifndef __ASM_KASAN_H
#define __ASM_KASAN_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/linkage.h>
#include <asm/memory.h>

View File

@@ -25,7 +25,7 @@
#define KEXEC_ARCH KEXEC_ARCH_AARCH64
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
/**
* crash_setup_regs() - save registers for the panic kernel
@@ -130,6 +130,6 @@ extern int load_other_segments(struct kimage *image,
char *cmdline);
#endif
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif

View File

@@ -14,7 +14,7 @@
#include <linux/ptrace.h>
#include <asm/debug-monitors.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
static inline void arch_kgdb_breakpoint(void)
{
@@ -36,7 +36,7 @@ static inline int kgdb_single_step_handler(struct pt_regs *regs,
}
#endif
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
/*
* gdb remote procotol (well most versions of it) expects the following

View File

@@ -46,7 +46,7 @@
#define __KVM_HOST_SMCCC_FUNC___kvm_hyp_init 0
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/mm.h>
@@ -303,7 +303,7 @@ void kvm_compute_final_ctr_el0(struct alt_instr *alt,
void __noreturn __cold nvhe_hyp_panic_handler(u64 esr, u64 spsr, u64 elr_virt,
u64 elr_phys, u64 par, uintptr_t vcpu, u64 far, u64 hpfar);
#else /* __ASSEMBLY__ */
#else /* __ASSEMBLER__ */
.macro get_host_ctxt reg, tmp
adr_this_cpu \reg, kvm_host_data, \tmp

View File

@@ -49,7 +49,7 @@
* mappings, and none of this applies in that case.
*/
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
#include <asm/alternative.h>
@@ -396,5 +396,5 @@ void kvm_s2_ptdump_create_debugfs(struct kvm *kvm);
static inline void kvm_s2_ptdump_create_debugfs(struct kvm *kvm) {}
#endif /* CONFIG_PTDUMP_STAGE2_DEBUGFS */
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ARM64_KVM_MMU_H__ */

View File

@@ -5,7 +5,7 @@
#ifndef __ASM_KVM_MTE_H
#define __ASM_KVM_MTE_H
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
#include <asm/sysreg.h>
@@ -62,5 +62,5 @@ alternative_else_nop_endif
.endm
#endif /* CONFIG_ARM64_MTE */
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_KVM_MTE_H */

View File

@@ -8,7 +8,7 @@
#ifndef __ASM_KVM_PTRAUTH_H
#define __ASM_KVM_PTRAUTH_H
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
#include <asm/sysreg.h>
@@ -100,7 +100,7 @@ alternative_else_nop_endif
.endm
#endif /* CONFIG_ARM64_PTR_AUTH */
#else /* !__ASSEMBLY */
#else /* !__ASSEMBLER__ */
#define __ptrauth_save_key(ctxt, key) \
do { \
@@ -120,5 +120,5 @@ alternative_else_nop_endif
__ptrauth_save_key(ctxt, APGA); \
} while(0)
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_KVM_PTRAUTH_H */

View File

@@ -1,7 +1,7 @@
#ifndef __ASM_LINKAGE_H
#define __ASM_LINKAGE_H
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
#include <asm/assembler.h>
#endif

View File

@@ -207,7 +207,7 @@
*/
#define TRAMP_SWAPPER_OFFSET (2 * PAGE_SIZE)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/bitops.h>
#include <linux/compiler.h>
@@ -392,7 +392,6 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
* virt_to_page(x) convert a _valid_ virtual address to struct page *
* virt_addr_valid(x) indicates whether a virtual address is valid
*/
#define ARCH_PFN_OFFSET ((unsigned long)PHYS_PFN_OFFSET)
#if defined(CONFIG_DEBUG_VIRTUAL)
#define page_to_virt(x) ({ \
@@ -422,7 +421,7 @@ static inline unsigned long virt_to_pfn(const void *kaddr)
})
void dump_mem_limit(void);
#endif /* !ASSEMBLY */
#endif /* !__ASSEMBLER__ */
/*
* Given that the GIC architecture permits ITS implementations that can only be

View File

@@ -12,7 +12,7 @@
#define USER_ASID_FLAG (UL(1) << USER_ASID_BIT)
#define TTBR_ASID_MASK (UL(0xffff) << 48)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/refcount.h>
#include <asm/cpufeature.h>
@@ -112,5 +112,5 @@ void kpti_install_ng_mappings(void);
static inline void kpti_install_ng_mappings(void) {}
#endif
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif

View File

@@ -8,7 +8,7 @@
#ifndef __ASM_MMU_CONTEXT_H
#define __ASM_MMU_CONTEXT_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/compiler.h>
#include <linux/sched.h>
@@ -61,11 +61,6 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
cpu_do_switch_mm(virt_to_phys(pgd),mm);
}
/*
* TCR.T0SZ value to use when the ID map is active.
*/
#define idmap_t0sz TCR_T0SZ(IDMAP_VA_BITS)
/*
* Ensure TCR.T0SZ is set to the provided value.
*/
@@ -73,18 +68,15 @@ static inline void __cpu_set_tcr_t0sz(unsigned long t0sz)
{
unsigned long tcr = read_sysreg(tcr_el1);
if ((tcr & TCR_T0SZ_MASK) == t0sz)
if ((tcr & TCR_EL1_T0SZ_MASK) == t0sz)
return;
tcr &= ~TCR_T0SZ_MASK;
tcr &= ~TCR_EL1_T0SZ_MASK;
tcr |= t0sz;
write_sysreg(tcr, tcr_el1);
isb();
}
#define cpu_set_default_tcr_t0sz() __cpu_set_tcr_t0sz(TCR_T0SZ(vabits_actual))
#define cpu_set_idmap_tcr_t0sz() __cpu_set_tcr_t0sz(idmap_t0sz)
/*
* Remove the idmap from TTBR0_EL1 and install the pgd of the active mm.
*
@@ -103,7 +95,7 @@ static inline void cpu_uninstall_idmap(void)
cpu_set_reserved_ttbr0();
local_flush_tlb_all();
cpu_set_default_tcr_t0sz();
__cpu_set_tcr_t0sz(TCR_T0SZ(vabits_actual));
if (mm != &init_mm && !system_uses_ttbr0_pan())
cpu_switch_mm(mm->pgd, mm);
@@ -113,7 +105,7 @@ static inline void cpu_install_idmap(void)
{
cpu_set_reserved_ttbr0();
local_flush_tlb_all();
cpu_set_idmap_tcr_t0sz();
__cpu_set_tcr_t0sz(TCR_T0SZ(IDMAP_VA_BITS));
cpu_switch_mm(lm_alias(idmap_pg_dir), &init_mm);
}
@@ -330,6 +322,6 @@ static inline void deactivate_mm(struct task_struct *tsk,
#include <asm-generic/mmu_context.h>
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif /* !__ASM_MMU_CONTEXT_H */

View File

@@ -9,7 +9,7 @@
#include <asm/cputype.h>
#include <asm/mte-def.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/types.h>
@@ -259,6 +259,6 @@ static inline int mte_enable_kernel_store_only(void)
#endif /* CONFIG_ARM64_MTE */
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_MTE_KASAN_H */

View File

@@ -8,7 +8,7 @@
#include <asm/compiler.h>
#include <asm/mte-def.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/bitfield.h>
#include <linux/kasan-enabled.h>
@@ -282,5 +282,5 @@ static inline void mte_check_tfsr_exit(void)
}
#endif /* CONFIG_KASAN_HW_TAGS */
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_MTE_H */

View File

@@ -10,7 +10,7 @@
#include <asm/page-def.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/personality.h> /* for READ_IMPLIES_EXEC */
#include <linux/types.h> /* for gfp_t */
@@ -45,7 +45,7 @@ int pfn_is_map_memory(unsigned long pfn);
#include <asm/memory.h>
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#define VM_DATA_DEFAULT_FLAGS (VM_DATA_FLAGS_TSK_EXEC | VM_MTE_ALLOWED)

View File

@@ -228,102 +228,53 @@
/*
* TCR flags.
*/
#define TCR_T0SZ_OFFSET 0
#define TCR_T1SZ_OFFSET 16
#define TCR_T0SZ(x) ((UL(64) - (x)) << TCR_T0SZ_OFFSET)
#define TCR_T1SZ(x) ((UL(64) - (x)) << TCR_T1SZ_OFFSET)
#define TCR_TxSZ(x) (TCR_T0SZ(x) | TCR_T1SZ(x))
#define TCR_TxSZ_WIDTH 6
#define TCR_T0SZ_MASK (((UL(1) << TCR_TxSZ_WIDTH) - 1) << TCR_T0SZ_OFFSET)
#define TCR_T1SZ_MASK (((UL(1) << TCR_TxSZ_WIDTH) - 1) << TCR_T1SZ_OFFSET)
#define TCR_T0SZ(x) ((UL(64) - (x)) << TCR_EL1_T0SZ_SHIFT)
#define TCR_T1SZ(x) ((UL(64) - (x)) << TCR_EL1_T1SZ_SHIFT)
#define TCR_EPD0_SHIFT 7
#define TCR_EPD0_MASK (UL(1) << TCR_EPD0_SHIFT)
#define TCR_IRGN0_SHIFT 8
#define TCR_IRGN0_MASK (UL(3) << TCR_IRGN0_SHIFT)
#define TCR_IRGN0_NC (UL(0) << TCR_IRGN0_SHIFT)
#define TCR_IRGN0_WBWA (UL(1) << TCR_IRGN0_SHIFT)
#define TCR_IRGN0_WT (UL(2) << TCR_IRGN0_SHIFT)
#define TCR_IRGN0_WBnWA (UL(3) << TCR_IRGN0_SHIFT)
#define TCR_T0SZ_MASK TCR_EL1_T0SZ_MASK
#define TCR_T1SZ_MASK TCR_EL1_T1SZ_MASK
#define TCR_EPD1_SHIFT 23
#define TCR_EPD1_MASK (UL(1) << TCR_EPD1_SHIFT)
#define TCR_IRGN1_SHIFT 24
#define TCR_IRGN1_MASK (UL(3) << TCR_IRGN1_SHIFT)
#define TCR_IRGN1_NC (UL(0) << TCR_IRGN1_SHIFT)
#define TCR_IRGN1_WBWA (UL(1) << TCR_IRGN1_SHIFT)
#define TCR_IRGN1_WT (UL(2) << TCR_IRGN1_SHIFT)
#define TCR_IRGN1_WBnWA (UL(3) << TCR_IRGN1_SHIFT)
#define TCR_EPD0_MASK TCR_EL1_EPD0_MASK
#define TCR_EPD1_MASK TCR_EL1_EPD1_MASK
#define TCR_IRGN_NC (TCR_IRGN0_NC | TCR_IRGN1_NC)
#define TCR_IRGN_WBWA (TCR_IRGN0_WBWA | TCR_IRGN1_WBWA)
#define TCR_IRGN_WT (TCR_IRGN0_WT | TCR_IRGN1_WT)
#define TCR_IRGN_WBnWA (TCR_IRGN0_WBnWA | TCR_IRGN1_WBnWA)
#define TCR_IRGN_MASK (TCR_IRGN0_MASK | TCR_IRGN1_MASK)
#define TCR_IRGN0_MASK TCR_EL1_IRGN0_MASK
#define TCR_IRGN0_WBWA (TCR_EL1_IRGN0_WBWA << TCR_EL1_IRGN0_SHIFT)
#define TCR_ORGN0_MASK TCR_EL1_ORGN0_MASK
#define TCR_ORGN0_WBWA (TCR_EL1_ORGN0_WBWA << TCR_EL1_ORGN0_SHIFT)
#define TCR_ORGN0_SHIFT 10
#define TCR_ORGN0_MASK (UL(3) << TCR_ORGN0_SHIFT)
#define TCR_ORGN0_NC (UL(0) << TCR_ORGN0_SHIFT)
#define TCR_ORGN0_WBWA (UL(1) << TCR_ORGN0_SHIFT)
#define TCR_ORGN0_WT (UL(2) << TCR_ORGN0_SHIFT)
#define TCR_ORGN0_WBnWA (UL(3) << TCR_ORGN0_SHIFT)
#define TCR_SH0_MASK TCR_EL1_SH0_MASK
#define TCR_SH0_INNER (TCR_EL1_SH0_INNER << TCR_EL1_SH0_SHIFT)
#define TCR_ORGN1_SHIFT 26
#define TCR_ORGN1_MASK (UL(3) << TCR_ORGN1_SHIFT)
#define TCR_ORGN1_NC (UL(0) << TCR_ORGN1_SHIFT)
#define TCR_ORGN1_WBWA (UL(1) << TCR_ORGN1_SHIFT)
#define TCR_ORGN1_WT (UL(2) << TCR_ORGN1_SHIFT)
#define TCR_ORGN1_WBnWA (UL(3) << TCR_ORGN1_SHIFT)
#define TCR_SH1_MASK TCR_EL1_SH1_MASK
#define TCR_ORGN_NC (TCR_ORGN0_NC | TCR_ORGN1_NC)
#define TCR_ORGN_WBWA (TCR_ORGN0_WBWA | TCR_ORGN1_WBWA)
#define TCR_ORGN_WT (TCR_ORGN0_WT | TCR_ORGN1_WT)
#define TCR_ORGN_WBnWA (TCR_ORGN0_WBnWA | TCR_ORGN1_WBnWA)
#define TCR_ORGN_MASK (TCR_ORGN0_MASK | TCR_ORGN1_MASK)
#define TCR_TG0_SHIFT TCR_EL1_TG0_SHIFT
#define TCR_TG0_MASK TCR_EL1_TG0_MASK
#define TCR_TG0_4K (TCR_EL1_TG0_4K << TCR_EL1_TG0_SHIFT)
#define TCR_TG0_64K (TCR_EL1_TG0_64K << TCR_EL1_TG0_SHIFT)
#define TCR_TG0_16K (TCR_EL1_TG0_16K << TCR_EL1_TG0_SHIFT)
#define TCR_SH0_SHIFT 12
#define TCR_SH0_MASK (UL(3) << TCR_SH0_SHIFT)
#define TCR_SH0_INNER (UL(3) << TCR_SH0_SHIFT)
#define TCR_TG1_SHIFT TCR_EL1_TG1_SHIFT
#define TCR_TG1_MASK TCR_EL1_TG1_MASK
#define TCR_TG1_16K (TCR_EL1_TG1_16K << TCR_EL1_TG1_SHIFT)
#define TCR_TG1_4K (TCR_EL1_TG1_4K << TCR_EL1_TG1_SHIFT)
#define TCR_TG1_64K (TCR_EL1_TG1_64K << TCR_EL1_TG1_SHIFT)
#define TCR_SH1_SHIFT 28
#define TCR_SH1_MASK (UL(3) << TCR_SH1_SHIFT)
#define TCR_SH1_INNER (UL(3) << TCR_SH1_SHIFT)
#define TCR_SHARED (TCR_SH0_INNER | TCR_SH1_INNER)
#define TCR_TG0_SHIFT 14
#define TCR_TG0_MASK (UL(3) << TCR_TG0_SHIFT)
#define TCR_TG0_4K (UL(0) << TCR_TG0_SHIFT)
#define TCR_TG0_64K (UL(1) << TCR_TG0_SHIFT)
#define TCR_TG0_16K (UL(2) << TCR_TG0_SHIFT)
#define TCR_TG1_SHIFT 30
#define TCR_TG1_MASK (UL(3) << TCR_TG1_SHIFT)
#define TCR_TG1_16K (UL(1) << TCR_TG1_SHIFT)
#define TCR_TG1_4K (UL(2) << TCR_TG1_SHIFT)
#define TCR_TG1_64K (UL(3) << TCR_TG1_SHIFT)
#define TCR_IPS_SHIFT 32
#define TCR_IPS_MASK (UL(7) << TCR_IPS_SHIFT)
#define TCR_A1 (UL(1) << 22)
#define TCR_ASID16 (UL(1) << 36)
#define TCR_TBI0 (UL(1) << 37)
#define TCR_TBI1 (UL(1) << 38)
#define TCR_HA (UL(1) << 39)
#define TCR_HD (UL(1) << 40)
#define TCR_HPD0_SHIFT 41
#define TCR_HPD0 (UL(1) << TCR_HPD0_SHIFT)
#define TCR_HPD1_SHIFT 42
#define TCR_HPD1 (UL(1) << TCR_HPD1_SHIFT)
#define TCR_TBID0 (UL(1) << 51)
#define TCR_TBID1 (UL(1) << 52)
#define TCR_NFD0 (UL(1) << 53)
#define TCR_NFD1 (UL(1) << 54)
#define TCR_E0PD0 (UL(1) << 55)
#define TCR_E0PD1 (UL(1) << 56)
#define TCR_TCMA0 (UL(1) << 57)
#define TCR_TCMA1 (UL(1) << 58)
#define TCR_DS (UL(1) << 59)
#define TCR_IPS_SHIFT TCR_EL1_IPS_SHIFT
#define TCR_IPS_MASK TCR_EL1_IPS_MASK
#define TCR_A1 TCR_EL1_A1
#define TCR_ASID16 TCR_EL1_AS
#define TCR_TBI0 TCR_EL1_TBI0
#define TCR_TBI1 TCR_EL1_TBI1
#define TCR_HA TCR_EL1_HA
#define TCR_HD TCR_EL1_HD
#define TCR_HPD0 TCR_EL1_HPD0
#define TCR_HPD1 TCR_EL1_HPD1
#define TCR_TBID0 TCR_EL1_TBID0
#define TCR_TBID1 TCR_EL1_TBID1
#define TCR_E0PD0 TCR_EL1_E0PD0
#define TCR_E0PD1 TCR_EL1_E0PD1
#define TCR_DS TCR_EL1_DS
/*
* TTBR.

View File

@@ -62,7 +62,7 @@
#define _PAGE_READONLY_EXEC (_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN)
#define _PAGE_EXECONLY (_PAGE_DEFAULT | PTE_RDONLY | PTE_NG | PTE_PXN)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/cpufeature.h>
#include <asm/pgtable-types.h>
@@ -84,7 +84,7 @@ extern unsigned long prot_ns_shared;
#else
static inline bool __pure lpa2_is_enabled(void)
{
return read_tcr() & TCR_DS;
return read_tcr() & TCR_EL1_DS;
}
#define PTE_MAYBE_SHARED (lpa2_is_enabled() ? 0 : PTE_SHARED)
@@ -127,7 +127,7 @@ static inline bool __pure lpa2_is_enabled(void)
#define PAGE_READONLY_EXEC __pgprot(_PAGE_READONLY_EXEC)
#define PAGE_EXECONLY __pgprot(_PAGE_EXECONLY)
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#define pte_pi_index(pte) ( \
((pte & BIT(PTE_PI_IDX_3)) >> (PTE_PI_IDX_3 - 3)) | \

View File

@@ -30,7 +30,7 @@
#define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/cmpxchg.h>
#include <asm/fixmap.h>
@@ -130,12 +130,16 @@ static inline void arch_leave_lazy_mmu_mode(void)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/*
* Outside of a few very special situations (e.g. hibernation), we always
* use broadcast TLB invalidation instructions, therefore a spurious page
* fault on one CPU which has been handled concurrently by another CPU
* does not need to perform additional invalidation.
* We use local TLB invalidation instruction when reusing page in
* write protection fault handler to avoid TLBI broadcast in the hot
* path. This will cause spurious page faults if stale read-only TLB
* entries exist.
*/
#define flush_tlb_fix_spurious_fault(vma, address, ptep) do { } while (0)
#define flush_tlb_fix_spurious_fault(vma, address, ptep) \
local_flush_tlb_page_nonotify(vma, address)
#define flush_tlb_fix_spurious_fault_pmd(vma, address, pmdp) \
local_flush_tlb_page_nonotify(vma, address)
/*
* ZERO_PAGE is a global shared page that is always zero: used
@@ -433,7 +437,7 @@ bool pgattr_change_is_safe(pteval_t old, pteval_t new);
* 1 0 | 1 0 1
* 1 1 | 0 1 x
*
* When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via
* When hardware DBM is not present, the software PTE_DIRTY bit is updated via
* the page fault mechanism. Checking the dirty status of a pte becomes:
*
* PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
@@ -599,7 +603,7 @@ static inline int pte_protnone(pte_t pte)
/*
* pte_present_invalid() tells us that the pte is invalid from HW
* perspective but present from SW perspective, so the fields are to be
* interpretted as per the HW layout. The second 2 checks are the unique
* interpreted as per the HW layout. The second 2 checks are the unique
* encoding that we use for PROT_NONE. It is insufficient to only use
* the first check because we share the same encoding scheme with pmds
* which support pmd_mkinvalid(), so can be present-invalid without
@@ -1949,6 +1953,6 @@ static inline void clear_young_dirty_ptes(struct vm_area_struct *vma,
#endif /* CONFIG_ARM64_CONTPTE */
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif /* __ASM_PGTABLE_H */

View File

@@ -9,7 +9,7 @@
#ifndef __ASM_PROCFNS_H
#define __ASM_PROCFNS_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/page.h>
@@ -21,5 +21,5 @@ extern u64 cpu_do_resume(phys_addr_t ptr, u64 idmap_ttbr);
#include <asm/memory.h>
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_PROCFNS_H */

View File

@@ -25,7 +25,7 @@
#define MTE_CTRL_STORE_ONLY (1UL << 19)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/build_bug.h>
#include <linux/cache.h>
@@ -437,5 +437,5 @@ int set_tsc_mode(unsigned int val);
#define GET_TSC_CTL(adr) get_tsc_mode((adr))
#define SET_TSC_CTL(val) set_tsc_mode((val))
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_PROCESSOR_H */

View File

@@ -94,7 +94,7 @@
*/
#define NO_SYSCALL (-1)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/bug.h>
#include <linux/types.h>
@@ -361,5 +361,5 @@ static inline void procedure_link_pointer_set(struct pt_regs *regs,
extern unsigned long profile_pc(struct pt_regs *regs);
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif

View File

@@ -122,7 +122,7 @@
*/
#define SMC_RSI_ATTESTATION_TOKEN_CONTINUE SMC_RSI_FID(0x195)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
struct realm_config {
union {
@@ -142,7 +142,7 @@ struct realm_config {
*/
} __aligned(0x1000);
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
/*
* Read configuration for the current Realm.

View File

@@ -5,7 +5,7 @@
#ifndef __ASM_RWONCE_H
#define __ASM_RWONCE_H
#if defined(CONFIG_LTO) && !defined(__ASSEMBLY__)
#if defined(CONFIG_LTO) && !defined(__ASSEMBLER__)
#include <linux/compiler_types.h>
#include <asm/alternative-macros.h>
@@ -62,7 +62,7 @@
})
#endif /* !BUILD_VDSO */
#endif /* CONFIG_LTO && !__ASSEMBLY__ */
#endif /* CONFIG_LTO && !__ASSEMBLER__ */
#include <asm-generic/rwonce.h>

View File

@@ -2,7 +2,7 @@
#ifndef _ASM_SCS_H
#define _ASM_SCS_H
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
#include <asm/asm-offsets.h>
#include <asm/sysreg.h>
@@ -55,6 +55,6 @@ enum {
int __pi_scs_patch(const u8 eh_frame[], int size, bool skip_dry_run);
#endif /* __ASSEMBLY __ */
#endif /* __ASSEMBLER__ */
#endif /* _ASM_SCS_H */

View File

@@ -9,7 +9,7 @@
#define SDEI_STACK_SIZE IRQ_STACK_SIZE
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/linkage.h>
#include <linux/preempt.h>
@@ -49,5 +49,5 @@ unsigned long do_sdei_event(struct pt_regs *regs,
unsigned long sdei_arch_get_entry_point(int conduit);
#define sdei_arch_get_entry_point(x) sdei_arch_get_entry_point(x)
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_SDEI_H */

View File

@@ -29,7 +29,7 @@ static __must_check inline bool may_use_simd(void)
*/
return !WARN_ON(!system_capabilities_finalized()) &&
system_supports_fpsimd() &&
!in_hardirq() && !irqs_disabled() && !in_nmi();
!in_hardirq() && !in_nmi();
}
#else /* ! CONFIG_KERNEL_MODE_NEON */

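The hunk above drops the irqs_disabled() check from may_use_simd(). As a
hedged usage sketch (the function and workload are invented), callers keep
the same pattern: gate the NEON region on may_use_simd(), wrap it in
kernel_neon_begin()/kernel_neon_end(), and provide a scalar fallback:

#include <linux/types.h>
#include <asm/neon.h>
#include <asm/simd.h>

/* Invented example: XOR two buffers, preferring an FPSIMD/NEON path. */
static void xor_buf(u8 *dst, const u8 *src, size_t len)
{
	size_t i;

	if (may_use_simd()) {
		kernel_neon_begin();
		for (i = 0; i < len; i++)	/* stand-in for a real NEON loop */
			dst[i] ^= src[i];
		kernel_neon_end();
		return;
	}

	for (i = 0; i < len; i++)		/* scalar fallback */
		dst[i] ^= src[i];
}
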
View File

@@ -23,7 +23,7 @@
#define CPU_STUCK_REASON_52_BIT_VA (UL(1) << CPU_STUCK_REASON_SHIFT)
#define CPU_STUCK_REASON_NO_GRAN (UL(2) << CPU_STUCK_REASON_SHIFT)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/threads.h>
#include <linux/cpumask.h>
@@ -155,6 +155,6 @@ bool cpus_are_stuck_in_kernel(void);
extern void crash_smp_send_stop(void);
extern bool smp_crash_stop_failed(void);
#endif /* ifndef __ASSEMBLY__ */
#endif /* ifndef __ASSEMBLER__ */
#endif /* ifndef __ASM_SMP_H */

View File

@@ -12,7 +12,7 @@
#define BP_HARDEN_EL2_SLOTS 4
#define __BP_HARDEN_HYP_VECS_SZ ((BP_HARDEN_EL2_SLOTS - 1) * SZ_2K)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/smp.h>
#include <asm/percpu.h>
@@ -119,5 +119,5 @@ void spectre_bhb_patch_clearbhb(struct alt_instr *alt,
__le32 *origptr, __le32 *updptr, int nr_inst);
void spectre_print_disabled_mitigations(void);
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_SPECTRE_H */

View File

@@ -25,7 +25,7 @@
#define FRAME_META_TYPE_FINAL 1
#define FRAME_META_TYPE_PT_REGS 2
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
/*
* A standard AAPCS64 frame record.
*/
@@ -43,6 +43,6 @@ struct frame_record_meta {
struct frame_record record;
u64 type;
};
#endif /* __ASSEMBLY */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_STACKTRACE_FRAME_H */

View File

@@ -23,7 +23,7 @@ struct cpu_suspend_ctx {
* __cpu_suspend_enter()'s caller, and populated by __cpu_suspend_enter().
* This data must survive until cpu_resume() is called.
*
* This struct desribes the size and the layout of the saved cpu state.
* This struct describes the size and the layout of the saved cpu state.
* The layout of the callee_saved_regs is defined by the implementation
* of __cpu_suspend_enter(), and cpu_resume(). This struct must be passed
* in by the caller as __cpu_suspend_enter()'s stack-frame is gone once it

View File

@@ -52,7 +52,7 @@
#ifndef CONFIG_BROKEN_GAS_INST
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
// The space separator is omitted so that __emit_inst(x) can be parsed as
// either an assembler directive or an assembler macro argument.
#define __emit_inst(x) .inst(x)
@@ -71,11 +71,11 @@
(((x) >> 24) & 0x000000ff))
#endif /* CONFIG_CPU_BIG_ENDIAN */
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
#define __emit_inst(x) .long __INSTR_BSWAP(x)
#else /* __ASSEMBLY__ */
#else /* __ASSEMBLER__ */
#define __emit_inst(x) ".long " __stringify(__INSTR_BSWAP(x)) "\n\t"
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* CONFIG_BROKEN_GAS_INST */
@@ -1129,9 +1129,7 @@
#define gicr_insn(insn) read_sysreg_s(GICV5_OP_GICR_##insn)
#define gic_insn(v, insn) write_sysreg_s(v, GICV5_OP_GIC_##insn)
#define ARM64_FEATURE_FIELD_BITS 4
#ifdef __ASSEMBLY__
#ifdef __ASSEMBLER__
.macro mrs_s, rt, sreg
__emit_inst(0xd5200000|(\sreg)|(.L__gpr_num_\rt))
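The big-endian branch above has to byte-swap each 32-bit opcode because AArch64 instruction memory is always little-endian even under CONFIG_CPU_BIG_ENDIAN; the visible (((x) >> 24) & 0x000000ff) term is the last byte of that swap. A standalone sketch of the same arithmetic (INSTR_BSWAP below is a local stand-in, not the kernel macro):

/* bswap_demo.c -- standalone sketch of the 32-bit instruction byte swap */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define INSTR_BSWAP(x) (((((uint32_t)(x)) << 24) & 0xff000000u) | \
			((((uint32_t)(x)) <<  8) & 0x00ff0000u) | \
			((((uint32_t)(x)) >>  8) & 0x0000ff00u) | \
			((((uint32_t)(x)) >> 24) & 0x000000ffu))

int main(void)
{
	uint32_t nop = 0xd503201f;	/* AArch64 NOP encoding */

	/* the hand-rolled swap must agree with the compiler builtin */
	assert(INSTR_BSWAP(nop) == __builtin_bswap32(nop));
	printf("%08x -> %08x\n", nop, INSTR_BSWAP(nop));
	return 0;
}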

View File

@@ -7,7 +7,7 @@
#ifndef __ASM_SYSTEM_MISC_H
#define __ASM_SYSTEM_MISC_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/compiler.h>
#include <linux/linkage.h>
@@ -28,6 +28,6 @@ void arm64_notify_die(const char *str, struct pt_regs *regs,
struct mm_struct;
extern void __show_regs(struct pt_regs *);
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_SYSTEM_MISC_H */

View File

@@ -10,7 +10,7 @@
#include <linux/compiler.h>
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
struct task_struct;

View File

@@ -8,7 +8,7 @@
#ifndef __ASM_TLBFLUSH_H
#define __ASM_TLBFLUSH_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/bitfield.h>
#include <linux/mm_types.h>
@@ -249,6 +249,19 @@ static inline unsigned long get_trans_granule(void)
* cannot be easily determined, the value TLBI_TTL_UNKNOWN will
* perform a non-hinted invalidation.
*
* local_flush_tlb_page(vma, addr)
* Local variant of flush_tlb_page(). Stale TLB entries may
* remain in remote CPUs.
*
* local_flush_tlb_page_nonotify(vma, addr)
* Same as local_flush_tlb_page() except MMU notifier will not be
* called.
*
* local_flush_tlb_contpte(vma, addr)
* Invalidate the virtual-address range
* '[addr, addr+CONT_PTE_SIZE)' mapped with contpte on local CPU
* for the user address space corresponding to 'vma->mm'. Stale
* TLB entries may remain in remote CPUs.
*
* Finally, take a look at asm/tlb.h to see how tlb_flush() is implemented
* on top of these routines, since that is our interface to the mmu_gather
@@ -282,6 +295,33 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
}
static inline void __local_flush_tlb_page_nonotify_nosync(struct mm_struct *mm,
unsigned long uaddr)
{
unsigned long addr;
dsb(nshst);
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1, addr);
__tlbi_user(vale1, addr);
}
static inline void local_flush_tlb_page_nonotify(struct vm_area_struct *vma,
unsigned long uaddr)
{
__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
dsb(nsh);
}
static inline void local_flush_tlb_page(struct vm_area_struct *vma,
unsigned long uaddr)
{
__local_flush_tlb_page_nonotify_nosync(vma->vm_mm, uaddr);
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, uaddr & PAGE_MASK,
(uaddr & PAGE_MASK) + PAGE_SIZE);
dsb(nsh);
}
static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
unsigned long uaddr)
{
@@ -472,6 +512,22 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
dsb(ish);
}
static inline void local_flush_tlb_contpte(struct vm_area_struct *vma,
unsigned long addr)
{
unsigned long asid;
addr = round_down(addr, CONT_PTE_SIZE);
dsb(nshst);
asid = ASID(vma->vm_mm);
__flush_tlb_range_op(vale1, addr, CONT_PTES, PAGE_SIZE, asid,
3, true, lpa2_is_enabled());
mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, addr,
addr + CONT_PTE_SIZE);
dsb(nsh);
}
static inline void flush_tlb_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
@@ -524,6 +580,33 @@ static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *b
{
__flush_tlb_range_nosync(mm, start, end, PAGE_SIZE, true, 3);
}
static inline bool __pte_flags_need_flush(ptdesc_t oldval, ptdesc_t newval)
{
ptdesc_t diff = oldval ^ newval;
/* invalid to valid transition requires no flush */
if (!(oldval & PTE_VALID))
return false;
/* Transition in the SW bits requires no flush */
diff &= ~PTE_SWBITS_MASK;
return diff;
}
static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
{
return __pte_flags_need_flush(pte_val(oldpte), pte_val(newpte));
}
#define pte_needs_flush pte_needs_flush
static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
{
return __pte_flags_need_flush(pmd_val(oldpmd), pmd_val(newpmd));
}
#define huge_pmd_needs_flush huge_pmd_needs_flush
#endif
#endif
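The new pte_needs_flush()/huge_pmd_needs_flush() helpers let the generic write-fault path skip TLB invalidation when an entry changes in ways the TLB can never have cached: the old entry was invalid, or only software bits differ. A standalone sketch of that decision (the DEMO_* bit positions are placeholders, not arm64's real PTE layout):

/* pte_flush_demo.c -- standalone sketch of the "does this PTE change need a flush?" test */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PTE_VALID		(1ULL << 0)	/* placeholder valid bit */
#define DEMO_PTE_SWBITS_MASK	(0xfULL << 55)	/* placeholder software-bit field */

static bool pte_flags_need_flush(uint64_t oldval, uint64_t newval)
{
	uint64_t diff = oldval ^ newval;

	if (!(oldval & DEMO_PTE_VALID))		/* invalid -> anything: nothing was cached */
		return false;
	diff &= ~DEMO_PTE_SWBITS_MASK;		/* software bits never reach the TLB */
	return diff != 0;
}

int main(void)
{
	uint64_t valid = DEMO_PTE_VALID | (0x1234ULL << 12);

	assert(!pte_flags_need_flush(0, valid));			/* invalid -> valid  */
	assert(!pte_flags_need_flush(valid, valid | (1ULL << 55)));	/* SW bit only       */
	assert(pte_flags_need_flush(valid, valid | (1ULL << 7)));	/* HW attribute change */
	return 0;
}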

View File

@@ -422,9 +422,9 @@ static __must_check __always_inline bool user_access_begin(const void __user *pt
}
#define user_access_begin(a,b) user_access_begin(a,b)
#define user_access_end() uaccess_ttbr0_disable()
#define unsafe_put_user(x, ptr, label) \
#define arch_unsafe_put_user(x, ptr, label) \
__raw_put_mem("sttr", x, uaccess_mask_ptr(ptr), label, U)
#define unsafe_get_user(x, ptr, label) \
#define arch_unsafe_get_user(x, ptr, label) \
__raw_get_mem("ldtr", x, uaccess_mask_ptr(ptr), label, U)
/*
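Renaming the accessors to arch_unsafe_put_user()/arch_unsafe_get_user() hands the unsafe_* names over to the generic uaccess layer; callers keep the usual pattern of bracketing them with user_access_begin()/user_access_end(). A hedged sketch of that caller-side pattern (copy_flag_from_user() is illustrative, not part of this series):

/* sketch: kernel-side use of the generic unsafe accessors (illustrative) */
#include <linux/types.h>
#include <linux/uaccess.h>

static int copy_flag_from_user(const u32 __user *uptr, u32 *out)
{
	u32 val;

	if (!user_access_begin(uptr, sizeof(*uptr)))
		return -EFAULT;
	unsafe_get_user(val, uptr, Efault);	/* jumps to Efault on fault */
	user_access_end();

	*out = val;
	return 0;

Efault:
	user_access_end();
	return -EFAULT;
}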

View File

@@ -7,7 +7,7 @@
#define __VDSO_PAGES 4
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <generated/vdso-offsets.h>
@@ -19,6 +19,6 @@
extern char vdso_start[], vdso_end[];
extern char vdso32_start[], vdso32_end[];
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif /* __ASM_VDSO_H */

View File

@@ -5,7 +5,7 @@
#ifndef __COMPAT_BARRIER_H
#define __COMPAT_BARRIER_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
/*
* Warning: This code is meant to be used from the compat vDSO only.
*/
@@ -31,6 +31,6 @@
#define smp_rmb() aarch32_smp_rmb()
#define smp_wmb() aarch32_smp_wmb()
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif /* __COMPAT_BARRIER_H */

View File

@@ -5,7 +5,7 @@
#ifndef __ASM_VDSO_COMPAT_GETTIMEOFDAY_H
#define __ASM_VDSO_COMPAT_GETTIMEOFDAY_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/barrier.h>
#include <asm/unistd_compat_32.h>
@@ -161,6 +161,6 @@ static inline bool vdso_clocksource_ok(const struct vdso_clock *vc)
}
#define vdso_clocksource_ok vdso_clocksource_ok
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif /* __ASM_VDSO_COMPAT_GETTIMEOFDAY_H */

View File

@@ -3,7 +3,7 @@
#ifndef __ASM_VDSO_GETRANDOM_H
#define __ASM_VDSO_GETRANDOM_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/unistd.h>
#include <asm/vdso/vsyscall.h>
@@ -33,6 +33,6 @@ static __always_inline ssize_t getrandom_syscall(void *_buffer, size_t _len, uns
return ret;
}
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif /* __ASM_VDSO_GETRANDOM_H */

View File

@@ -7,7 +7,7 @@
#ifdef __aarch64__
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/alternative.h>
#include <asm/arch_timer.h>
@@ -96,7 +96,7 @@ static __always_inline const struct vdso_time_data *__arch_get_vdso_u_time_data(
#define __arch_get_vdso_u_time_data __arch_get_vdso_u_time_data
#endif /* IS_ENABLED(CONFIG_CC_IS_GCC) && IS_ENABLED(CONFIG_PAGE_SIZE_64KB) */
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#else /* !__aarch64__ */

View File

@@ -5,13 +5,13 @@
#ifndef __ASM_VDSO_PROCESSOR_H
#define __ASM_VDSO_PROCESSOR_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
static inline void cpu_relax(void)
{
asm volatile("yield" ::: "memory");
}
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* __ASM_VDSO_PROCESSOR_H */

View File

@@ -2,7 +2,7 @@
#ifndef __ASM_VDSO_VSYSCALL_H
#define __ASM_VDSO_VSYSCALL_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <vdso/datapage.h>
@@ -22,6 +22,6 @@ void __arch_update_vdso_clock(struct vdso_clock *vc)
/* The asm-generic header needs to be included after the definitions above */
#include <asm-generic/vdso/vsyscall.h>
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#endif /* __ASM_VDSO_VSYSCALL_H */

View File

@@ -56,7 +56,7 @@
*/
#define BOOT_CPU_FLAG_E2H BIT_ULL(32)
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <asm/ptrace.h>
#include <asm/sections.h>
@@ -161,6 +161,6 @@ static inline bool is_hyp_nvhe(void)
return is_hyp_mode_available() && !is_kernel_in_hyp_mode();
}
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* ! __ASM__VIRT_H */

View File

@@ -3,9 +3,7 @@
#ifndef __ASM_VMAP_STACK_H
#define __ASM_VMAP_STACK_H
#include <linux/bug.h>
#include <linux/gfp.h>
#include <linux/kconfig.h>
#include <linux/vmalloc.h>
#include <linux/pgtable.h>
#include <asm/memory.h>
@@ -19,8 +17,6 @@ static inline unsigned long *arch_alloc_vmap_stack(size_t stack_size, int node)
{
void *p;
BUILD_BUG_ON(!IS_ENABLED(CONFIG_VMAP_STACK));
p = __vmalloc_node(stack_size, THREAD_ALIGN, THREADINFO_GFP, node,
__builtin_return_address(0));
return kasan_reset_tag(p);

View File

@@ -31,7 +31,7 @@
#define KVM_SPSR_FIQ 4
#define KVM_NR_SPSR 5
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/psci.h>
#include <linux/types.h>
#include <asm/ptrace.h>

View File

@@ -80,7 +80,7 @@
#define PTRACE_PEEKMTETAGS 33
#define PTRACE_POKEMTETAGS 34
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
/*
* User structures for general purpose, floating point and debug registers.
@@ -332,6 +332,6 @@ struct user_gcs {
__u64 gcspr_el0;
};
#endif /* __ASSEMBLY__ */
#endif /* __ASSEMBLER__ */
#endif /* _UAPI__ASM_PTRACE_H */

View File

@@ -17,7 +17,7 @@
#ifndef _UAPI__ASM_SIGCONTEXT_H
#define _UAPI__ASM_SIGCONTEXT_H
#ifndef __ASSEMBLY__
#ifndef __ASSEMBLER__
#include <linux/types.h>
@@ -192,7 +192,7 @@ struct gcs_context {
__u64 reserved;
};
#endif /* !__ASSEMBLY__ */
#endif /* !__ASSEMBLER__ */
#include <asm/sve_context.h>

View File

@@ -133,7 +133,7 @@ static int __init acpi_fadt_sanity_check(void)
/*
* FADT is required on arm64; retrieve it to check its presence
* and carry out revision and ACPI HW reduced compliancy tests
* and carry out revision and ACPI HW reduced compliance tests
*/
status = acpi_get_table(ACPI_SIG_FADT, 0, &table);
if (ACPI_FAILURE(status)) {
@@ -423,7 +423,7 @@ int apei_claim_sea(struct pt_regs *regs)
irq_work_run();
__irq_exit();
} else {
pr_warn_ratelimited("APEI work queued but not completed");
pr_warn_ratelimited("APEI work queued but not completed\n");
err = -EINPROGRESS;
}
}

View File

@@ -1003,7 +1003,7 @@ static void __init sort_ftr_regs(void)
/*
* Initialise the CPU feature register from Boot CPU values.
* Also initiliases the strict_mask for the register.
* Also initialises the strict_mask for the register.
* Any bits that are not covered by an arm64_ftr_bits entry are considered
* RES0 for the system-wide value, and must strictly match.
*/
@@ -1970,7 +1970,7 @@ static struct cpumask dbm_cpus __read_mostly;
static inline void __cpu_enable_hw_dbm(void)
{
u64 tcr = read_sysreg(tcr_el1) | TCR_HD;
u64 tcr = read_sysreg(tcr_el1) | TCR_EL1_HD;
write_sysreg(tcr, tcr_el1);
isb();
@@ -2256,7 +2256,7 @@ static bool has_generic_auth(const struct arm64_cpu_capabilities *entry,
static void cpu_enable_e0pd(struct arm64_cpu_capabilities const *cap)
{
if (this_cpu_has_cap(ARM64_HAS_E0PD))
sysreg_clear_set(tcr_el1, 0, TCR_E0PD1);
sysreg_clear_set(tcr_el1, 0, TCR_EL1_E0PD1);
}
#endif /* CONFIG_ARM64_E0PD */

View File

@@ -10,6 +10,7 @@
#include <linux/efi.h>
#include <linux/init.h>
#include <linux/kmemleak.h>
#include <linux/kthread.h>
#include <linux/screen_info.h>
#include <linux/vmalloc.h>
@@ -165,20 +166,53 @@ asmlinkage efi_status_t efi_handle_corrupted_x18(efi_status_t s, const char *f)
return s;
}
static DEFINE_RAW_SPINLOCK(efi_rt_lock);
void arch_efi_call_virt_setup(void)
{
efi_virtmap_load();
raw_spin_lock(&efi_rt_lock);
efi_runtime_assert_lock_held();
if (preemptible() && (current->flags & PF_KTHREAD)) {
/*
* Disable migration to ensure that a preempted EFI runtime
* service call will be resumed on the same CPU. This avoids
* potential issues with EFI runtime calls that are preempted
* while polling for an asynchronous completion of a secure
* firmware call, which may not permit the CPU to change.
*/
migrate_disable();
kthread_use_mm(&efi_mm);
} else {
efi_virtmap_load();
}
/*
* Enable access to the valid TTBR0_EL1 and invoke the errata
* workaround directly since there is no return from exception when
* invoking the EFI run-time services.
*/
uaccess_ttbr0_enable();
post_ttbr_update_workaround();
__efi_fpsimd_begin();
}
void arch_efi_call_virt_teardown(void)
{
__efi_fpsimd_end();
raw_spin_unlock(&efi_rt_lock);
efi_virtmap_unload();
/*
* Defer the switch to the current thread's TTBR0_EL1 until
* uaccess_enable(). Do so before efi_virtmap_unload() updates the
* saved TTBR0 value, so the userland page tables are not activated
* inadvertently over the back of an exception.
*/
uaccess_ttbr0_disable();
if (preemptible() && (current->flags & PF_KTHREAD)) {
kthread_unuse_mm(&efi_mm);
migrate_enable();
} else {
efi_virtmap_unload();
}
}
asmlinkage u64 *efi_rt_stack_top __ro_after_init;
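In the preemptible case the code above has a kernel thread adopt efi_mm with kthread_use_mm() rather than switching TTBR0 by hand, and disables migration so a preempted runtime call resumes on the same CPU. A hedged sketch of that borrow/return pairing in isolation (with_borrowed_mm() is illustrative):

/* sketch: a kernel thread temporarily adopting another mm (illustrative helper) */
#include <linux/bug.h>
#include <linux/kthread.h>
#include <linux/preempt.h>
#include <linux/sched.h>

static void with_borrowed_mm(struct mm_struct *mm, void (*fn)(void *), void *arg)
{
	WARN_ON(!(current->flags & PF_KTHREAD));	/* only kthreads may adopt an mm */

	migrate_disable();		/* stay on this CPU even if preempted */
	kthread_use_mm(mm);		/* mm's user mappings become accessible */
	fn(arg);
	kthread_unuse_mm(mm);
	migrate_enable();
}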

View File

@@ -34,20 +34,12 @@
* Handle IRQ/context state management when entering from kernel mode.
* Before this function is called it is not safe to call regular kernel code,
* instrumentable code, or any code which may trigger an exception.
*
* This is intended to match the logic in irqentry_enter(), handling the kernel
* mode transitions only.
*/
static __always_inline irqentry_state_t __enter_from_kernel_mode(struct pt_regs *regs)
{
return irqentry_enter(regs);
}
static noinstr irqentry_state_t enter_from_kernel_mode(struct pt_regs *regs)
{
irqentry_state_t state;
state = __enter_from_kernel_mode(regs);
state = irqentry_enter(regs);
mte_check_tfsr_entry();
mte_disable_tco_entry(current);
@@ -58,21 +50,12 @@ static noinstr irqentry_state_t enter_from_kernel_mode(struct pt_regs *regs)
* Handle IRQ/context state management when exiting to kernel mode.
* After this function returns it is not safe to call regular kernel code,
* instrumentable code, or any code which may trigger an exception.
*
* This is intended to match the logic in irqentry_exit(), handling the kernel
* mode transitions only, and with preemption handled elsewhere.
*/
static __always_inline void __exit_to_kernel_mode(struct pt_regs *regs,
irqentry_state_t state)
{
irqentry_exit(regs, state);
}
static void noinstr exit_to_kernel_mode(struct pt_regs *regs,
irqentry_state_t state)
{
mte_check_tfsr_exit();
__exit_to_kernel_mode(regs, state);
irqentry_exit(regs, state);
}
/*
@@ -80,17 +63,12 @@ static void noinstr exit_to_kernel_mode(struct pt_regs *regs,
* Before this function is called it is not safe to call regular kernel code,
* instrumentable code, or any code which may trigger an exception.
*/
static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
{
enter_from_user_mode(regs);
mte_disable_tco_entry(current);
}
static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
{
__enter_from_user_mode(regs);
}
/*
* Handle IRQ/context state management when exiting to user mode.
* After this function returns it is not safe to call regular kernel code,
@@ -100,7 +78,7 @@ static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs)
{
local_irq_disable();
exit_to_user_mode_prepare(regs);
exit_to_user_mode_prepare_legacy(regs);
local_daif_mask();
mte_check_tfsr_exit();
exit_to_user_mode();
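With the one-line wrappers gone, the arm64 kernel-mode entry/exit paths call the generic helpers directly. A hedged sketch of the bracket those helpers provide around an exception body (handle_some_kernel_exception() is illustrative):

/* sketch: generic kernel-entry bracket around an exception handler (illustrative) */
#include <linux/entry-common.h>

static void handle_some_kernel_exception(struct pt_regs *regs)
{
	irqentry_state_t state = irqentry_enter(regs);	/* RCU/lockdep/context tracking in */

	/* ... architecture-specific handling runs here ... */

	irqentry_exit(regs, state);			/* undo in reverse, may reschedule */
}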

View File

@@ -94,7 +94,7 @@ SYM_CODE_START(ftrace_caller)
stp x29, x30, [sp, #FREGS_SIZE]
add x29, sp, #FREGS_SIZE
/* Prepare arguments for the the tracer func */
/* Prepare arguments for the tracer func */
sub x0, x30, #AARCH64_INSN_SIZE // ip (callsite's BL insn)
mov x1, x9 // parent_ip (callsite's LR)
mov x3, sp // regs

View File

@@ -225,10 +225,21 @@ static void fpsimd_bind_task_to_cpu(void);
*/
static void get_cpu_fpsimd_context(void)
{
if (!IS_ENABLED(CONFIG_PREEMPT_RT))
local_bh_disable();
else
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
/*
* The softirq subsystem lacks a true unmask/mask API, and
* re-enabling softirq processing using local_bh_enable() will
* not only unmask softirqs, it will also result in immediate
* delivery of any pending softirqs.
* This is undesirable when running with IRQs disabled, but in
* that case, there is no need to mask softirqs in the first
* place, so only bother doing so when IRQs are enabled.
*/
if (!irqs_disabled())
local_bh_disable();
} else {
preempt_disable();
}
}
/*
@@ -240,10 +251,12 @@ static void get_cpu_fpsimd_context(void)
*/
static void put_cpu_fpsimd_context(void)
{
if (!IS_ENABLED(CONFIG_PREEMPT_RT))
local_bh_enable();
else
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
if (!irqs_disabled())
local_bh_enable();
} else {
preempt_enable();
}
}
unsigned int task_get_vl(const struct task_struct *task, enum vec_type type)
@@ -1934,11 +1947,11 @@ void __efi_fpsimd_begin(void)
if (!system_supports_fpsimd())
return;
WARN_ON(preemptible());
if (may_use_simd()) {
kernel_neon_begin();
} else {
WARN_ON(preemptible());
/*
* If !efi_sve_state, SVE can't be in use yet and doesn't need
* preserving:
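Because get_cpu_fpsimd_context() now masks softirqs only when IRQs are enabled, may_use_simd() (earlier in this diff) can stop refusing SIMD in IRQ-disabled context; kernel-mode NEON users are unchanged and keep the usual guard-then-bracket pattern. A hedged sketch of that caller-side pattern (do_neon_work()/do_scalar_work() are illustrative stand-ins):

/* sketch: caller-side pattern for kernel-mode SIMD (helpers are illustrative) */
#include <asm/neon.h>
#include <asm/simd.h>

void do_neon_work(void *buf, int len);		/* illustrative NEON implementation */
void do_scalar_work(void *buf, int len);	/* illustrative integer fallback */

static void crunch(void *buf, int len)
{
	if (may_use_simd()) {
		kernel_neon_begin();		/* claim the CPU's FPSIMD/SVE state */
		do_neon_work(buf, len);
		kernel_neon_end();
	} else {
		do_scalar_work(buf, len);	/* SIMD unusable, e.g. softirq/NMI context */
	}
}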

View File

@@ -492,7 +492,7 @@ int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec,
return ret;
/*
* When using mcount, callsites in modules may have been initalized to
* When using mcount, callsites in modules may have been initialized to
* call an arbitrary module PLT (which redirects to the _mcount stub)
* rather than the ftrace PLT we'll use at runtime (which redirects to
* the ftrace trampoline). We can ignore the old PLT when initializing

View File

@@ -62,7 +62,7 @@ static void __init init_irq_stacks(void)
}
}
#ifndef CONFIG_PREEMPT_RT
#ifdef CONFIG_SOFTIRQ_ON_OWN_STACK
static void ____do_softirq(struct pt_regs *regs)
{
__do_softirq();

View File

@@ -251,7 +251,7 @@ void crash_post_resume(void)
* marked as Reserved as memory was allocated via memblock_reserve().
*
* In hibernation, the pages which are Reserved and yet "nosave" are excluded
* from the hibernation iamge. crash_is_nosave() does thich check for crash
* from the hibernation image. crash_is_nosave() does thich check for crash
* dump kernel and will reduce the total size of hibernation image.
*/

View File

@@ -141,13 +141,13 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
static void noinline __section(".idmap.text") set_ttbr0_for_lpa2(phys_addr_t ttbr)
{
u64 sctlr = read_sysreg(sctlr_el1);
u64 tcr = read_sysreg(tcr_el1) | TCR_DS;
u64 tcr = read_sysreg(tcr_el1) | TCR_EL1_DS;
u64 mmfr0 = read_sysreg(id_aa64mmfr0_el1);
u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
ID_AA64MMFR0_EL1_PARANGE_SHIFT);
tcr &= ~TCR_IPS_MASK;
tcr |= parange << TCR_IPS_SHIFT;
tcr &= ~TCR_EL1_IPS_MASK;
tcr |= parange << TCR_EL1_IPS_SHIFT;
asm(" msr sctlr_el1, %0 ;"
" isb ;"
@@ -263,7 +263,7 @@ asmlinkage void __init early_map_kernel(u64 boot_status, phys_addr_t fdt)
}
if (va_bits > VA_BITS_MIN)
sysreg_clear_set(tcr_el1, TCR_T1SZ_MASK, TCR_T1SZ(va_bits));
sysreg_clear_set(tcr_el1, TCR_EL1_T1SZ_MASK, TCR_T1SZ(va_bits));
/*
* The virtual KASLR displacement modulo 2MiB is decided by the

View File

@@ -131,7 +131,7 @@ void arch_uprobe_abort_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
struct uprobe_task *utask = current->utask;
/*
* Task has received a fatal signal, so reset back to probbed
* Task has received a fatal signal, so reset back to probed
* address.
*/
instruction_pointer_set(regs, utask->vaddr);

View File

@@ -912,13 +912,39 @@ static int sve_set_common(struct task_struct *target,
return -EINVAL;
/*
* Apart from SVE_PT_REGS_MASK, all SVE_PT_* flags are consumed by
* vec_set_vector_length(), which will also validate them for us:
* On systems without SVE we accept FPSIMD format writes with
* a VL of 0 to allow exiting streaming mode, otherwise a VL
* is required.
*/
ret = vec_set_vector_length(target, type, header.vl,
((unsigned long)header.flags & ~SVE_PT_REGS_MASK) << 16);
if (ret)
return ret;
if (header.vl) {
/*
* If the system does not support SVE we can't
* configure a SVE VL.
*/
if (!system_supports_sve() && type == ARM64_VEC_SVE)
return -EINVAL;
/*
* Apart from SVE_PT_REGS_MASK, all SVE_PT_* flags are
* consumed by vec_set_vector_length(), which will
* also validate them for us:
*/
ret = vec_set_vector_length(target, type, header.vl,
((unsigned long)header.flags & ~SVE_PT_REGS_MASK) << 16);
if (ret)
return ret;
} else {
/* If the system supports SVE we require a VL. */
if (system_supports_sve())
return -EINVAL;
/*
* Only FPSIMD formatted data with no flags set is
* supported.
*/
if (header.flags != SVE_PT_REGS_FPSIMD)
return -EINVAL;
}
/* Allocate SME storage if necessary, preserving any existing ZA/ZT state */
if (type == ARM64_VEC_SME) {
@@ -1016,7 +1042,7 @@ static int sve_set(struct task_struct *target,
unsigned int pos, unsigned int count,
const void *kbuf, const void __user *ubuf)
{
if (!system_supports_sve())
if (!system_supports_sve() && !system_supports_sme())
return -EINVAL;
return sve_set_common(target, regset, pos, count, kbuf, ubuf,
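The relaxed VL check is what lets a tracer take an SME-only tracee out of streaming mode: it writes FPSIMD-formatted data with vl == 0 through the SVE regset, which was previously rejected. A rough userspace sketch of the header such a write supplies (write_fpsimd_vl0() is illustrative; a real tracer appends the struct user_fpsimd_state payload at SVE_PT_FPSIMD_OFFSET):

/* sketch: tracer exiting a tracee's streaming mode via NT_ARM_SVE (illustrative) */
#include <elf.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <asm/ptrace.h>		/* struct user_sve_header, SVE_PT_REGS_FPSIMD */

static long write_fpsimd_vl0(pid_t tracee)
{
	struct user_sve_header hdr;
	struct iovec iov = { .iov_base = &hdr, .iov_len = sizeof(hdr) };

	memset(&hdr, 0, sizeof(hdr));
	hdr.size  = sizeof(hdr);
	hdr.vl    = 0;			/* accepted only when SVE itself is not supported */
	hdr.flags = SVE_PT_REGS_FPSIMD;	/* FPSIMD-formatted payload, no other flags */

	return ptrace(PTRACE_SETREGSET, tracee, NT_ARM_SVE, &iov);
}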

Some files were not shown because too many files have changed in this diff.