linux/Documentation/driver-api/hw-recoverable-errors.rst
Breno Leitao 3fa805c37d vmcoreinfo: track and log recoverable hardware errors
Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that are visible to the OS but do not cause a panic)
and record them for vmcore consumption.  This aids post-mortem crash
analysis tools by preserving a count and timestamp for the last occurrence
of such errors.  On the other hand, correctable errors, which the OS
typically remains unaware of because the underlying hardware handles them
transparently, are less relevant for crash dump and therefore are NOT
tracked in this infrastructure.

Add centralized logging of recoverable hardware errors, grouped by the
subsystem that reported them.

hwerror_data is write-only at kernel runtime, and it is meant to be read
from vmcore using tools like crash/drgn.  For example, this is what it
looks like when opening the crash dump in drgn:

	>>> prog['hwerror_data']
	(struct hwerror_info[1]){
		{
			.count = (int)844,
			.timestamp = (time64_t)1752852018,
		},
		...

This helps fleet operators quickly triage whether a crash may be
influenced by recoverable hardware errors (which exercise uncommon code
paths in the kernel), especially when recoverable errors occurred shortly
before a panic, such as the bug fixed by commit ee62ce7a1d ("page_pool:
Track DMA-mapped pages and unmap them when destroying the pool").

This is not intended to replace full hardware diagnostics, but provides a
fast way to correlate hardware events with kernel panics.

Rare machine check exceptions, like those indicated by mce_flags.p5 or
mce_flags.winchip, are not accounted for in this method, as they fall
outside the intended usage scope for this feature's user base.

[leitao@debian.org: add hw-recoverable-errors to toctree]
  Link: https://lkml.kernel.org/r/20251127-vmcoreinfo_fix-v1-1-26f5b1c43da9@debian.org
Link: https://lkml.kernel.org/r/20251010-vmcore_hw_error-v5-1-636ede3efe44@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Tony Luck <tony.luck@intel.com>
Suggested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>	[APEI]
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Bob Moore <robert.moore@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-27 14:24:44 -08:00

.. SPDX-License-Identifier: GPL-2.0

=================================================
Recoverable Hardware Error Tracking in vmcoreinfo
=================================================

Overview
--------

This feature provides a generic infrastructure within the Linux kernel to track
and log recoverable hardware errors. These are hardware errors that are visible
to the OS and do not cause an immediate panic, but may still affect system
health, mainly because they make the kernel execute rarely used code paths.

By recording counts and timestamps of recoverable errors into the vmcoreinfo
crash dump notes, this infrastructure aids post-mortem crash analysis tools in
correlating hardware events with kernel failures. This enables faster triage
and better understanding of root causes, especially in large-scale cloud
environments where hardware issues are common.

Benefits
--------

- Facilitates correlation of recoverable hardware errors with kernel panics or
  unusual code paths that lead to system crashes.
- Gives operators and cloud providers quick insight, improving reliability
  and reducing troubleshooting time.
- Complements existing full hardware diagnostics without replacing them.

Data Exposure and Consumption
-----------------------------

- The tracked error data consists of per-error-type counts and timestamps of
  the last occurrence.
- This data is stored in the `hwerror_data` array, categorized by error source
  type: CPU, memory, PCI, CXL, and others.
- It is exposed via vmcoreinfo crash dump notes and can be read using tools
  like `crash`, `drgn`, or other kernel crash analysis utilities.
- The data cannot be read at runtime; crash dumps are the only way to access
  it.
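
The layout described in this list can be pictured with a small Python model.
This is only an illustrative sketch, not kernel code: the categories mirror the
error sources named above, while ``ErrorSource``, ``HwErrorInfo`` and
``log_error`` are hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass
from enum import IntEnum


class ErrorSource(IntEnum):
    """Hypothetical index for each error source category named above."""
    CPU = 0
    MEMORY = 1
    PCI = 2
    CXL = 3
    OTHERS = 4


@dataclass
class HwErrorInfo:
    """One slot per category: occurrence count and time of last occurrence."""
    count: int = 0
    timestamp: int = 0  # seconds since the Unix epoch


# One entry per category, analogous to the hwerror_data array.
hwerror_data = [HwErrorInfo() for _ in ErrorSource]


def log_error(source, now=None):
    """Bump the counter for a category and remember when it last fired."""
    entry = hwerror_data[source]
    entry.count += 1
    entry.timestamp = int(time.time()) if now is None else now


# Two memory errors: the count accumulates, the timestamp keeps the latest.
log_error(ErrorSource.MEMORY, now=1752852018)
log_error(ErrorSource.MEMORY, now=1752852020)
print(hwerror_data[ErrorSource.MEMORY])
```

Only the count and the most recent timestamp are kept per category, so the
model stays constant in size no matter how many errors are logged.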

Typical usage example (in drgn REPL):

.. code-block:: python

    >>> prog['hwerror_data']
    (struct hwerror_info[HWERR_RECOV_MAX]){
        {
            .count = (int)844,
            .timestamp = (time64_t)1752852018,
        },
        ...
    }
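
Once the values are read, a short helper can make them easier to interpret
during triage. The following is a minimal sketch in plain Python, assuming the
count and timestamp have already been extracted as integers. The function name
``describe_entry``, the 300-second window and the crash time are assumptions
made up for this illustration.

```python
from datetime import datetime, timezone


def describe_entry(count, timestamp, crash_time, window=300):
    """Summarize one hwerror_data entry relative to the crash time.

    `window` (seconds) is an arbitrary threshold picked for this example.
    """
    if count == 0:
        return "no recoverable errors recorded"
    last = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    delta = crash_time - timestamp
    note = "close to the crash" if 0 <= delta <= window else "not close to the crash"
    return f"{count} errors, last at {last.isoformat()} ({delta}s before crash, {note})"


# Count and timestamp are taken from the drgn output shown above. In a drgn
# session they might be read with something like
# prog['hwerror_data'][i].count.value_(), assuming plain integer fields.
# The crash time below is invented for illustration.
result = describe_entry(844, 1752852018, crash_time=1752852078)
print(result)
```

An error that last occurred shortly before the crash is a hint worth following
up, not proof that the hardware event caused the panic.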

Enabling
--------

- This feature is enabled when CONFIG_VMCORE_INFO is set.
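
Whether a given kernel was built with this option can be checked by scanning
its configuration file. The sketch below works on the file contents as a
string. Where that file lives (for example under ``/boot`` or as
``/proc/config.gz``) depends on the distribution and is not covered here, and
the helper name ``vmcore_info_enabled`` is hypothetical.

```python
import re


def vmcore_info_enabled(config_text):
    """Return True if the config text sets CONFIG_VMCORE_INFO=y."""
    pattern = r"^CONFIG_VMCORE_INFO=y$"
    return re.search(pattern, config_text, re.MULTILINE) is not None


# Sample config fragment, made up for this example.
sample = """\
CONFIG_KEXEC_CORE=y
CONFIG_VMCORE_INFO=y
# CONFIG_FOO is not set
"""
print(vmcore_info_enabled(sample))
```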