If the privileges given to root threads (3% of allowable memory) or a
negative value of /proc/pid/oom_score_adj happen to exceed the amount of
rss of a thread, its badness score overflows as a result of commit
a7f638f999 ("mm, oom: normalize oom scores to oom_score_adj scale only
for userspace").
Fix this by making the type signed and return 1, meaning the thread is
still eligible for kill, if the value is negative.
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix lots of new kernel-doc warnings in kernel/sched/fair.c:
Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
.. more warnings
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 9e612a008f.
It incorrectly finds VGA connectors where none are attached, apparently
not noticing that nothing replied to the EDID queries, and happily using
the default EDID modes that have nothing to do with actual hardware.
That in turn then causes X to fall down to the lowest common
denominator, which is usually the default 1024x768 mode that is in the
default EDID and pretty much anything supports).
I'd suggest that if not relying on the HDP pin, the code should at least
check whether it gets valid EDID data back, rather than just assume
there's something on the VGA connector.
Cc: Dave Airlie <airlied@linux.ie>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull ext4 bug fixes from Theodore Ts'o:
"This update contains two bug fixes, both destined for the stable tree.
Perhaps the most important is one which fixes ext4 when used with file
systems originally formatted for use with ext3, but then later
converted to take advantage of ext4."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: don't set i_flags in EXT4_IOC_SETFLAGS
ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bg
Pull powerpc fixes from Paul Mackerras:
"Two small fixes for powerpc:
- a fix for a regression since 3.2 that causes 4-second (or longer)
pauses
- a fix for a potential oops when loading kernel modules on 32-bit
embedded systems."
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
powerpc: Fix kernel panic during kernel module load
powerpc/time: Sanity check of decrementer expiration is necessary
Pull UBI/UBIFS fixes from Artem Bityutskiy:
"Fix UBI and UBIFS - they refuse to work without debugfs. This was
broken by the 3.5-rc1 UBI/UBIFS changes when we removed the debugging
Kconfig switches.
Also, correct locking in 'ubi_wl_flush()' - it was extended to support
flushing a specific LEB in 3.5-rc1, and the locking was sub-optimal."
* tag 'upstream-3.5-rc2' of git://git.infradead.org/linux-ubifs:
UBI: correct ubi_wl_flush locking
UBIFS: fix debugfs-less systems support
UBI: fix debugfs-less systems support
This reverts commit 7732a557b1 (and commit
3f50fff4da, which was a follow-up
cleanup).
We're chasing an elusive bug that Dave Jones can apparently reproduce
using his system call fuzzer tool, and that looks like some kind of
locking ordering problem on the directory i_mutex chain. Our i_mutex
locking is rather complex, and depends on the topological ordering of
the directories, which is why we have been very wary of splicing
directory entries around.
Of course, we really don't want to ever see aliased unconnected
directories anyway, so none of this should ever happen, but this revert
aims to basically get us back to a known older state.
Bruce points to some of the previous discussion at
http://marc.info/?i=<20110310105821.GE22723@ZenIV.linux.org.uk>
and in particular a long post from Neil:
http://marc.info/?i=<20110311150749.2fa2be66@notabene.brown>
It should be noted that it's possible that Dave's problems come from
other changes altohgether, including possibly just the fact that Dave
constantly is teachning his fuzzer new tricks. So what appears to be a
new bug could in fact be an old one that just gets newly triggered, but
reverting these patches as "still under heavy discussion" is the right
thing regardless.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 fixes from Ingo Molnar.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix section mismatch warnings on 32-bit
x86/uv: Fix UV2 BAU legacy mode
x86/mm: Only add extra pages count for the first memory range during pre-allocation early page table space
x86, efi stub: Add .reloc section back into image
x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs
x86/reboot: Fix a warning message triggered by stop_other_cpus()
x86/intel/moorestown: Change intel_scu_devices_create() to __devinit
x86/numa: Set numa_nodes_parsed at acpi_numa_memory_affinity_init()
x86/gart: Fix kmemleak warning
x86: mce: Add the dropped timer interval init back
x86/mce: Fix the MCE poll timer logic
Pull perf fixes from Ingo Molnar:
"A bit larger than what I'd wish for - half of it is due to hw driver
updates to Intel Ivy-Bridge which info got recently released,
cycles:pp should work there now too, amongst other things. (but we
are generally making exceptions for hardware enablement of this type.)
There are also callchain fixes in it - responding to mostly
theoretical (but valid) concerns. The tooling side sports perf.data
endianness/portability fixes which did not make it for the merge
window - and various other fixes as well."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86: Check user address explicitly in copy_from_user_nmi()
perf/x86: Check if user fp is valid
perf: Limit callchains to 127
perf/x86: Allow multiple stacks
perf/x86: Update SNB PEBS constraints
perf/x86: Enable/Add IvyBridge hardware support
perf/x86: Implement cycles:p for SNB/IVB
perf/x86: Fix Intel shared extra MSR allocation
x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
perf: Remove duplicate invocation on perf_event_for_each
perf uprobes: Remove unnecessary check before strlist__delete
perf symbols: Check for valid dso before creating map
perf evsel: Fix 32 bit values endianity swap for sample_id_all header
perf session: Handle endianity swap on sample_id_all header data
perf symbols: Handle different endians properly during symbol load
perf evlist: Pass third argument to ioctl explicitly
perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
perf tools: Make --version show kernel version instead of pull req tag
perf tools: Check if callchain is corrupted
perf callchain: Make callchain cursors TLS
...
Pull drm intel and exynos fixes from Dave Airlie:
"A bunch of fixes for Intel and exynos, nothing too major, a new intel
PCI ID, and a fix for CRT detection."
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/i915: pch_irq_handler -> {ibx, cpt}_irq_handler
char/agp: add another Ironlake host bridge
drm/i915: fix up ivb plane 3 pageflips
drm/exynos: fixed blending for hdmi graphic layer
drm/exynos: Remove dummy encoder get_crtc operation implementation
drm/exynos: Keep a reference to frame buffer GEM objects
drm/exynos: Don't cast GEM object to Exynos GEM object when not needed
drm/exynos: DRIVER_BUS_PLATFORM is not a driver feature
drm/exynos: fixed size type.
drm/exynos: Use DRM_FORMAT_{NV12, YUV420} instead of DRM_FORMAT_{NV12M, YUV420M}
drm/i915: hold forcewake around ring hw init
drm/i915: Mark the ringbuffers as being in the GTT domain
drm/i915/crt: Do not rely upon the HPD presence pin
drm/i915: Reset last_retired_head when resetting ring
Pull leap second timer fix from Thomas Gleixner.
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
Pull minor module param fixes from Rusty Russell:
"One bugfix for multiple moduleparam levels, one removal of overzealous
printk."
* tag 'moduleparam-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
init: Drop initcall level output
module_param: stop double-calling parameters.
It was reported that compiling for 32-bit caused a bunch of
section mismatch warnings:
VDSOSYM arch/x86/vdso/vdso32-syms.lds
LD arch/x86/vdso/built-in.o
LD arch/x86/built-in.o
WARNING: arch/x86/built-in.o(.data+0x5af0): Section mismatch in
reference from the variable test_nmi_ipi_callback_na.10451 to
the function .init.text:test_nmi_ipi_callback() [...]
WARNING: arch/x86/built-in.o(.data+0x5b04): Section mismatch in
reference from the variable nmi_unk_cb_na.10399 to the function
.init.text:nmi_unk_cb() The variable nmi_unk_cb_na.10399
references the function __init nmi_unk_cb() [...]
Both of these are attributed to the internal representation of
the nmiaction struct created during register_nmi_handler. The
reason for this is that those structs are not defined in the
init section whereas the rest of the code in nmi_selftest.c is.
To resolve this, I created a new #define,
register_nmi_handler_initonly, that tags the struct as
__initdata to resolve the mismatch. This #define should only be
used in rare situations where the register/unregister is called
during init of the kernel.
Big thanks to Jan Beulich for decoding this for me as I didn't
have a clue what was going on.
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Tested-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Cc: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1338991542-23000-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This fixes a problem which can causes kernel oopses while loading
a kernel module.
According to the PowerPC EABI specification, GPR r11 is assigned
the dedicated function to point to the previous stack frame.
In the powerpc-specific kernel module loader, do_plt_call()
(in arch/powerpc/kernel/module_32.c), GPR r11 is also used
to generate trampoline code.
This combination crashes the kernel, in the case where the compiler
chooses to use a helper function for saving GPRs on entry, and the
module loader has placed the .init.text section far away from the
.text section, meaning that it has to generate a trampoline for
functions in the .init.text section to call the GPR save helper.
Because the trampoline trashes r11, references to the stack frame
using r11 can cause an oops.
The fix just uses GPR r12 instead of GPR r11 for generating the
trampoline code. According to the statements from Freescale, this is
safe from an EABI perspective.
I've tested the fix for kernel 2.6.33 on MPC8541.
Cc: stable@vger.kernel.org
Signed-off-by: Steffen Rumler <steffen.rumler.ext@nsn.com>
[paulus@samba.org: reworded the description]
Signed-off-by: Paul Mackerras <paulus@samba.org>
The SGI Altix UV2 BAU (Broadcast Assist Unit) as used for
tlb-shootdown (selective broadcast mode) always uses UV2
broadcast descriptor format. There is no need to clear the
'legacy' (UV1) mode, because the hardware always uses UV2 mode
for selective broadcast.
But the BIOS uses general broadcast and legacy mode, and the
hardware pays attention to the legacy mode bit for general
broadcast. So the kernel must not clear that mode bit.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/E1SccoO-0002Lh-Cb@eag09.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* 'exynos-drm-fixes' of git://git.infradead.org/users/kmpark/linux-samsung:
drm/exynos: fixed blending for hdmi graphic layer
drm/exynos: Remove dummy encoder get_crtc operation implementation
drm/exynos: Keep a reference to frame buffer GEM objects
drm/exynos: Don't cast GEM object to Exynos GEM object when not needed
drm/exynos: DRIVER_BUS_PLATFORM is not a driver feature
drm/exynos: fixed size type.
drm/exynos: Use DRM_FORMAT_{NV12, YUV420} instead of DRM_FORMAT_{NV12M, YUV420M}
* 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel:
drm/i915: pch_irq_handler -> {ibx, cpt}_irq_handler
char/agp: add another Ironlake host bridge
drm/i915: fix up ivb plane 3 pageflips
drm/i915: hold forcewake around ring hw init
drm/i915: Mark the ringbuffers as being in the GTT domain
drm/i915/crt: Do not rely upon the HPD presence pin
drm/i915: Reset last_retired_head when resetting ring
9fb48c744b ("params: add 3rd arg to option handler callback
signature") added similar lines to dmesg:
initlevel:0=early, 4 registered initcalls
initlevel:1=core, 31 registered initcalls
initlevel:2=postcore, 11 registered initcalls
initlevel:3=arch, 7 registered initcalls
initlevel:4=subsys, 40 registered initcalls
initlevel:5=fs, 30 registered initcalls
initlevel:6=device, 250 registered initcalls
initlevel:7=late, 35 registered initcalls
but they don't contain any info for the general user staring at dmesg.
I'm very doubtful the count of initcalls registered per level helps
anyone so drop that output completely.
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Commit 026cee0086 "params:
<level>_initcall-like kernel parameters" set old-style module
parameters to level 0. And we call those level 0 calls where we used
to, early in start_kernel().
We also loop through the initcall levels and call the levelled
module_params before the corresponding initcall. Unfortunately level
0 is early_init(), so we call the standard module_param calls twice.
(Turns out most things don't care, but at least ubi.mtd does).
Change the level to -1 for standard module_param calls.
Reported-by: Benoît Thébaudeau <benoit.thebaudeau@advansee.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
This reverts 68568add2c ("powerpc/time: Remove unnecessary sanity check
of decrementer expiration"). We do need to check whether we have reached
the expiration time of the next event, because we sometimes get an early
decrementer interrupt, most notably when we set the decrementer to 1 in
arch_irq_work_raise(). The effect of not having the sanity check is that
if timer_interrupt() gets called early, we leave the decrementer set to
its maximum value, which means we then don't get any more decrementer
interrupts for about 4 seconds (or longer, depending on timebase
frequency). I saw these pauses as a consequence of getting a stray
hypervisor decrementer interrupt left over from exiting a KVM guest.
This isn't quite a straight revert because of changes to the surrounding
code, but it restores the same algorithm as was previously used.
Cc: stable@vger.kernel.org
Acked-by: Anton Blanchard <anton@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This reverts commit 40af1bbdca.
It's horribly and utterly broken for at least the following reasons:
- calling sync_mm_rss() from mmput() is fundamentally wrong, because
there's absolutely no reason to believe that the task that does the
mmput() always does it on its own VM. Example: fork, ptrace, /proc -
you name it.
- calling it *after* having done mmdrop() on it is doubly insane, since
the mm struct may well be gone now.
- testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.
.. and those are just the three bugs I found before I decided to give up
looking for me and revert it asap. I should have caught it before I
even took it, but I trusted Andrew too much.
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 7990696 uses the ext4_{set,clear}_inode_flags() functions to
change the i_flags automatically but fails to remove the error setting
of i_flags. So we still have the problem of trashing state flags.
Fix this by removing the assignment.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
Ext3 filesystems that are converted to use as many ext4 file system
features as possible will enable uninit_bg to speed up e2fsck times.
These file systems will have a native ext3 layout of inode tables and
block allocation bitmaps (as opposed to ext4's flex_bg layout).
Unfortunately, in these cases, when first allocating a block in an
uninitialized block group, ext4 would incorrectly calculate the number
of free blocks in that block group, and then errorneously report that
the file system was corrupt:
EXT4-fs error (device vdd): ext4_mb_generate_buddy:741: group 30, 32254 clusters in bitmap, 32258 in gd
This problem can be reproduced via:
mke2fs -q -t ext4 -O ^flex_bg /dev/vdd 5g
mount -t ext4 /dev/vdd /mnt
fallocate -l 4600m /mnt/test
The problem was caused by a bone headed mistake in the check to see if a
particular metadata block was part of the block group.
Many thanks to Kees Cook for finding and bisecting the buggy commit
which introduced this bug (commit fd034a84e1, present since v3.2).
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: stable@kernel.org
Merge random fixes from Andrew Morton.
* emailed from Andrew Morton <akpm@linux-foundation.org>: (11 patches)
mm: correctly synchronize rss-counters at exit/exec
btree: catch NULL value before it does harm
btree: fix tree corruption in btree_get_prev()
ipc: shm: restore MADV_REMOVE functionality on shared memory segments
drivers/platform/x86/acerhdf.c: correct Boris' mail address
c/r: prctl: drop VMA flags test on PR_SET_MM_ stack data assignment
c/r: prctl: add ability to get clear_tid_address
c/r: prctl: add minimal address test to PR_SET_MM
c/r: prctl: update prctl_set_mm_exe_file() after mm->num_exe_file_vmas removal
MAINTAINERS: whitespace fixes
shmem: replace_page must flush_dcache and others
mm->rss_stat counters have per-task delta: task->rss_stat. Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().
do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff. Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid. So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier. After mm_release() there should
be no pagefaults.
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> [3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Storing NULL values in the btree is illegal and can lead to memory
corruption and possible other fun as well. Catch it on insert, instead
of waiting for the inevitable.
Signed-off-by: Joern Engel <joern@logfs.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The memory the parameter __key points to is used as an iterator in
btree_get_prev(), so if we save off a bkey() pointer in retry_key and
then assign that to __key, we'll end up corrupting the btree internals
when we do eg
longcpy(__key, bkey(geo, node, i), geo->keylen);
to return the key value. What we should do instead is use longcpy() to
copy the key value that retry_key points to __key.
This can cause a btree to get corrupted by seemingly read-only
operations such as btree_for_each_safe.
[akpm@linux-foundation.org: avoid the double longcpy()]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Acked-by: Joern Engel <joern@logfs.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 17cf28afea ("mm/fs: remove truncate_range") removed the
truncate_range inode operation in favour of the fallocate file
operation.
When using SYSV IPC shared memory segments, calling madvise with the
MADV_REMOVE advice on an area of shared memory will attempt to invoke
the .fallocate function for the shm_file_operations, which is NULL and
therefore returns -EOPNOTSUPP to userspace. The previous behaviour
would inherit the inode_operations from the underlying tmpfs file and
invoke truncate_range there.
This patch restores the previous behaviour by wrapping the underlying
fallocate function in shm_fallocate, as we do for fsync.
[hughd@google.com: use -ENOTSUPP in shm_fallocate()]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") the stack allocated via clone() is marked in
/proc/<pid>/maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).
So to be able to restore mm->start_stack/end_stack drop vma flags test,
but still require the underlying VMA to exist.
As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zero is written at clear_tid_address when the process exits. This
functionality is used by pthread_join().
We already have sys_set_tid_address() to change this address for the
current task but there is no way to obtain it from user space.
Without the ability to find this address and dump it we can't restore
pthread'ed apps which call pthread_join() once they have been restored.
This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
the current process to obtain own clear_tid_address.
This feature is available iif CONFIG_CHECKPOINT_RESTORE is set.
[akpm@linux-foundation.org: fix prctl numbering]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit bde05d1ccd ("shmem: replace page if mapping excludes its zone")
is not at all likely to break for anyone, but it was an earlier version
from before review feedback was incorporated. Fix that up now.
* shmem_replace_page must flush_dcache_page after copy_highpage [akpm]
* Expand comment on why shmem_unuse_inode needs page_swapcount [akpm]
* Remove excess of VM_BUG_ONs from shmem_replace_page [wangcong]
* Check page_private matches swap before calling shmem_replace_page [hughd]
* shmem_replace_page allow for unexpected race in radix_tree lookup [hughd]
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Stephane Marchesin <marcheu@chromium.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Rob Clark <rob.clark@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some UEFI firmware will not load a .efi with a .reloc section
with a size of 0.
Therefore, we create a .efi image with 4 main areas and 3 sections.
1. PE/COFF file header
2. .setup section (covers all setup code following the first sector)
3. .reloc section (contains 1 dummy reloc entry, created in build.c)
4. .text section (covers the remaining kernel image)
To make room for the new .setup section data, the header
bugger_off_msg had to be shortened.
Reported-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Jordan Justen <jordan.l.justen@intel.com>
Link: http://lkml.kernel.org/r/1339085121-12760-1-git-send-email-jordan.l.justen@intel.com
Tested-by: Lee G Rosenbaum <lee.g.rosenbaum@intel.com>
Tested-by: Henrik Rydberg <rydberg@euromail.se>
Cc: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Pull PARISC fixes from James Bottomley:
"This is a set of three bug fixes for minor build breakages that got
introduced just before 3.5-rc1 was released."
* tag 'parisc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6:
[PARISC] fix code to find libgcc
[PARISC] fix compile break in use of lib/strncopy_from_user.c
[PARISC] fix missing TAINT_WARN problem
Pull tile fixes from Chris Metcalf:
"These two minor bug fixes fix build failures from some changes that
were merged in during the 3.5 merge window."
* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
tile: add #include to unbreak build after generic init_task conversion
tile: remove cpu_idle_on_new_stack
Commit "62f38455 UBI: modify ubi_wl_flush function to clear work queue for a lnum"
takes the 'work_sem' semaphore in write mode for the entire loop, which is not
very good because it will block other workers for potentially long time. We do
not need to have it in write mode - read mode is enough, and we do not need to
hole it over the entire loop. So this patch turns changes the locking: takes
'work_sem' in read mode and pushes it down to the loop.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Commit "f70b7e5 UBIFS: remove Kconfig debugging option" broke UBIFS and it
refuses to initialize if debugfs (CONFIG_DEBUG_FS) is disabled. I incorrectly
assumed that debugfs files creation function will return success if debugfs
is disabled, but they actually return -ENODEV. This patch fixes the issue.
Reported-by: Paul Parsons <lost.distance@yahoo.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Paul Parsons <lost.distance@yahoo.com>
Commit "aa44d1d UBI: remove Kconfig debugging option" broke UBI and it
refuses to initialize if debugfs (CONFIG_DEBUG_FS) is disabled. I incorrectly
assumed that debugfs files creation function will return success if debugfs
is disabled, but they actually return -ENODEV. This patch fixes the issue.
Reported-by: Paul Parsons <lost.distance@yahoo.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Tested-by: Paul Parsons <lost.distance@yahoo.com>
Cougar/Panther Point redefine the bits in SDEIIR pretty completely.
This function is just debugging, but if we're debugging we probably want
to be told accurate things instead of lies.
I'm told Lynx Point changes this yet more, but I have no idea how...
Note from Eugeni's review:
"For the record and for future enabling efforts, for LPT, bits 28-31
and 1-14 are gone since CPT/PPT (e.g., those must be zero). And there
is the bit 15 as a new addition, but we are not using it yet and
probably won't be using in foreseeable future."
Signed-off-by: Adam Jackson <ajax@redhat.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=35103
Reviewed-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Pull hwmon fix from Guenter Roeck:
"Update e-mail address in MAINTAINERS"
* tag 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
MAINTAINERS: Update my e-mail address
Pull ACPI and Power Management changes from Len Brown.
This does an evil merge to fix up what I think is a mismerge by Len to
the gma500 driver, and restore it to the mainline state.
In that driver, both branches had commented out the call to
acpi_video_register(), and Len resolved the merge to that commented-out
version.
However, in mainline, further changes by Alan (commit d839ede47a:
"gma500: opregion and ACPI" to be exact) had re-enabled the ACPI video
registration, so the current state of the driver seems to want it.
Alan is apparently still feeling the effects of partying with the Queen,
so he didn't reply to my query, but I'll do the evil merge anyway.
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
ACPI: fix acpi_bus.h build warnings when ACPI is not enabled
drivers: acpi: Fix dependency for ACPI_HOTPLUG_CPU
tools/power turbostat: fix IVB support
tools/power turbostat: fix un-intended affinity of forked program
ACPI video: use after input_unregister_device()
gma500: don't register the ACPI video bus
acpi_video: Intel video is not always i915
acpi_video: fix leaking PCI references
ACPI: Ignore invalid _PSS entries, but use valid ones
ACPI battery: only refresh the sysfs files when pertinent information changes
Pull InfiniBand/RDMA fixes from Roland Dreier:
"All in hardware drivers:
- Fix crash in cxgb4
- Fixes to new ocrdma driver
- Regression fixes for mlx4"
* tag 'rdma-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/mlx4: Fix max_wqe capacity reported from query device
mlx4_core: Fix setting VL_cap in mlx4_SET_PORT wrapper flow
IB/mlx4: Fix EQ deallocation in legacy mode
RDMA/cxgb4: Fix crash when peer address is 0.0.0.0
RDMA/ocrdma: Remove unnecessary version.h includes
RDMA/ocrdma: Fix signaled event for SRQ_LIMIT_REACHED
RDMA/ocrdma: Correct queue free count math
1. Limit the max number of WQEs per QP reported when querying the
device, so that ib_create_qp() will not fail for a QP size that the
device claimed to support due to additional headroom WQEs being
allocated.
2. Limit qp resources accepted for ib_create_qp() to the limits
reported in ib_query_device(). In kernel space, make sure that the
limits returned to the caller following qp creation also lie within
the reported device limits. For userspace, report as before, and do
adjustment in libmlx4 (so as not to break ABI).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Sagi Grimberg <sagig@mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Commit 096335b3f9 ("mlx4_core: Allow dynamic MTU configuration for
IB ports") modifies the port VL setting. This exposes a bug in
mlx4_common_set_port(), where the VL cap value passed in (inside the
command mailbox) is incorrectly zeroed-out:
mlx4_SET_PORT modifies the VL_cap field (byte 3 of the mailbox).
Since the SET_PORT command is paravirtualized on the master as well as
on the slaves, mlx4_SET_PORT_wrapper() is invoked on the master. This
calls mlx4_common_set_port() where mailbox byte 3 gets overwritten by
code which should only set a single bit in that byte (for the reset
qkey counter flag) -- but instead overwrites the entire byte.
The result is that when running in SR-IOV mode, the VL_cap will be set
to zero -- fix this.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Pull two md fixes from NeilBrown:
"One sparse-warning fix, one bugfix for 3.4-stable"
* tag 'md-3.5-fixes' of git://neil.brown.name/md:
md: raid1/raid10: fix problem with merge_bvec_fn
lib/raid6: fix sparse warnings in recovery functions
Pull IOMMU fixes from Joerg Roedel:
"Two patches are in here which fix AMD IOMMU specific issues. One
patch fixes a long-standing warning on resume because the
amd_iommu_resume function enabled interrupts. The other patch fixes a
deadlock in an error-path of the page-fault request handling code of
the IOMMU driver.
* tag 'iommu-fixes-3.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Fix deadlock in ppr-handling error path
iommu/amd: Cache pdev pointer to root-bridge
Some code was moved from init_task.c to setup.c but the appropriate
header needed to be moved as well.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
This routine isn't used unless CONFIG_HOMECACHE is enabled, which
isn't even available as a public configuration option yet.
Since it no longer links correctly in 3.4, just remove it for now.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
It does not get processed because sched_domain_level_max is 0 at the
time that setup_relax_domain_level() is run.
Simply accept the value as it is, as we don't know the value of
sched_domain_level_max until sched domain construction is completed.
Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls
the set_domain_attribute() routine prior to setting the sd->level, however,
the set_domain_attribute() routine relies on the sd->level to decide whether
idle load balancing will be off/on.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Zheng Yan reported that event group validation can wreck event state
when Intel extra_reg allocation changes event state.
Validation shouldn't change any persistent state. Cloning events in
validate_{event,group}() isn't really pretty either, so add a few
special cases to avoid modifying the event state.
The code is restructured to minimize the special case impact.
Reported-by: Zheng Yan <zheng.z.yan@linux.intel.com>
Acked-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338903031.28282.175.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Often when we run into mis-shapen topologies the balance iteration
fails to update the cpu power properly and we'll end up in /0 traps.
Always initialize the cpu-power to a semi-sane value so that we can
at least boot the machine, even if the load-balancer might not
function correctly.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Weird topologies can lead to asymmetric domain setups. This needs
further consideration since these setups are typically non-minimal
too.
For now, make it work by adding an extra mask selecting which CPUs
are allowed to iterate up.
The topology that triggered it is the one from David Rientjes:
10 20 20 30
20 10 20 20
20 20 10 20
30 20 20 10
resulting in boxes that wouldn't even boot.
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Roland Dreier reported spurious, hard to trigger lockdep warnings
within the scheduler - without any real lockup.
This bit gives us the right clue:
> [89945.640512] [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
> [89945.640568] [<ffffffff8104c546>] push_rt_task+0xc6/0x290
if you look at that code you'll find the double_lock_balance() in
question is the one in find_lock_lowest_rq() [yay for inlining].
Now find_lock_lowest_rq() has a bug.. it fails to use
double_unlock_balance() in one exit path, if this results in a retry in
push_rt_task() we'll call double_lock_balance() again, at which point
we'll run into said lockdep confusion.
Reported-by: Roland Dreier <roland@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched
domain support") removed the NODE sched domain and started checking
if the node distance in SLIT table is farther than REMOTE_DISTANCE,
if so, it will lose the load balance chance at exec/fork/wake_affine
points.
But actually, even the node distance is farther than REMOTE_DISTANCE.
Modern CPUs also has QPI like connections, which ensures that memory
access is not too slow between nodes. So the above change in behavior
on NUMA machine causes a performance regression on various benchmarks:
hackbench, tbench, netperf, oltp, etc.
This patch will recover the scheduler behavior to old mode on all my
Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the
perfromance regressions. (all of them just have 2 kinds distance, 10, 21)
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 316ad24830 ("sched/x86: Rewrite set_cpu_sibling_map()")
broke the booted_cores accounting.
The problem is that the booted_cores accounting needs all the
sibling links set up. So restore the second loop and add a comment as
to why its needed.
On qemu booted with -smp sockets=1,cores=2,threads=2;
Before:
$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 1
cpu cores : 4
cpu cores : 3
With the patch:
$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 2
cpu cores : 2
cpu cores : 2
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120531073738.GH7511@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In current Linux, percpu variable `vector_irq' is not cleared on
offlined cpus while disabling devices' irqs. If the cpu that has
the disabled irqs in vector_irq is hotplugged,
__setup_vector_irq() hits invalid irq vector and may crash.
This bug can be reproduced as following;
# echo 0 > /sys/devices/system/cpu/cpu7/online
# modprobe -r some_driver_using_interrupts # vector_irq@cpu7 uncleared
# echo 1 > /sys/devices/system/cpu/cpu7/online # kernel may crash
This patch fixes this bug by clearing vector_irq in
__clear_irq_vector() even if the cpu is offlined.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: ltc-kernel@ml.yrl.intra.hitachi.co.jp
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/4FC340BE.7080101@hitachi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When rebooting our 24 CPU Westmere servers with 3.4-rc6, we
always see this warning msg:
Restarting system.
machine restart
------------[ cut here ]------------
WARNING: at arch/x86/kernel/smp.c:125
native_smp_send_reschedule+0x74/0xa7() Hardware name: X8DTN
Modules linked in: igb [last unloaded: scsi_wait_scan]
Pid: 1, comm: systemd-shutdow Not tainted 3.4.0-rc6+ #22
Call Trace:
<IRQ> [<ffffffff8102a41f>] warn_slowpath_common+0x7e/0x96
[<ffffffff8102a44c>] warn_slowpath_null+0x15/0x17
[<ffffffff81018cf7>] native_smp_send_reschedule+0x74/0xa7
[<ffffffff810561c1>] trigger_load_balance+0x279/0x2a6
[<ffffffff81050112>] scheduler_tick+0xe0/0xe9
[<ffffffff81036768>] update_process_times+0x60/0x70
[<ffffffff81062f2f>] tick_sched_timer+0x68/0x92
[<ffffffff81046e33>] __run_hrtimer+0xb3/0x13c
[<ffffffff81062ec7>] ? tick_nohz_handler+0xd0/0xd0
[<ffffffff810474f2>] hrtimer_interrupt+0xdb/0x198
[<ffffffff81019a35>] smp_apic_timer_interrupt+0x81/0x94
[<ffffffff81655187>] apic_timer_interrupt+0x67/0x70
<EOI> [<ffffffff8101a3c4>] ? default_send_IPI_mask_allbutself_phys+0xb4/0xc4
[<ffffffff8101c680>] physflat_send_IPI_allbutself+0x12/0x14
[<ffffffff81018db4>] native_nmi_stop_other_cpus+0x8a/0xd6
[<ffffffff810188ba>] native_machine_shutdown+0x50/0x67
[<ffffffff81018926>] machine_shutdown+0xa/0xc
[<ffffffff8101897e>] native_machine_restart+0x20/0x32
[<ffffffff810189b0>] machine_restart+0xa/0xc
[<ffffffff8103b196>] kernel_restart+0x47/0x4c
[<ffffffff8103b2e6>] sys_reboot+0x13e/0x17c
[<ffffffff8164e436>] ? _raw_spin_unlock_bh+0x10/0x12
[<ffffffff810fcac9>] ? bdi_queue_work+0xcf/0xd8
[<ffffffff810fe82f>] ? __bdi_start_writeback+0xae/0xb7
[<ffffffff810e0d64>] ? iterate_supers+0xa3/0xb7
[<ffffffff816547a2>] system_call_fastpath+0x16/0x1b
---[ end trace 320af5cb1cb60c5b ]---
The root cause seems to be the
default_send_IPI_mask_allbutself_phys() takes quite some time (I
measured it could be several ms) to complete sending NMIs to all
the other 23 CPUs, and for HZ=250/1000 system, the time is long
enough for a timer interrupt to happen, which will in turn
trigger to kick load balance to a stopped CPU and cause this
warning in native_smp_send_reschedule().
So disabling the local irq before stop_other_cpu() can fix this
problem (tested 25 times reboot ok), and it is fine as there
should be nobody caring the timer interrupt in such reboot
stage.
The latest 3.4 kernel slightly changes this behavior by sending
REBOOT_VECTOR first and only send NMI_VECTOR if the REBOOT_VCTOR
fails, and this patch is still needed to prevent the problem.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120530231541.4c13433a@feng-i7
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When hot-adding a CPU, the system outputs following messages
since node_to_cpumask_map[2] was not allocated memory.
Booting Node 2 Processor 32 APIC 0xc0
node_to_cpumask_map[2] NULL
Pid: 0, comm: swapper/32 Tainted: G A 3.3.5-acd #21
Call Trace:
[<ffffffff81048845>] debug_cpumask_set_cpu+0x155/0x160
[<ffffffff8105e28a>] ? add_timer_on+0xaa/0x120
[<ffffffff8150665f>] numa_add_cpu+0x1e/0x22
[<ffffffff815020bb>] identify_cpu+0x1df/0x1e4
[<ffffffff815020d6>] identify_econdary_cpu+0x16/0x1d
[<ffffffff81504614>] smp_store_cpu_info+0x3c/0x3e
[<ffffffff81505263>] smp_callin+0x139/0x1be
[<ffffffff815052fb>] start_secondary+0x13/0xeb
The reason is that the bit of node 2 was not set at
numa_nodes_parsed. numa_nodes_parsed is set by only
acpi_numa_processor_affinity_init /
acpi_numa_x2apic_affinity_init. Thus even if hot-added memory
which is same PXM as hot-added CPU is written in ACPI SRAT
Table, if the hot-added CPU is not written in ACPI SRAT table,
numa_nodes_parsed is not set.
But according to ACPI Spec Rev 5.0, it says about ACPI SRAT
table as follows: This optional table provides information that
allows OSPM to associate processors and memory ranges, including
ranges of memory provided by hot-added memory devices, with
system localities / proximity domains and clock domains.
It means that ACPI SRAT table only provides information for CPUs
present at boot time and for memory including hot-added memory.
So hot-added memory is written in ACPI SRAT table, but hot-added
CPU is not written in it. Thus numa_nodes_parsed should be set
by not only acpi_numa_processor_affinity_init /
acpi_numa_x2apic_affinity_init but also
acpi_numa_memory_affinity_init for the case.
Additionally, if system has cpuless memory node,
acpi_numa_processor_affinity_init /
acpi_numa_x2apic_affinity_init cannot set numa_nodes_parseds
since these functions cannot find cpu description for the node.
In this case, numa_nodes_parsed needs to be set by
acpi_numa_memory_affinity_init.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: liuj97@gmail.com
Cc: kosaki.motohiro@gmail.com
Link: http://lkml.kernel.org/r/4FCC2098.4030007@jp.fujitsu.com
[ merged it ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix the x86 instruction decoder to decode bsr/bsf/jmpe with
operand-size prefix (66h). This fixes the test case failure
reported by Linus, attached below.
bsf/bsr/jmpe have a special encoding. Opcode map in
Intel Software Developers Manual vol2 says they have
TZCNT/LZCNT variants if it has F3h prefix. However, there
is no information if it has other 66h or F2h prefixes.
Current instruction decoder supposes that those are
bad instructions, but it actually accepts at least
operand-size prefixes.
H. Peter Anvin further explains:
" TZCNT/LZCNT are F3 + BSF/BSR exactly because the F2 and
F3 prefixes have historically been no-ops with most instructions.
This allows software to unconditionally use the prefixed versions
and get TZCNT/LZCNT on the processors that have them if they don't
care about the difference. "
This fixes errors reported by test_get_len:
Warning: arch/x86/tools/test_get_len found difference at <em_bsf>:ffffffff81036d87
Warning: ffffffff81036de5: 66 0f bc c2 bsf %dx,%ax
Warning: objdump says 4 bytes, but insn_get_length() says 3
Warning: arch/x86/tools/test_get_len found difference at <em_bsr>:ffffffff81036ea6
Warning: ffffffff81036f04: 66 0f bd c2 bsr %dx,%ax
Warning: objdump says 4 bytes, but insn_get_length() says 3
Warning: decoded and checked 13298882 instructions with 2 warnings
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <yrl.pp-manager.tt@hitachi.com>
Link: http://lkml.kernel.org/r/20120604150911.22338.43296.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull perf fixes from Arnaldo Carvalho de Melo:
* Endianness fixes from Jiri Olsa
* Fixes for make perf tarball
* Fix for DSO name in perf script callchains, from David Ahern
* Segfault fixes for perf top --callchain, from Namhyung Kim
* Minor function result fixes from Srikar Dronamraju
* Add missing 3rd ioctl parameter, from Namhyung Kim
* Fix pager usage in minimal embedded systems, from Avik Sil
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull MCE regression fix from Tony Luck:
"Typo/thinko in a cleanup caused a semantic change. Fix it."
* tag 'please-pull-mce' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
x86/mce: Fix the MCE poll timer logic
Pull arm CMA fix from Marek Szyprowski:
"This removes the ARMv6+ CMA dependency and lets one use old, well-
tested dma-mapping implementation also on ARMv6+ systems without the
need to use EXPERIMENTAL stuff."
Russell King complained (rightly) about the experimental feature being
forced on by the ARM config.
Here CMA is "continuous memory allocator", not "cross-memory attach".
We really neet to stop using insane TLA's for things that aren't big
industry standards.
* 'fixes-for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
ARM: dma-mapping: remove unconditional dependency on CMA
Or at least plug another gapping hole. Apparrently hw desingers only
moved the bit field, but did not bother ot re-enumerate the planes
when adding support for a 3rd pipe.
Discovered by i-g-t/flip_test.
This may or may not fix the reference bugzilla, because that one
smells like we have still larger fish to fry.
v2: Fixup the impossible case to catch programming errors, noticed by
Chris Wilson.
References: https://bugs.freedesktop.org/show_bug.cgi?id=50069
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Tested-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Eugeni Dodonov <eugeni.dodonov@intel.com>
Cc: stable@vger.kernel.org
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Pull two scsi target fixes from Nicholas Bellinger:
"The first is a small name-space collision fix from Stefan for the new
sbp-target / ieee-1394 code, and second is the FILEIO backend
conversion patch to always use O_DSYNC by default instead of O_SYNC as
recommended by hch. Note the latter is CC'ed stable."
* '3.5-merge-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
target/file: Use O_DSYNC by default for FILEIO backends
sbp-target: rename a variable to avoid name clash
Pull cgroup fix from Tejun Heo:
"This fixes the possible premature superblock release on umount bug
mentioned during v3.5-rc1 pull request.
Originally, cgroup dentry destruction path assumed that cgroup dentry
didn't have any reference left after cgroup removal thus put super
during dentry removal. Now that there can be lingering dentry
references, this led to super being put with live dentries. This
patch fixes the problem by putting super ref on dentry release instead
of removal."
* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: superblock can't be released with active dentries
Pull embedded i2c update from Wolfram Sang:
"This only contains one new driver which had multiple dependencies
(pinctrl, i2c-mux-rework, new devm_* functions), so I decided to wait
for rc1. Plus, it had to wait a little for the ack of a devicetree
maintainer since the bindings were not trivial enough for me to pass
through.
So, given that, I hope there is still something like the "new driver
rule", so we could have the driver in 3.5 and people can start using
it. That would make merging support for some boards easier for 3.6
since the dependency on this driver is gone then."
* 'i2c-embedded/for-current' of git://git.pengutronix.de/git/wsa/linux:
i2c: Add generic I2C multiplexer using pinctrl API
Pull kvm fix from Avi Kivity:
"A one-liner fix for a buffer overflow"
* git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Fix buffer overflow in kvm_set_irq()
Pull drm radeon fixes from Dave Airlie:
"This is just radeon fixes and a bunch of new PCI ids. The fixes are
for a deadlock, an audio regression, and a couple of audio fixes."
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon/kms: add new SI PCI ids
drm/radeon/kms: add new BTC PCI ids
drm/radeon/kms: add new Palm, Sumo PCI ids
drm/radeon/kms: add new Trinity PCI ids
drm/radeon: fix vm deadlocks on cayman
drm/radeon: fix gpu_init on si
drm/radeon/hdmi: don't set SEND_MAX_PACKETS bit
drm/radeon/audio: don't hardcode CRTC id
drm/radeon: make audio_init consistent across asics
This patch fixes bug in macro radix_tree_for_each_contig().
If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following
radix_tree_next_chunk() switches iterating into next chunk. As result iterating
becomes non-contiguous and breaks vfs "splice" and all its users.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-and-bisected-by: Hans de Bruin <jmdebruin@xmsnet.nl>
Reported-and-bisected-by: Ondrej Zary <linux@rainbow-software.org>
Reported-bisected-and-tested-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lkml.org/lkml/2012/6/5/64
Cc: stable <stable@vger.kernel.org> # 3.4.x
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Remove NULL assignment of dattr_cur
sched: Remove the last NULL entry from sched_feat_names
sched: Make sched_feat_names const
sched/rt: Fix SCHED_RR across cgroups
sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
sched: Make sure to not re-read variables after validation
sched: Fix SD_OVERLAP
sched: Don't try allocating memory from offline nodes
sched/nohz: Fix rq->cpu_load calculations some more
sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask
kvm_set_irq() has an internal buffer of three irq routing entries, allowing
connecting a GSI to three IRQ chips or on MSI. However setup_routing_entry()
does not properly enforce this, allowing three irqchip routes followed by
an MSI route to overflow the buffer.
Fix by ensuring that an MSI entry is added to an empty list.
Signed-off-by: Avi Kivity <avi@redhat.com>
- Properly set up the RBs
- Properly set up the SPI
- Properly set up gb_addr_config
This should fix rendering issues on certain cards.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Many TVs and A/V receivers don't work with this bit set. Problem was
confirmed using: Onkyo TX-SR605, Sony BRAVIA KDL-52X3500, Sony BRAVIA
KDL-40S40xx. In theory this bit shouldn't affect audio engine when
feeding it with data, however it seems it does. Driver fglrx doesn't set
that bit in any of the above cases.
This fixes a regression introduced by 3.5-rc1.
Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Call it in the asic startup callback on all asics.
Previously r600 and rv770 called it in the startup
and resume callbacks while all the other asics called
it in the startup callback.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Sam broke this with
commit 1f2bfbd00e
Author: Sam Ravnborg <sam@ravnborg.org>
Date: Sat May 5 10:18:41 2012 +0200
kbuild: link of vmlinux moved to a script
But we should be deriving the location of libgcc in the same way as all
the other archs, so fix by adding a LIBGCC variable which is evaluated
in the makefile
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Linus broke us with
commit 36126f8f2e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat May 26 10:43:17 2012 -0700
word-at-a-time: make the interfaces truly generic
By moving functions defined in strncopy_from_user.c into the asm-geneic
version word-at-a-time.h. Spark and OpenRisc were fixed to use this, but
not parisc. Fix by adding to generic-y in asm/Kbuild
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Al viro broke us with
commit edd63a2763
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Fri Apr 27 13:42:45 2012 -0400
set_restore_sigmask() is never called without SIGPENDING (and never should be)
Although it's pretty much our fault since parisc's asm/bug.h uses
BUGWARN_TAINT but doesn't include the file that defines it. Fix that.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Blending for graphic layer 0 of hdmi mixer was not set so video
layer cannot be showed if graphic layer 0 is enabled.
This patch fixes blending values to support blending between
graphic layer 0 and video layer.
Signed-off-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
The encoder get_crtc operation is called to retrieve a pointer to the
CRTC the encoder is currenctly connected to, right after setting the
encoder::crtc field to the new CRTC. The implementation of this
operation returns the pointer to the new CRTC, which is then pointlessly
compared to itself.
As the operation is not mandatory, don't implement it.
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
GEM objects used by frame buffers must be referenced for the whole life
of the frame buffer. Release the references in the frame buffer
destructor instead of its constructor.
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
size type of drm_exynos_gem_mmap struct is changed to uint64_t and
it adds pad for the struct to be aligned as 64bit.
Signed-off-by: Inki Dae <inki.dae@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
The NV12M/YUV420M formats are identical to the already existing standard
NV12/YUV420 formats. The M variants will be removed, so convert the
driver to use the standard names.
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Inki Dae <inki.dae@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Pull signal and vfs compile breakage fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
fixups for signal breakage
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nommu: fix compilation of nommu.c
Pull cifs fixes from Steve French.
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Move get_next_mid to ops struct
CIFS: Make accessing is_valid_oplock/dump_detail ops struct field safe
CIFS: Improve identation in cifs_unlock_range
CIFS: Fix possible wrong memory allocation
Compiling 3.5-rc1 for nommu targets gives:
CC mm/nommu.o
mm/nommu.c: In function ‘sys_mmap_pgoff’:
mm/nommu.c:1489:2: error: ‘ret’ undeclared (first use in this function)
mm/nommu.c:1489:2: note: each undeclared identifier is reported only once for each function it appears in
It is trivially fixed by replacing 'ret' with the local variable that is
already defined for the return value 'retval'.
Signed-off-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull frontswap feature from Konrad Rzeszutek Wilk:
"Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained
because swapped pages are saved in RAM (or a RAM-like device) instead
of a swap disk. This tag provides the basic infrastructure along with
some changes to the existing backends."
Fix up trivial conflict in mm/Makefile due to removal of swap token code
changing a line next to the new frontswap entry.
This pull request came in before the merge window even opened, it got
delayed to after the merge window by me just wanting to make sure it had
actual users. Apparently IBM is using this on their embedded side, and
Jan Beulich says that it's already made available for SLES and OpenSUSE
users.
Also acked by Rik van Riel, and Konrad points to other people liking it
too. So in it goes.
By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
via Konrad Rzeszutek Wilk
* tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
frontswap: s/put_page/store/g s/get_page/load
MAINTAINER: Add myself for the frontswap API
mm: frontswap: config and doc files
mm: frontswap: core frontswap functionality
mm: frontswap: core swap subsystem hooks and headers
mm: frontswap: add frontswap header file
Pull irq and smpboot updates from Thomas Gleixner:
"Just cleanup patches with no functional change and a fix for suspend
issues."
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Introduce irq_do_set_affinity() to reduce duplicated code
genirq: Add IRQS_PENDING for nested and simple irq
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smpboot, idle: Fix comment mismatch over idle_threads_init()
smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()
Pull timer updates from Thomas Gleixner:
"The clocksource driver is pure hardware enablement and the skew option
is default off, well tested and non dangerous."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Move skew_tick option into the HIGH_RES_TIMER section
clocksource: em_sti: Add DT support
clocksource: em_sti: Emma Mobile STI driver
clockevents: Make clockevents_config() a global symbol
tick: Add tick skew boot option
Empirical evidence suggests that we need to: On at least one ivb
machine when running the hangman i-g-t test, the rings don't properly
initialize properly - the RING_START registers seems to be stuck at
all zeros.
Holding forcewake around this register init sequences makes chip reset
reliable again. Note that this is not the first such issue:
commit f01db988ef
Author: Sean Paul <seanpaul@chromium.org>
Date: Fri Mar 16 12:43:22 2012 -0400
drm/i915: Add wait_for in init_ring_common
added delay loops to make RING_START and RING_CTL initialization
reliable on the blt ring at boot-up. So I guess it won't hurt if we do
this unconditionally for all force_wake needing gpus.
To avoid copy&pasting of the HAS_FORCE_WAKE check I've added a new
intel_info bit for that.
v2: Fixup missing commas in static struct and properly handling the
error case in init_ring_common, both noticed by Jani Nikula.
Cc: stable@vger.kernel.org
Reported-and-tested-by: Yang Guang <guang.a.yang@intel.com>
Reviewed-by: Eugeni Dodonov <eugeni.dodonov@intel.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=50522
Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
By correctly describing the rinbuffers as being in the GTT domain, it
appears that we are more careful with the management of the CPU cache
upon resume and so prevent some coherency issue when submitting commands
to the GPU later. A secondary effect is that the debug logs are then
consistent with the actual usage (i.e. they no longer describe the
ringbuffers as being in the CPU write domain when we are accessing them
through an wc iomapping.)
Reported-and-tested-by: Daniel Gnoutcheff <daniel@gnoutcheff.name>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=41092
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cyrill Gorcunov reports that I broke the fdinfo files with commit
30a08bf2d3 ("proc: move fd symlink i_mode calculations into
tid_fd_revalidate()"), and he's quite right.
The tid_fd_revalidate() function is not just used for the <tid>/fd
symlinks, it's also used for the <tid>/fdinfo/<fd> files, and the
permission model for those are different.
So do the dynamic symlink permission handling just for symlinks, making
the fdinfo files once more appear as the proper regular files they are.
Of course, Al Viro argued (probably correctly) that we shouldn't do the
symlink permission games at all, and make the symlinks always just be
the normal 'lrwxrwxrwx'. That would have avoided this issue too, but
since somebody noticed that the permissions had changed (which was the
reason for that original commit 30a08bf2d3 in the first place), people
do apparently use this feature.
[ Basically, you can use the symlink permission data as a cheap "fdinfo"
replacement, since you see whether the file is open for reading and/or
writing by just looking at st_mode of the symlink. So the feature
does make sense, even if the pain it has caused means we probably
shouldn't have done it to begin with. ]
Reported-and-tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is useful for SoCs whose I2C module's signals can be routed to
different sets of pins at run-time, using the pinctrl API.
+-----+ +-----+
| dev | | dev |
+------------------------+ +-----+ +-----+
| SoC | | |
| /----|------+--------+
| +---+ +------+ | child bus A, on first set of pins
| |I2C|---|Pinmux| |
| +---+ +------+ | child bus B, on second set of pins
| \----|------+--------+--------+
| | | | |
+------------------------+ +-----+ +-----+ +-----+
| dev | | dev | | dev |
+-----+ +-----+ +-----+
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de>
In the error path of the ppr_notifer it can happen that the
iommu->lock is taken recursivly. This patch fixes the
problem by releasing the iommu->lock before any notifier is
invoked. This also requires to move the erratum workaround
for the ppr-log (interrupt may be faster than data in the log)
one function up.
Cc: stable@vger.kernel.org # v3.3, v3.4
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
At some point pci_get_bus_and_slot started to enable
interrupts. Since this function is used in the
amd_iommu_resume path it will enable interrupts on resume
which causes a warning. The fix will use a cached pointer
to the root-bridge to re-enable the IOMMU in case the BIOS
is broken.
Cc: stable@vger.kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Commit e605b743f3 ("IB/mlx4: Increase the number of vectors (EQs)
available for ULPs") didn't handle correctly the case where there
aren't enough MSI-X vectors to increase the number of EQs, so only the
legacy EQs are allocated. This results in an attempt to memset() to
zero the EQ table which was never allocated and a kernel crash.
Fix this by checking in the teardown flow if the table of EQs was ever
allocated. Also remove some unneeded setting to zero of the EQ
related fields in struct mlx4_ib_dev.
Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
CMA has been enabled unconditionally on all ARMv6+ systems to solve the
long standing issue of double kernel mappings for all dma coherent
buffers. This however created a dependency on CONFIG_EXPERIMENTAL for
the whole ARM architecture what should be really avoided. This patch
removes this dependency and lets one use old, well-tested dma-mapping
implementation also on ARMv6+ systems without the need to use
EXPERIMENTAL stuff.
Reported-by: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
When using rping -c -a 0.0.0.0 with iw_cxgb4, the system crashes when
rdma_connect() is called. ip_dev_find() will return NULL, but pdev is
accessed anyway.
Checking that pdev is NULL and returning -ENODEV prevents the system
from crashing.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
Update bugfix-video branch to 2.5-rc1
so I don't have to again resolve the
conflict in these patches vs. upstream.
Conflicts:
drivers/gpu/drm/gma500/psb_drv.c
text conflict: add comment vs delete neighboring line
keep just this:
/* igd_opregion_init(&dev_priv->opregion_dev); */
/* acpi_video_register(); */
Signed-off-by: Len Brown <len.brown@intel.com>
introduced in Linux-3.5-rc1 by
66886d6f8c
(ACPI: Add stubs for (un)register_acpi_bus_type)
Fix header file warnings when CONFIG_ACPI is not enabled:
include/acpi/acpi_bus.h:443:42: warning: 'struct acpi_bus_type' declared inside parameter list
include/acpi/acpi_bus.h:443:42: warning: its scope is only this definition or declaration, which is probably not
include/acpi/acpi_bus.h:444:44: warning: 'struct acpi_bus_type' declared inside parameter list
Signed-off-by: Len Brown <len.brown@intel.com>
Should be 'exynos5_xxx' instead of 'exonys5_xxx'.
It happened at the commit 30b842889e ("Merge tag 'soc2' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc")
during v3.5 merge window.
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
[ My bad - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following build warning:
warning: (ACPI_HOTPLUG_CPU) selects ACPI_CONTAINER which has unmet direct dependencies (ACPI && EXPERIMENTAL)
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Initial IVB support went into turbostat in Linux-3.1:
553575f1ae
(tools turbostat: recognize and run properly on IVB)
However, when running on IVB, turbostat would fail
to report the new couters added with SNB, c7, pc2 and pc7.
So in scenarios where these counters are non-zero on IVB,
turbostat would report erroneous residencey results.
In particular c7 time would be added to c1 time,
since c1 time is calculated as "that which is left over".
Also, turbostat reports MHz capabilities when passed
the "-v" option, and it would incorrectly report 133MHz
bclk instead of 100MHz bclk for IVB, which would inflate
GHz reported with that option.
This patch is a backport of a fix already included in turbostat v2.
Signed-off-by: Len Brown <len.brown@intel.com>
Linux 3.4 included a modification to turbostat to
lower cross-call overhead by using scheduler affinity:
15aaa34654
(tools turbostat: reduce measurement overhead due to IPIs)
In the use-case where turbostat forks a child program,
that change had the un-intended side-effect of binding
the child to the last cpu in the system.
This change removed the binding before forking the child.
This is a back-port of a fix already included in turbostat v2.
Signed-off-by: Len Brown <len.brown@intel.com>
Pull some left-over PM patches from Rafael J. Wysocki.
* 'pm-acpi' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PM: Make acpi_pm_device_sleep_state() follow the specification
ACPI / PM: Make __acpi_bus_get_power() cover D3cold correctly
ACPI / PM: Fix error messages in drivers/acpi/bus.c
rtc-cmos / PM: report wakeup event on ACPI RTC alarm
ACPI / PM: Generate wakeup events on fixed power button
This reverts commit 5ceb9ce6fe.
That commit seems to be the cause of the mm compation list corruption
issues that Dave Jones reported. The locking (or rather, absense
there-of) is dubious, as is the use of the 'page' variable once it has
been found to be outside the pageblock range.
So revert it for now, we can re-visit this for 3.6. If we even need to:
as Minchan Kim says, "The patch wasn't a bug fix and even test workload
was very theoretical".
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
New tmpfs use of !PageUptodate pages for fallocate() is triggering the
WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
is called from migrate_page_copy() for compaction.
It is anomalous that migration should use __set_page_dirty_nobuffers()
on an address_space that does not participate in dirty and writeback
accounting; and this has also been observed to insert surprising dirty
tags into a tmpfs radix_tree, despite tmpfs not using tags at all.
We should probably give migrate_page_copy() a better way to preserve the
tag and migrate accounting info, when mapping_cap_account_dirty(). But
that needs some more work: so in the interim, avoid the warning by using
a simple SetPageDirty on PageSwapBacked pages.
Reported-and-tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment above it says "Stat data, not accessed from path walking",
but in fact some of inode fields we use for the common stat data was way
down at the end of the inode, causing unnecessary cache misses for the
common stat operations.
The inode structure is pretty big, and this can change padding depending
on field width, but at least on the common 64-bit configurations this
doesn't change the size. Some of our inode layout has historically been
to tro to avoid unnecessary padding fields, but cache locality is at
least as important for layout, if not more.
Noticed by looking at kernel profiles, and noticing that the "i_blkbits"
access stood out like a sore thumb.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Maintainer activities moved to home server. Update e-mail address to reflect
reality.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Jean Delvare <khali@linux-fr.org>
Convert to use O_DSYNC for all cases at FILEIO backend creation time to
avoid the extra syncing of pure timestamp updates with legacy O_SYNC during
default operation as recommended by hch. Continue to do this independently of
Write Cache Enable (WCE) bit, as WCE=0 is currently the default for all backend
devices and enabled by user on per device basis via attrib/emulate_write_cache.
This patch drops the now unnecessary fd_buffered_io= token usage that was
originally signalling when to explictly disable O_SYNC at backend creation
time for buffered I/O operation. This can end up being dangerous for a number
of reasons during physical node failure, so go ahead and drop this option
for now when O_DSYNC is used as the default.
Also allow explict FUA WRITEs -> vfs_fsync_range() call to function in
fd_execute_cmd() independently of WCE bit setting.
Reported-by: Christoph Hellwig <hch@lst.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
We can't use "input" anymore after calling input_unregister_device().
The call to input_free_device() is a double free. The normal way to
deal with this is to make input_register_device() the last function
called in the function.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: Len Brown <len.brown@intel.com>
We are not yet ready for this and it makes a mess on some devices.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Stop it poking at random registers on the i740 cards that may be out there
still.
As per Matthew's feedback remove the conditional checks and never enable the
opregion handling unless an appropriate driver has been loaded.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
when cifs_reconnect sets maxBuf to 0 and we try to calculate a size
of memory we need to store locks.
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
We swap the sample_id_all header by u64 pointers. Some members of the
header happen to be 32 bit values. We need to handle them separatelly.
Together with other endianity patches, this change fixies perf report
discrepancies on origin and target systems as described in test 1 below,
e.g. following perf report diff:
...
0.12% ps [kernel.kallsyms] [k] clear_page
- 0.12% awk bash [.] alloc_word_desc
+ 0.12% awk bash [.] yyparse
0.11% beah-rhts-task libpython2.6.so.1.0 [.] 0x5560e
0.10% perf libc-2.12.so [.] __ctype_toupper_loc
- 0.09% rhts-test-runne bash [.] maybe_make_export_env
+ 0.09% rhts-test-runne bash [.] 0x385a0
0.09% ps [kernel.kallsyms] [k] page_fault
...
Note, running following to test perf endianity handling:
test 1)
- origin system:
# perf record -a -- sleep 10 (any perf record will do)
# perf report > report.origin
# perf archive perf.data
- copy the perf.data, report.origin and perf.data.tar.bz2
to a target system and run:
# tar xjvf perf.data.tar.bz2 -C ~/.debug
# perf report > report.target
# diff -u report.origin report.target
- the diff should produce no output
(besides some white space stuff and possibly different
date/TZ output)
test 2)
- origin system:
# perf record -ag -fo /tmp/perf.data -- sleep 1
- mount origin system root to the target system on /mnt/origin
- target system:
# perf script --symfs /mnt/origin -I -i /mnt/origin/tmp/perf.data \
--kallsyms /mnt/origin/proc/kallsyms
- complete perf.data header is displayed
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338380624-7443-4-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Adding endianity swapping for event header attached via sample_id_all.
Currently we dont do that and it's causing wrong data to be read when
running report on architecture with different endianity than the record.
The perf is currently able to process 32-bit PPC samples on 32-bit
and 64-bit x86.
Together with other endianity patches, this change fixies perf report
discrepancies on origin and target systems as described in test 1
below, e.g. following perf report diff:
...
0.12% ps [kernel.kallsyms] [k] clear_page
- 0.12% awk bash [.] alloc_word_desc
+ 0.12% awk bash [.] yyparse
0.11% beah-rhts-task libpython2.6.so.1.0 [.] 0x5560e
0.10% perf libc-2.12.so [.] __ctype_toupper_loc
- 0.09% rhts-test-runne bash [.] maybe_make_export_env
+ 0.09% rhts-test-runne bash [.] 0x385a0
0.09% ps [kernel.kallsyms] [k] page_fault
...
Note, running following to test perf endianity handling:
test 1)
- origin system:
# perf record -a -- sleep 10 (any perf record will do)
# perf report > report.origin
# perf archive perf.data
- copy the perf.data, report.origin and perf.data.tar.bz2
to a target system and run:
# tar xjvf perf.data.tar.bz2 -C ~/.debug
# perf report > report.target
# diff -u report.origin report.target
- the diff should produce no output
(besides some white space stuff and possibly different
date/TZ output)
test 2)
- origin system:
# perf record -ag -fo /tmp/perf.data -- sleep 1
- mount origin system root to the target system on /mnt/origin
- target system:
# perf script --symfs /mnt/origin -I -i /mnt/origin/tmp/perf.data \
--kallsyms /mnt/origin/proc/kallsyms
- complete perf.data header is displayed
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338380624-7443-3-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently we dont care about the file object's endianness. It's possible
we read buildid file object from different architecture than we are
currentlly running on. So we need to care about properly reading such
object's data - handle different endianness properly.
Adding:
needs_swap DSO field
dso__swap_init function to initialize DSO's needs_swap
DSO__SWAP to read the data with proper swaps
Together with other endianity patches, this change fixies perf report
discrepancies on origin and target systems as described in test 1 below,
e.g. following perf report diff:
...
0.12% ps [kernel.kallsyms] [k] clear_page
- 0.12% awk bash [.] alloc_word_desc
+ 0.12% awk bash [.] yyparse
0.11% beah-rhts-task libpython2.6.so.1.0 [.] 0x5560e
0.10% perf libc-2.12.so [.] __ctype_toupper_loc
- 0.09% rhts-test-runne bash [.] maybe_make_export_env
+ 0.09% rhts-test-runne bash [.] 0x385a0
0.09% ps [kernel.kallsyms] [k] page_fault
...
Note, running following to test perf endianity handling:
test 1)
- origin system:
# perf record -a -- sleep 10 (any perf record will do)
# perf report > report.origin
# perf archive perf.data
- copy the perf.data, report.origin and perf.data.tar.bz2
to a target system and run:
# tar xjvf perf.data.tar.bz2 -C ~/.debug
# perf report > report.target
# diff -u report.origin report.target
- the diff should produce no output
(besides some white space stuff and possibly different
date/TZ output)
test 1)
- origin system:
# perf record -ag -fo /tmp/perf.data -- sleep 1
- mount origin system root to the target system on /mnt/origin
- target system:
# perf script --symfs /mnt/origin -I -i /mnt/origin/tmp/perf.data \
--kallsyms /mnt/origin/proc/kallsyms
- complete perf.data header is displayed
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338380624-7443-2-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Whilst most monitors do wire up the HPD presence pin, it seems quite a
few KVM do not. Therefore if we simply rely on the HPD pin being
asserted to indicate a connected monitor we fail miserable, so fall back
to performing a DCC query for the EDID.
Reported-and-tested-by: Matthieu LAVIE <boiteamadmax@hotmail.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=50501
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
The new merge_bvec_fn which calls the corresponding function
in subsidiary devices requires that mddev->merge_check_needed
be set if any child has a merge_bvec_fn.
However were were only setting that when a device was hot-added,
not when a device was present from the start.
This bug was introduced in 3.4 so patch is suitable for 3.4.y
kernels. However that are conflicts in raid10.c so a separate
patch will be needed for 3.4.y.
Cc: stable@vger.kernel.org
Reported-by: Sebastian Riemer <sebastian.riemer@profitbricks.com>
Signed-off-by: NeilBrown <neilb@suse.de>
'int login_id' shadows 'static atomic_t login_id'.
Seen as compilation warning on x86-32.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Chris Boot <bootc@bootc.net>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
$ perf script -i /tmp/perf.data
...
gcc 13623 544315.062858: context-switches:
ffffffff815f65c9 __schedule ([kernel.kallsyms])
ffffffff81087cea __cond_resched ([kernel.kallsyms])
ffffffff815f6b92 _cond_resched ([kernel.kallsyms])
ffffffff815fb87a do_page_fault ([kernel.kallsyms])
ffffffff815f8465 page_fault ([kernel.kallsyms])
2b7a71ea0303 _dl_lookup_symbol_x ([kernel.kallsyms])
2b7a71ea1eb5 _dl_relocate_object ([kernel.kallsyms])
2b7a71e99b2e dl_main ([kernel.kallsyms])
2b7a71eab7f4 _dl_sysdep_start ([kernel.kallsyms])
All DSO's in a callchain are printed as [kernel.kallsyms].
git bisect chased it to:
547a92e0ae is the first bad commit
commit 547a92e0ae
Author: Akihiro Nagai <akihiro.nagai.hw@hitachi.com>
Date: Mon Jan 30 13:42:57 2012 +0900
perf script: Unify the expressions indicating "unknown"
The perf script command uses various expressions to indicate "unknown".
It is unfriendly for user scripts to parse it. So, this patch unifies
the expressions to "[unknown]".
Looks like a copy-paste in that the other references use al.map but this one
should be node->map.
With this patch you get:
$ perf script -i /tmp/perf.data
...
gcc 13623 544315.062858: context-switches:
ffffffff815f65c9 __schedule ([kernel.kallsyms])
ffffffff81087cea __cond_resched ([kernel.kallsyms])
ffffffff815f6b92 _cond_resched ([kernel.kallsyms])
ffffffff815fb87a do_page_fault ([kernel.kallsyms])
ffffffff815f8465 page_fault ([kernel.kallsyms])
2b7a71ea0303 _dl_lookup_symbol_x (/lib64/ld-2.14.90.so)
2b7a71ea1eb5 _dl_relocate_object (/lib64/ld-2.14.90.so)
2b7a71e99b2e dl_main (/lib64/ld-2.14.90.so)
2b7a71eab7f4 _dl_sysdep_start (/lib64/ld-2.14.90.so)
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Akihiro Nagai <akihiro.nagai.hw@hitachi.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1338353906-60706-1-git-send-email-dsahern@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When no event is specified the tools use perf_evlist__add_default(), that will
call event_attr_init to initialize the KVM exclusion bits.
When the change was made to the tools so that by default guest samples would be
excluded, the changes were made just to the parsing routines and to
perf_evlist__add_default(), not to perf_evlist__add_attrs, that is used so far
just by perf stat to add multiple events, according to the level of detail
specified.
Recently the tools were changed to reconstruct the event name from all the
details in perf_event_attr, not just from .type and .config, but taking into
account all the feature bits (.exclude_{guest,host,user,kernel,etc},
.precise_ip, etc).
That is when we noticed that the default for perf stat wasn't the one for the
rest of the tools, i.e. the .exclude_guest bit wasn't being set.
I.e. the default, that doesn't call event_attr_init was showing the :HG
modifier:
$ perf stat usleep 1
Performance counter stats for 'usleep 1':
0.942119 task-clock # 0.454 CPUs utilized
1 context-switches # 0.001 M/sec
0 CPU-migrations # 0.000 K/sec
126 page-faults # 0.134 M/sec
693,193 cycles:HG # 0.736 GHz [40.11%]
407,461 stalled-cycles-frontend:HG # 58.78% frontend cycles idle [72.29%]
365,403 stalled-cycles-backend:HG # 52.71% backend cycles idle
465,982 instructions:HG # 0.67 insns per cycle
# 0.87 stalled cycles per insn
89,760 branches:HG # 95.275 M/sec
6,178 branch-misses:HG # 6.88% of all branches
0.002077228 seconds time elapsed
While if one explicitely specifies the same events, which will make the parsing code
to be called and thus event_attr_init is called:
$ perf stat -e task-clock,context-switches,migrations,page-faults,cycles,stalled-cycles-frontend,stalled-cycles-backend,instructions,branches,branch-misses usleep 1
Performance counter stats for 'usleep 1':
1.040349 task-clock # 0.500 CPUs utilized
2 context-switches # 0.002 M/sec
0 CPU-migrations # 0.000 K/sec
127 page-faults # 0.122 M/sec
587,966 cycles # 0.565 GHz [13.18%]
459,167 stalled-cycles-frontend # 78.09% frontend cycles idle
390,249 stalled-cycles-backend # 66.37% backend cycles idle
504,006 instructions # 0.86 insns per cycle
# 0.91 stalled cycles per insn
96,455 branches # 92.714 M/sec
6,522 branch-misses # 6.76% of all branches [96.12%]
0.002078681 seconds time elapsed
Fix it by introducing a perf_evlist__add_default_attrs method that will call
evlist_attr_init in all the perf_event_attr entries before adding the events.
Reported-by: Ingo Molnar <mingo@kernel.org>
Cc: David Ahern <dsahern@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/n/tip-4eysr236r0pgiyum9epwxw7s@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
if they were the only element on their rq. However, with cgroups
a SCHED_RR task could be the only element on its per-cgroup rq but
still be competing with other SCHED_RR tasks in its parent's
cgroup. In this case, the SCHED_RR task in the child cgroup would
never yield at the end of its timeslice. If the child cgroup
rt_runtime_us was the same as the parent cgroup rt_runtime_us,
the task in the parent cgroup would starve completely.
Modify task_tick_rt() to check that the task is the only task on its
rq, and that the each of the scheduling entities of its ancestors
is also the only entity on its rq.
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SD_OVERLAP exists to allow overlapping groups, overlapping groups
appear in NUMA topologies that aren't fully connected.
The typical result of not fully connected NUMA is that each cpu (or
rather node) will have different spans for a particular distance.
However due to how sched domains are traversed -- only the first cpu
in the mask goes one level up -- the next level only cares about the
spans of the cpus that went up.
Due to this two things were observed to be broken:
- build_overlap_sched_groups() -- since its possible the cpu we're
building the groups for exists in multiple (or all) groups, the
selection criteria of the first group didn't ensure there was a cpu
for which is was true that cpumask_first(span) == cpu. Thus load-
balancing would terminate.
- update_group_power() -- assumed that the cpu span of the first
group of the domain was covered by all groups of the child domain.
The above explains why this isn't true, so deal with it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Follow up on commit 556061b00 ("sched/nohz: Fix rq->cpu_load[]
calculations") since while that fixed the busy case it regressed the
mostly idle case.
Add a callback from the nohz exit to also age the rq->cpu_load[]
array. This closes the hole where either there was no nohz load
balance pass during the nohz, or there was a 'significant' amount of
idle time between the last nohz balance and the nohz exit.
So we'll update unconditionally from the tick to not insert any
accidental 0 load periods while busy, and we try and catch up from
nohz idle balance and nohz exit. Both these are still prone to missing
a jiffy, but that has always been the case.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge back Linus's latest branch so that we pick up the uprobes changes.
( I tested this branch locally and while it's one from the middle of the
merge window it's a good one to base further work off. )
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Correct queue free count math for SQ, RQ for all hardware type.
Update user-kernel ABI interface.
Signed-off-by: Parav Pandit <parav.pandit@emulex.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
The comparison between the system sleep state being entered
and the lowest system sleep state the given device may wake up
from in acpi_pm_device_sleep_state() is reversed, because the
specification (ACPI 5.0) says that for wakeup to work:
"The sleeping state being entered must be less than or equal to the
power state declared in element 1 of the _PRW object."
In other words, the state returned by _PRW is the deepest
(lowest-power) system sleep state the device is capable of waking up
the system from.
Moreover, acpi_pm_device_sleep_state() also should check if the
wakeup capability is supported through ACPI, because in principle it
may be done via native PCIe PME, for example, in which case _SxW
should not be evaluated.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
After recent changes of the ACPI device power states definitions, if
power resources are not used for the device's power management, the
state returned by __acpi_bus_get_power() cannot exceed D3hot, because
the return values of _PSC are 0 through 3. However, if the _PR3
method is not present for the device and _PS3 returns 3, we have to
assume that the device is in D3cold, so the value returned by
__acpi_bus_get_power() in that case should be 4.
Similarly, acpi_power_get_inferred_state() should take the power
resources for the D3hot state into account in general, so that it
can return 3 if those resources are "on" or 4 (D3cold) otherwise.
Fix the the above two issues and make sure that if both _PSC and
_PR3 are present for the device, the power resources listed by _PR3
will be used to determine if the number 3 returned by _PSC is meant
to represent D3cold or D3hot.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
After recent changes of the ACPI device low-power states definitions
kernel messages in drivers/acpi/bus.c need to be updated so that they
include the correct names of the states in question (currently is
"D3" for D3hot and "D4" for D3cold, which is incorrect).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
When the ACPI-driven RTC alarm wakes the system, report it as a wakeup
event. This allows userspace to determine that the reason for system
wakeup was RTC alarm.
Signed-off-by: Daniel Drake <dsd@laptop.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
When the system is woken up by the ACPI fixed power button, currently there
is no way of userspace becoming aware that the power button was pressed.
OLPC would like to know this, so that we can respond appropriately.
For example, if the system was woken up by a network packet, we know
we can go back to sleep very quickly. But if the user explicitly woke the
system with the power button, we're going to want to stay awake for a
while.
The wakeup count mechanism seems like a good fit for communicating this.
Mark the fixed power button as wakeup-enabled, and increment its wakeup
counter when the system is woken with the power button. (The wakeup counter
is also incremented when the power button is pressed during system
operation; this is already handled by an existing acpi-button codepath).
Signed-off-by: Daniel Drake <dsd@laptop.org>
Acked-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
When we reset the ring control registers, including the HEAD and TAIL of
the ring, we also need to reset associated state. In this instance, we
were failing to reset the cached value of ring->last_retired_head and so
upon the first request for more space following a resume would
potentially (depending on a narrow race window) believe that the HEAD had
advanced much further than reality.
This is a regression from:
commit a71d8d9452
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Feb 15 11:25:36 2012 +0000
drm/i915: Record the tail at each request and use it to estimate the head
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: stable@vger.kernel.org # 3.4
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Make the recovery functions static to fix the following sparse warnings:
lib/raid6/recov.c:25:6: warning: symbol 'raid6_2data_recov_intx1' was
not declared. Should it be static?
lib/raid6/recov.c:69:6: warning: symbol 'raid6_datap_recov_intx1' was
not declared. Should it be static?
lib/raid6/recov_ssse3.c:22:6: warning: symbol 'raid6_2data_recov_ssse3'
was not declared. Should it be static?
lib/raid6/recov_ssse3.c:197:6: warning: symbol 'raid6_datap_recov_ssse3'
was not declared. Should it be static?
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jim Kukunas <james.t.kukunas@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
48ddbe1946 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed. As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.
cgroup_create() does grab an active reference on the superblock to
prevent it from going away while there are !root cgroups; however, the
reference is put from cgroup_diput() which is invoked on cgroup
removal, so cgroup dentries which are removed but persisting due to
lingering csses already have released their superblock active refs
allowing superblock to be killed while those dentries are around.
Given the right condition, this makes cgroup_kill_sb() call
kill_litter_super() with dentries with non-zero d_count leading to
BUG() in shrink_dcache_for_umount_subtree().
Fix it by adding cgroup_dops->d_release() operation and moving
deactivate_super() to it. cgroup_diput() now marks dentry->d_fsdata
with itself if superblock should be deactivated and cgroup_d_release()
deactivates the superblock on dentry release.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
LKML-Reference: <CA+1xoqe5hMuxzCRhMy7J0XchDk2ZnuxOHJKikROk1-ReAzcT6g@mail.gmail.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Let the user decide whether power consumption or jitter is the
more important consideration for their machines.
Quoting removal commit af5ab277de:
"Historically, Linux has tried to make the regular timer tick on the
various CPUs not happen at the same time, to avoid contention on
xtime_lock.
Nowadays, with the tickless kernel, this contention no longer happens
since time keeping and updating are done differently. In addition,
this skew is actually hurting power consumption in a measurable way on
many-core systems."
Problems:
- Contrary to the above, systems do encounter contention on both
xtime_lock and RCU structure locks when the tick is synchronized.
- Moderate sized RT systems suffer intolerable jitter due to the tick
being synchronized.
- SGI reports the same for their large systems.
- Fully utilized systems reap no power saving benefit from skew removal,
but do suffer from resulting induced lock contention.
- 0209f649 rcu: limit rcu_node leaf-level fanout
This patch was born to combat lock contention which testing showed
to have been _induced by_ skew removal. Skew the tick, contention
disappeared virtually completely.
Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
Link: http://lkml.kernel.org/r/1336472458.21924.78.camel@marge.simpson.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Every interrupt which is an active wakeup source needs the ability to
abort suspend if there is a pending irq. Right now only edge and level
irqs can do that.
|
+---------+
| INTC |
+---------+
| GPIO_IRQ
+------------+
| gpio-exp |
+------------+
| |
GPIO0_IRQ GPIO1_IRQ
In the above diagram, gpio expander has irq number GPIO_IRQ, it is
connected with two sub GPIO pins, GPIO0 and GPIO1.
During suspend, we set IRQF_NO_SUSPEND for GPIO_IRQ so that gpio
expander driver can handle the sub irq GPIO0_IRQ and GPIO1_IRQ, and
these two irqs themselves can further be handled by simple or nested
irq in some drivers(typically gpio and mfd driver). If they are used
as wakeup sources during suspend, we want them to be able to abort
suspend too.
Setting IRQS_PENDING flag in handle_nested_irq() and handle_simple_irq()
when the irq is disabled allows check_wakeup_irqs() to identify such
irqs as source for aborting suspend.
Signed-off-by: Ning Jiang <ning.n.jiang@gmail.com>
Cc: rjw@sisk.pl
Link: http://lkml.kernel.org/r/CAH3Oq6T905%2B3fkF43NAMMFvJvq7dsk_so6T2vQ8ZJrA5xiU3YA@mail.gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch 4of4 adds configuration and documentation files including a FAQ.
[v14: updated docs/FAQ to use zcache and RAMster as examples]
[v10: no change]
[v9: akpm@linux-foundation.org: sysfs->debugfs; no longer need Doc/ABI file]
[v8: rebase to 3.0-rc4]
[v7: rebase to 3.0-rc3]
[v6: rebase to 3.0-rc1]
[v5: change config default to n]
[v4: rebase to 2.6.39]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This patch, 3of4, provides the core frontswap code that interfaces between
the hooks in the swap subsystem and a frontswap backend via frontswap_ops.
---
New file added: mm/frontswap.c
[v14: add support for writethrough, per suggestion by aarcange@redhat.com]
[v11: sjenning@linux.vnet.ibm.com: s/puts/failed_puts/]
[v10: sjenning@linux.vnet.ibm.com: fix debugfs calls on 32-bit]
[v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 1]
[v9: akpm@linux-foundation.org: mark some statics __read_mostly]
[v9: akpm@linux-foundation.org: add clarifying comments]
[v9: akpm@linux-foundation.org: no need to loop repeating try_to_unuse]
[v9: error27@gmail.com: remove superfluous check for NULL]
[v8: rebase to 3.0-rc4]
[v8: kamezawa.hiroyu@jp.fujitsu.com: add comment to clarify find_next_to_unuse]
[v7: rebase to 3.0-rc3]
[v7: JBeulich@novell.com: use new static inlines, no-ops if not config'd]
[v6: rebase to 3.1-rc1]
[v6: lliubbo@gmail.com: use vzalloc]
[v6: lliubbo@gmail.com: fix null pointer deref if vzalloc fails]
[v6: konrad.wilk@oracl.com: various checks and code clarifications/comments]
[v4: rebase to 2.6.39]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
[v12: Squashed s/flush/invalidate/ in]
[v15: A bit of cleanup and seperate DEBUGFS]
Signed-off-by: Konrad Wilk <konrad.wilk@oracle.com>
This patch, 2of4, contains the changes to the core swap subsystem.
This includes:
(1) makes available core swap data structures (swap_lock, swap_list and
swap_info) that are needed by frontswap.c but we don't need to expose them
to the dozens of files that include swap.h so we create a new swapfile.h
just to extern-ify these and modify their declarations to non-static
(2) adds frontswap-related elements to swap_info_struct. Frontswap_map
points to vzalloc'ed one-bit-per-swap-page metadata that indicates
whether the swap page is in frontswap or in the device and frontswap_pages
counts how many pages are in frontswap.
(3) adds hooks in the swap subsystem and extends try_to_unuse so that
frontswap_shrink can do a "partial swapoff".
Note that a failed frontswap_map allocation is safe... failure is noted
by lack of "FS" in the subsequent printk.
---
[v14: rebase to 3.4-rc2]
[v10: no change]
[v9: akpm@linux-foundation.org: mark some statics __read_mostly]
[v9: akpm@linux-foundation.org: add clarifying comments]
[v9: akpm@linux-foundation.org: no need to loop repeating try_to_unuse]
[v9: error27@gmail.com: remove superfluous check for NULL]
[v8: rebase to 3.0-rc4]
[v8: kamezawa.hiroyu@jp.fujitsu.com: change counter to atomic_t to avoid races]
[v8: kamezawa.hiroyu@jp.fujitsu.com: comment to clarify informational counters]
[v7: rebase to 3.0-rc3]
[v7: JBeulich@novell.com: add new swap struct elements only if config'd]
[v6: rebase to 3.0-rc1]
[v6: lliubbo@gmail.com: fix null pointer deref if vzalloc fails]
[v6: konrad.wilk@oracl.com: various checks and code clarifications/comments]
[v5: no change from v4]
[v4: rebase to 2.6.39]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
[v11: Rebased, fixed mm/swapfile.c context change]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Frontswap is the alter ego of cleancache, the "yang" to cleancache's
"yin"... and more precisely frontswap is the provider of anonymous
pages to transcendent memory to nicely complement cleancache's providing
of clean pagecache pages to transcendent memory. For optimal use
of transcendent memory, both are necessary... because a kernel
under memory pressure first reclaims clean pagecache pages and,
when under more memory pressure, starts swapping anonymous pages.
Frontswap and cleancache (which was merged at 3.0) are the "frontends"
and the only necessary changes to the core kernel for transcendent memory;
all other supporting code -- the "backends" -- is implemented as drivers.
See the LWN.net article "Transcendent memory in a nutshell" for a detailed
overview of frontswap and related kernel parts:
https://lwn.net/Articles/454795/
Frontswap code was first posted publicly in January 2009 and on LKML in
May 2009, and has remained functionally stable for nearly three years now.
It is barely invasive, touching only the swap subsystem and adds less
than 100 lines of code to existing swap subsystem code files.
It has improved syntactically substantially between V1 and this posting
of V14, thanks to the review of a few kernel developers, and has adapted
easily to at least one major swap subsystem change. As of 3.4, there are
three in-tree users of frontswap patiently waiting for this patchset and
for CONFIG_FRONTSWAP to be enabled: zcache (staging driver merged at
2.6.39), Xen tmem (merged at 3.0 and 3.1) and RAMster (staging driver
merged at 3.4). In addition, a RFC has been posted for a KVM backend.
The frontswap patchset has been in linux-next since next-110603. Earlier
versions of frontswap already ship in the Oracle Unbreakable Enterprise Kernel
and SuSE SLES.
This patch, 1of4, provides the header file for the core code for frontswap
that interfaces between the hooks in the swap subsystem and a frontswap
backend via frontswap_ops.
---
New file added: include/linux/frontswap.h
[v14: add support for writethrough, per suggestion by aarcange@redhat.com]
[v14: rebase to 3.4-rc2]
[v11: konrad.wilk@oracle.com: squashed s/flush/invalidate/ in]
[v10: no change]
[v9: akpm@linux-foundation.org: change "flush" to "invalidate", part 1]
[v8: rebase to 3.0-rc4]
[v7: rebase to 3.0-rc3]
[v7: JBeulich@novell.com: new static inlines resolve to no-ops if not config'd]
[v7: JBeulich@novell.com: avoid redundant shifts/divides for *_bit lib calls]
[v6: rebase to 3.1-rc1]
[v5: no change from v4]
[v4: rebase to 2.6.39]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Jan Beulich <JBeulich@novell.com>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Rik Riel <riel@redhat.com>
[v15: int/bool on some functions]
Signed-off-by: Konrad Wilk <konrad.wilk@oracle.com>
Don't use inode->i_blkbits which might be stale, instead calculate the blksize
information from the freshly obtained attributes.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Now we store attr->ino at inode->i_ino, return attr->ino at the
first time and then return inode->i_ino if the attribute timeout
isn't expired. That's wrong on 32 bit platforms because attr->ino
is 64 bit and inode->i_ino is 32 bit in this case.
Fix this by saving 64 bit ino in fuse_inode structure and returning
it every time we call getattr. Also squash attr->ino into inode->i_ino
explicitly.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
The EliteBook 8560W has non-initialized entries in its _PSS ACPI
table. Instead of bailing out when the first non-initialized entry is
found, ignore it and use only the valid entries. Only bail out if there
is no valid entry at all.
[v3: Fixes suggested by Konrad]
Signed-off-by: Marco Aurelio da Costa <costa@gamic.com>
Signed-off-by: Len Brown <len.brown@intel.com>
We only need to regenerate the sysfs files when the capacity units
change, avoid the update otherwise.
The origin of this issue is dates way back to 2.6.38:
da8aeb92d4
(ACPI / Battery: Update information on info notification and resume)
cc: <stable@vger.kernel.org>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Tested-by: Ralf Jung <post@ralfj.de>
Signed-off-by: Len Brown <len.brown@intel.com>
fallocate filesystem operation preallocates media space for the given file.
If fallocate returns success then any subsequent write to the given range
never fails with 'not enough space' error.
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
This patch replaces the code for getting an number from a
userspace buffer by a simple call to kstroul_from_user.
This makes it easier to read and less error prone.
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-04-25 12:25:05 +02:00
197 changed files with 3836 additions and 1466 deletions
@@ -697,10 +697,10 @@ static int mlx4_common_set_port(struct mlx4_dev *dev, int slave, u32 in_mod,
if(slave!=dev->caps.function)
memset(inbox->buf,0,256);
if(dev->flags&MLX4_FLAG_OLD_PORT_CMDS){
*(u8*)inbox->buf=!!reset_qkey_viols<<6;
*(u8*)inbox->buf|=!!reset_qkey_viols<<6;
((__be32*)inbox->buf)[2]=agg_cap_mask;
}else{
((u8*)inbox->buf)[3]=!!reset_qkey_viols;
((u8*)inbox->buf)[3]|=!!reset_qkey_viols;
((__be32*)inbox->buf)[1]=agg_cap_mask;
}
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.