| Commit message (Collapse) | Author | Age | Files | Lines |
| |\ |
|
| | |\
| | |
| | |
| | | |
This is the 3.4.103 stable release
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 504d58745c9ca28d33572e2d8a9990b43e06075d upstream.
clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:
======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
(&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c
but task is already holding lock:
(hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (hrtimer_bases.lock){-.-...}:
[<8104a942>] lock_acquire+0x92/0x101
[<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
[<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
[<81080792>] task_clock_event_start+0x3a/0x3f
[<810807a4>] task_clock_event_add+0xd/0x14
[<8108259a>] event_sched_in+0xb6/0x17a
[<810826a2>] group_sched_in+0x44/0x122
[<81082885>] ctx_sched_in.isra.67+0x105/0x11f
[<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
[<81082bf6>] __perf_install_in_context+0x8b/0xa3
[<8107eb8e>] remote_function+0x12/0x2a
[<8105f5af>] smp_call_function_single+0x2d/0x53
[<8107e17d>] task_function_call+0x30/0x36
[<8107fb82>] perf_install_in_context+0x87/0xbb
[<810852c9>] SYSC_perf_event_open+0x5c6/0x701
[<810856f9>] SyS_perf_event_open+0x17/0x19
[<8142f8ee>] syscall_call+0x7/0xb
-> #4 (&ctx->lock){......}:
[<8104a942>] lock_acquire+0x92/0x101
[<8142f04c>] _raw_spin_lock+0x21/0x30
[<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
[<8142cacc>] __schedule+0x4c6/0x4cb
[<8142cae0>] schedule+0xf/0x11
[<8142f9a6>] work_resched+0x5/0x30
-> #3 (&rq->lock){-.-.-.}:
[<8104a942>] lock_acquire+0x92/0x101
[<8142f04c>] _raw_spin_lock+0x21/0x30
[<81040873>] __task_rq_lock+0x33/0x3a
[<8104184c>] wake_up_new_task+0x25/0xc2
[<8102474b>] do_fork+0x15c/0x2a0
[<810248a9>] kernel_thread+0x1a/0x1f
[<814232a2>] rest_init+0x1a/0x10e
[<817af949>] start_kernel+0x303/0x308
[<817af2ab>] i386_start_kernel+0x79/0x7d
-> #2 (&p->pi_lock){-.-...}:
[<8104a942>] lock_acquire+0x92/0x101
[<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<810413dd>] try_to_wake_up+0x1d/0xd6
[<810414cd>] default_wake_function+0xb/0xd
[<810461f3>] __wake_up_common+0x39/0x59
[<81046346>] __wake_up+0x29/0x3b
[<811b8733>] tty_wakeup+0x49/0x51
[<811c3568>] uart_write_wakeup+0x17/0x19
[<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
[<811c5f28>] serial8250_handle_irq+0x54/0x6a
[<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
[<811c56d8>] serial8250_interrupt+0x38/0x9e
[<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
[<81051296>] handle_irq_event+0x2c/0x43
[<81052cee>] handle_level_irq+0x57/0x80
[<81002a72>] handle_irq+0x46/0x5c
[<810027df>] do_IRQ+0x32/0x89
[<8143036e>] common_interrupt+0x2e/0x33
[<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
[<811c25a4>] uart_start+0x2d/0x32
[<811c2c04>] uart_write+0xc7/0xd6
[<811bc6f6>] n_tty_write+0xb8/0x35e
[<811b9beb>] tty_write+0x163/0x1e4
[<811b9cd9>] redirected_tty_write+0x6d/0x75
[<810b6ed6>] vfs_write+0x75/0xb0
[<810b7265>] SyS_write+0x44/0x77
[<8142f8ee>] syscall_call+0x7/0xb
-> #1 (&tty->write_wait){-.....}:
[<8104a942>] lock_acquire+0x92/0x101
[<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<81046332>] __wake_up+0x15/0x3b
[<811b8733>] tty_wakeup+0x49/0x51
[<811c3568>] uart_write_wakeup+0x17/0x19
[<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
[<811c5f28>] serial8250_handle_irq+0x54/0x6a
[<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
[<811c56d8>] serial8250_interrupt+0x38/0x9e
[<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
[<81051296>] handle_irq_event+0x2c/0x43
[<81052cee>] handle_level_irq+0x57/0x80
[<81002a72>] handle_irq+0x46/0x5c
[<810027df>] do_IRQ+0x32/0x89
[<8143036e>] common_interrupt+0x2e/0x33
[<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
[<811c25a4>] uart_start+0x2d/0x32
[<811c2c04>] uart_write+0xc7/0xd6
[<811bc6f6>] n_tty_write+0xb8/0x35e
[<811b9beb>] tty_write+0x163/0x1e4
[<811b9cd9>] redirected_tty_write+0x6d/0x75
[<810b6ed6>] vfs_write+0x75/0xb0
[<810b7265>] SyS_write+0x44/0x77
[<8142f8ee>] syscall_call+0x7/0xb
-> #0 (&port_lock_key){-.....}:
[<8104a62d>] __lock_acquire+0x9ea/0xc6d
[<8104a942>] lock_acquire+0x92/0x101
[<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<811c60be>] serial8250_console_write+0x8c/0x10c
[<8104e402>] call_console_drivers.constprop.31+0x87/0x118
[<8104f5d5>] console_unlock+0x1d7/0x398
[<8104fb70>] vprintk_emit+0x3da/0x3e4
[<81425f76>] printk+0x17/0x19
[<8105bfa0>] clockevents_program_min_delta+0x104/0x116
[<8105c548>] clockevents_program_event+0xe7/0xf3
[<8105cc1c>] tick_program_event+0x1e/0x23
[<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
[<8103c49e>] __remove_hrtimer+0x5b/0x79
[<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
[<8103cb4b>] hrtimer_cancel+0xd/0x18
[<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
[<81080705>] task_clock_event_stop+0x20/0x64
[<81080756>] task_clock_event_del+0xd/0xf
[<81081350>] event_sched_out+0xab/0x11e
[<810813e0>] group_sched_out+0x1d/0x66
[<81081682>] ctx_sched_out+0xaf/0xbf
[<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
[<8142cacc>] __schedule+0x4c6/0x4cb
[<8142cae0>] schedule+0xf/0x11
[<8142f9a6>] work_resched+0x5/0x30
other info that might help us debug this:
Chain exists of:
&port_lock_key --> &ctx->lock --> hrtimer_bases.lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(hrtimer_bases.lock);
lock(&ctx->lock);
lock(hrtimer_bases.lock);
lock(&port_lock_key);
*** DEADLOCK ***
4 locks held by trinity-main/74:
#0: (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
#1: (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
#2: (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
#3: (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4
stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
[<81426f69>] dump_stack+0x16/0x18
[<81425a99>] print_circular_bug+0x18f/0x19c
[<8104a62d>] __lock_acquire+0x9ea/0xc6d
[<8104a942>] lock_acquire+0x92/0x101
[<811c60be>] ? serial8250_console_write+0x8c/0x10c
[<811c6032>] ? wait_for_xmitr+0x76/0x76
[<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
[<811c60be>] ? serial8250_console_write+0x8c/0x10c
[<811c60be>] serial8250_console_write+0x8c/0x10c
[<8104af87>] ? lock_release+0x191/0x223
[<811c6032>] ? wait_for_xmitr+0x76/0x76
[<8104e402>] call_console_drivers.constprop.31+0x87/0x118
[<8104f5d5>] console_unlock+0x1d7/0x398
[<8104fb70>] vprintk_emit+0x3da/0x3e4
[<81425f76>] printk+0x17/0x19
[<8105bfa0>] clockevents_program_min_delta+0x104/0x116
[<8105cc1c>] tick_program_event+0x1e/0x23
[<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
[<8103c49e>] __remove_hrtimer+0x5b/0x79
[<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
[<8103cb4b>] hrtimer_cancel+0xd/0x18
[<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
[<81080705>] task_clock_event_stop+0x20/0x64
[<81080756>] task_clock_event_del+0xd/0xf
[<81081350>] event_sched_out+0xab/0x11e
[<810813e0>] group_sched_out+0x1d/0x66
[<81081682>] ctx_sched_out+0xaf/0xbf
[<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
[<8104416d>] ? __dequeue_entity+0x23/0x27
[<81044505>] ? pick_next_task_fair+0xb1/0x120
[<8142cacc>] __schedule+0x4c6/0x4cb
[<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
[<810475b0>] ? trace_hardirqs_off+0xb/0xd
[<81056346>] ? rcu_irq_exit+0x64/0x77
Fix the problem by using printk_deferred() which does not call into the
scheduler.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.
This only changes the function name. No logic changes.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | |\|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This is the 3.4.100 stable release
Conflicts:
Makefile
Change-Id: Iff7496cbcc27ba14ab19c6f4ed57ea6e6484fdb6
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 4320f6b1d9db4ca912c5eb6ecb328b2e090e1586 upstream.
The commit [247bc037: PM / Sleep: Mitigate race between the freezer
and request_firmware()] introduced the finer state control, but it
also leads to a new bug; for example, a bug report regarding the
firmware loading of intel BT device at suspend/resume:
https://bugzilla.novell.com/show_bug.cgi?id=873790
The root cause seems to be a small window between the process resume
and the clear of usermodehelper lock. The request_firmware() function
checks the UMH lock and gives up when it's in UMH_DISABLE state. This
is for avoiding the invalid f/w loading during suspend/resume phase.
The problem is, however, that usermodehelper_enable() is called at the
end of thaw_processes(). Thus, a thawed process in between can kick
off the f/w loader code path (in this case, via btusb_setup_intel())
even before the call of usermodehelper_enable(). Then
usermodehelper_read_trylock() returns an error and request_firmware()
spews WARN_ON() in the end.
This oneliner patch fixes the issue just by setting to UMH_FREEZING
state again before restarting tasks, so that the call of
request_firmware() will be blocked until the end of this function
instead of returning an error.
Fixes: 247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware())
Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 16927776ae757d0d132bdbfabbfe2c498342bd59 upstream.
Sharvil noticed with the posix timer_settime interface, using the
CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
tried to specify a relative time timer, it would incorrectly be
treated as absolute regardless of the state of the flags argument.
This patch corrects this, properly checking the absolute/relative flag,
as well as adds further error checking that no invalid flag bits are set.
Reported-by: Sharvil Nanavati <sharvil@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Sharvil Nanavati <sharvil@google.com>
Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 27e35715df54cbc4f2d044f681802ae30479e7fb upstream.
When the rtmutex fast path is enabled the slow unlock function can
create the following situation:
spin_lock(foo->m->wait_lock);
foo->m->owner = NULL;
rt_mutex_lock(foo->m); <-- fast path
free = atomic_dec_and_test(foo->refcnt);
rt_mutex_unlock(foo->m); <-- fast path
if (free)
kfree(foo);
spin_unlock(foo->m->wait_lock); <--- Use after free.
Plug the race by changing the slow unlock to the following scheme:
while (!rt_mutex_has_waiters(m)) {
/* Clear the waiters bit in m->owner */
clear_rt_mutex_waiters(m);
owner = rt_mutex_owner(m);
spin_unlock(m->wait_lock);
if (cmpxchg(m->owner, owner, 0) == owner)
return;
spin_lock(m->wait_lock);
}
So in case of a new waiter incoming while the owner tries the slow
path unlock we have two situations:
unlock(wait_lock);
lock(wait_lock);
cmpxchg(p, owner, 0) == owner
mark_rt_mutex_waiters(lock);
acquire(lock);
Or:
unlock(wait_lock);
lock(wait_lock);
mark_rt_mutex_waiters(lock);
cmpxchg(p, owner, 0) != owner
enqueue_waiter();
unlock(wait_lock);
lock(wait_lock);
wakeup_next waiter();
unlock(wait_lock);
lock(wait_lock);
acquire(lock);
If the fast path is disabled, then the simple
m->owner = NULL;
unlock(m->wait_lock);
is sufficient as all access to m->owner is serialized via
m->wait_lock;
Also document and clarify the wakeup_next_waiter function as suggested
by Oleg Nesterov.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 3d5c9340d1949733eb37616abd15db36aef9a57c upstream.
Even in the case when deadlock detection is not requested by the
caller, we can detect deadlocks. Right now the code stops the lock
chain walk and keeps the waiter enqueued, even on itself. Silly not to
yell when such a scenario is detected and to keep the waiter enqueued.
Return -EDEADLK unconditionally and handle it at the call sites.
The futex calls return -EDEADLK. The non futex ones dequeue the
waiter, throw a warning and put the task into a schedule loop.
Tagged for stable as it makes the code more robust.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brad Mouring <bmouring@ni.com>
Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 82084984383babe728e6e3c9a8e5c46278091315 upstream.
When we walk the lock chain, we drop all locks after each step. So the
lock chain can change under us before we reacquire the locks. That's
harmless in principle as we just follow the wrong lock path. But it
can lead to a false positive in the dead lock detection logic:
T0 holds L0
T0 blocks on L1 held by T1
T1 blocks on L2 held by T2
T2 blocks on L3 held by T3
T4 blocks on L4 held by T4
Now we walk the chain
lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
lock T2 -> adjust T2 -> drop locks
T2 times out and blocks on L0
Now we continue:
lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.
Brad tried to work around that in the deadlock detection logic itself,
but the more I looked at it the less I liked it, because it's crystal
ball magic after the fact.
We actually can detect a chain change very simple:
lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
next_lock = T2->pi_blocked_on->lock;
drop locks
T2 times out and blocks on L0
Now we continue:
lock T2 ->
if (next_lock != T2->pi_blocked_on->lock)
return;
So if we detect that T2 is now blocked on a different lock we stop the
chain walk. That's also correct in the following scenario:
lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
next_lock = T2->pi_blocked_on->lock;
drop locks
T3 times out and drops L3
T2 acquires L3 and blocks on L4 now
Now we continue:
lock T2 ->
if (next_lock != T2->pi_blocked_on->lock)
return;
We don't have to follow up the chain at that point, because T2
propagated our priority up to T4 already.
[ Folded a cleanup patch from peterz ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Brad Mouring <bmouring@ni.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 397335f004f41e5fcf7a795e94eb3ab83411a17c upstream.
The current deadlock detection logic does not work reliably due to the
following early exit path:
/*
* Drop out, when the task has no waiters. Note,
* top_waiter can be NULL, when we are in the deboosting
* mode!
*/
if (top_waiter && (!task_has_pi_waiters(task) ||
top_waiter != task_top_pi_waiter(task)))
goto out_unlock_pi;
So this not only exits when the task has no waiters, it also exits
unconditionally when the current waiter is not the top priority waiter
of the task.
So in a nested locking scenario, it might abort the lock chain walk
and therefor miss a potential deadlock.
Simple fix: Continue the chain walk, when deadlock detection is
enabled.
We also avoid the whole enqueue, if we detect the deadlock right away
(A-A). It's an optimization, but also prevents that another waiter who
comes in after the detection and before the task has undone the damage
observes the situation and detects the deadlock and returns
-EDEADLOCK, which is wrong as the other task is not in a deadlock
situation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 099ed151675cd1d2dbeae1dac697975f6a68716d upstream.
Disabling reading and writing to the trace file should not be able to
disable all function tracing callbacks. There's other users today
(like kprobes and perf). Reading a trace file should not stop those
from happening.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.
When runing with the kernel(3.15-rc7+), the follow bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66
[ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130
[ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
[ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380
[ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
[ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90
[ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
The cause is that cpuset_mems_allowed() try to take
mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(). So that we can avoid this bug.
This patch is a temporary solution that just addresses the bug
mentioned above, can not fix the long-standing issue about cpuset.mems
rebinding on fork():
"When the forker's task_struct is duplicated (which includes
->mems_allowed) and it races with an update to cpuset_being_rebound
in update_tasks_nodemask() then the task's mems_allowed doesn't get
updated. And the child task's mems_allowed can be wrong if the
cpuset's nodemask changes before the child has been added to the
cgroup's tasklist."
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Sun, 18 Aug 2013 16:25:16 +0800
Since cpu_avg_load_per_task is used only by cfs scheduler, its meaning
should present the average cfs type task load in the current run queue.
Thus we change it to h_nr_running for well presenting its meaning.
Signed-off-by: Lei Wen <leiwen@marvell.com>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | | |
from patch sched: change load balance number to h_nr_running of run queue
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Sun, 18 Aug 2013 16:25:15 +0800
Since rq->nr_running would include both migration and rt task, it is not
reasonable to seek to move nr_running number of task in the load_balance
function, since it only apply to cfs type.
Change it to cfs's h_nr_running, which could well present the task
number in current cfs queue.
Signed-off-by: Lei Wen <leiwen@marvell.com>
backported to Linux 3.4 by faux123
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This RFC patch builds on patch 2 and periodically decays that max value to
do idle balancing per sched domain.
Though we want to decay it fairly consistently, we may not want to lower it by
too much each time, especially since avg_idle is capped based on that value.
So I thought that decaying the value every second and lowering it by half a
percent each time appeared to be fairly reasonable.
This change would allow us to remove the limit we set on each domain's max cost
to idle balance. Also, since the max can be reduced now, we now have to
update rq->max_idle balance_cost more frequently. So after every idle balance,
we loop through the sched domain to find the max sd's newidle load balance cost
for any one domain. Then we will set rq->max_idle_balance_cost to that value.
Since we are now decaying the max cost to do idle balancing, that max cost can
also become not high enough. One possible explanation for why is that
besides the time spent on each newidle load balance, there are other costs
associated with attempting idle balancing. Idle balance also releases and
reacquires a spin lock. That cost is not counted when we keep track of each
domain's cost to do newidle load balance. Also, acquiring the rq locks can
potentially prevent other CPUs from running something useful. And after
migrating tasks, it might potentially have to pay the costs of cache misses and
refreshing tasks' cache.
Because of that, this patch also compares avg_idle with max cost to do idle
balancing + sched_migration_cost. While using the max cost helps reduce
overestimating the average idle, the sched_migration_cost can help account
for those additional costs of idle balancing.
Signed-off-by: Jason Low <jason.low2@hp.com>
[peterz: rewrote the logic, but kept the spirit]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
backported to Linux 3.4
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Thu, 29 Aug 2013 13:05:35 -0700
In this patch, we keep track of the max cost we spend doing idle load balancing
for each sched domain. If the avg time the CPU remains idle is less then the
time we have already spent on idle balancing + the max cost of idle balancing
in the sched domain, then we don't continue to attempt the balance. We also
keep a per rq variable, max_idle_balance_cost, which keeps track of the max
time spent on newidle load balances throughout all its domains. Additionally,
we swap sched_migration_cost used in idle_balance for rq->max_idle_balance_cost.
By using the max, we avoid overrunning the average. This further reduces the chance
we attempt balancing when the CPU is not idle for longer than the cost to balance.
I also limited the max cost of each domain to 5*sysctl_sched_migration_cost as
a way to prevent the max from becoming too inflated.
Signed-off-by: Jason Low <jason.low2@hp.com>
backported for Linux 3.4
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Thu, 29 Aug 2013 13:05:34 -0700
When updating avg_idle, if the delta exceeds some max value, then avg_idle
gets set to the max, regardless of what the previous avg was. This can cause
avg_idle to often be overestimated.
This patch modifies the way we update avg_idle by always updating it with the
function call to update_avg() first. Then, if avg_idle exceeds the max, we set
it to the max.
Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Mon, 17 Jun 2013 21:00:24 +0800
We cannot compare two load directly from two cpus, since the cpu power
over two cpu may vary largely.
Suppose we meet such two kind of cpus.
CPU A:
No real time work, and there are 3 task, with rq->load.weight
being 512.
CPU B:
Has real time work, and it take 3/4 of the cpu power, which
makes CFS only take 1/4, that is 1024/4=256 cpu power. And over its CFS
runqueue, there is only one task with weight as 128.
Since both cpu's CFS task take for half of the CFS's cpu power, it
should be considered as balanced in such case.
But original judgment like:
if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
(max_nr_running - min_nr_running) > 1)
It makes (512-128)>=((512+128)/4), and lead to imbalance conclusion...
Make the load as scaled, to avoid such case.
Signed-off-by: Lei Wen <leiwen@marvell.com>
modified for Mako kernel from LKML reference
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Mon, 17 Jun 2013 21:00:23 +0800
Since for max_load and this_load, they are the value that already be
scaled. It is not reasonble to get a minimum value between the scaled
and non-scaled value, like below example.
min(sds->busiest_load_per_task, sds->max_load);
Also add comment over in what condition, there would be cpu power gain
in move the load.
Signed-off-by: Lei Wen <leiwen@marvell.com>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Mon, 17 Jun 2013 21:00:22 +0800
Actually all below item could be repalced by scaled_busy_load_per_task
(sds->busiest_load_per_task * SCHED_POWER_SCALE)
/sds->busiest->sgp->power;
Signed-off-by: Lei Wen <leiwen@marvell.com>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Tue, 11 Jun 2013 16:32:45 +0530
sd can't be NULL in init_sched_groups_power() and so checking it for NULL isn't
useful. In case it is required, then also we need to rearrange the code a bit as
we already accessed invalid pointer sd to get sg: sg = sd->groups.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Tue, 11 Jun 2013 16:32:44 +0530
In build_sched_groups() we don't need to call get_group() for cpus which are
already covered in previous iterations. So, call get_group() after checking if
cpu is covered or not.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Tue, 11 Jun 2013 16:32:43 +0530
In the beginning of build_sched_groups() we called sched_domain_span() and
cached its return value in span. Few statements later we are calling it again to
get the same pointer.
Lets use the cached value instead as it hasn't changed in between.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Tue, 4 Jun 2013 16:50:19 +0530
build_sched_domain() never uses parameter struct s_data *d and so passing it is
useless.
Remove it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We are saving first scheduling domain for a cpu in build_sched_domains() by
iterating over the nested sd->child list. We don't actually need to do it this
way.
tl will be equal to sched_domain_topology for the first iteration and so we can
set *per_cpu_ptr(d.sd, i) based on that. So, save pointer to first SD while
running the iteration loop over tl's.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Tue, 4 Jun 2013 16:50:18 +0530
We are saving first scheduling domain for a cpu in build_sched_domains() by
iterating over the nested sd->child list. We don't actually need to do it this
way.
*per_cpu_ptr(d.sd, i) is guaranteed to be NULL in the beginning as we have
called __visit_domain_allocation_hell() which does a memset to zero for struct
s_data.
So, save pointer to first SD while running the iteration loop over tl's.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Wed, 3 Apr 2013 16:20:48 +0300
When using a large number of threads performing AIO operations the
IOCTX list may get a significant number of entries which will cause
significant overhead. For example, when running this fio script:
rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
blocksize=1024; numjobs=512; thread; loops=100
on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
30% CPU time spent by lookup_ioctx:
32.51% [guest.kernel] [g] lookup_ioctx
9.19% [guest.kernel] [g] __lock_acquire.isra.28
4.40% [guest.kernel] [g] lock_release
4.19% [guest.kernel] [g] sched_clock_local
3.86% [guest.kernel] [g] local_clock
3.68% [guest.kernel] [g] native_sched_clock
3.08% [guest.kernel] [g] sched_clock_cpu
2.64% [guest.kernel] [g] lock_release_holdtime.part.11
2.60% [guest.kernel] [g] memcpy
2.33% [guest.kernel] [g] lock_acquired
2.25% [guest.kernel] [g] lock_acquire
1.84% [guest.kernel] [g] do_io_submit
This patchs converts the ioctx list to a radix tree. For a performance
comparison the above FIO script was run on a 2 sockets 8 core
machine. This are the results for the original list based
implementation and for the radix tree based implementation:
cores 1 2 4 8 16 32
list 111025 ms 62219 ms 34193 ms 22998 ms 19335 ms 15956 ms
radix 75400 ms 42668 ms 23923 ms 17206 ms 15820 ms 13295 ms
% of radix
relative 68% 69% 70% 75% 82% 83%
to list
To consider the impact of the patch on the typical case of having
only one ctx per process the following FIO script was run:
rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
blocksize=1024; numjobs=1; thread; loops=100
on the same system and the results are the following:
list 65241 ms
radix 65402 ms
% of radix
relative 100.25%
to list
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
modified for Mako hybrid from LKML
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| | | | |
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Mon, 15 Apr 2013 10:37:59 -0400
If it is confirmed that all the supported architectures can allow a
negative mutex count without incorrect behavior, we can then back
out the architecture specific change and allow the mutex count to
go to any negative number. That should further reduce contention for
non-x86 architecture.
If this is not the case, this patch should be dropped.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Mon, 15 Apr 2013 10:37:58 -0400
The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned
on) allow multiple tasks to spin on a single mutex concurrently. A
potential problem with the current approach is that when the mutex
becomes available, all the spinning tasks will try to acquire the
mutex more or less simultaneously. As a result, there will be a lot of
cacheline bouncing especially on systems with a large number of CPUs.
This patch tries to reduce this kind of contention by putting the
mutex spinners into a queue so that only the first one in the queue
will try to acquire the mutex. This will reduce contention and allow
all the tasks to move forward faster.
The queuing of mutex spinners is done using an MCS lock based
implementation which will further reduce contention on the mutex
cacheline than a similar ticket spinlock based implementation. This
patch will add a new field into the mutex data structure for holding
the MCS lock. This expands the mutex size by 8 bytes for 64-bit system
and 4 bytes for 32-bit system. This overhead will be avoid if the
MUTEX_SPIN_ON_OWNER option is turned off.
The following table shows the jobs per minute (JPM) scalability data
on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
command is used to restrict the running of the fserver workloads to
1/2/4/8 nodes with hyperthreading off.
+-----------------+-----------+-----------+-------------+----------+
| Configuration | Mean JPM | Mean JPM | Mean JPM | % Change |
| | w/o patch | patch 1 | patches 1&2 | 1->1&2 |
+-----------------+------------------------------------------------+
| | User Range 1100 - 2000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 227972 | 227237 | 305043 | +34.2% |
| 4 nodes, HT off | 393503 | 381558 | 394650 | +3.4% |
| 2 nodes, HT off | 334957 | 325240 | 338853 | +4.2% |
| 1 node , HT off | 198141 | 197972 | 198075 | +0.1% |
+-----------------+------------------------------------------------+
| | User Range 200 - 1000 |
+-----------------+------------------------------------------------+
| 8 nodes, HT off | 282325 | 312870 | 332185 | +6.2% |
| 4 nodes, HT off | 390698 | 378279 | 393419 | +4.0% |
| 2 nodes, HT off | 336986 | 326543 | 340260 | +4.2% |
| 1 node , HT off | 197588 | 197622 | 197582 | 0.0% |
+-----------------+-----------+-----------+-------------+----------+
At low user range 10-100, the JPM differences were within +/-1%. So
they are not that interesting.
The fserver workload uses mutex spinning extensively. With just
the mutex change in the first patch, there is no noticeable change
in performance. Rather, there is a slight drop in performance. This
mutex spinning patch more than recovers the lost performance and show
a significant increase of +30% at high user load with the full 8 nodes.
Similar improvements were also seen in a 3.8 kernel.
The table below shows the %time spent by different kernel functions
as reported by perf when running the fserver workload at 1500 users
with all 8 nodes.
+-----------------------+-----------+---------+-------------+
| Function | % time | % time | % time |
| | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed | 34.96% | 34.91% | 29.14% |
| __write_lock_failed | 10.14% | 10.68% | 7.51% |
| mutex_spin_on_owner | 3.62% | 3.42% | 2.33% |
| mspin_lock | N/A | N/A | 9.90% |
| __mutex_lock_slowpath | 1.46% | 0.81% | 0.14% |
| _raw_spin_lock | 2.25% | 2.50% | 1.10% |
+-----------------------+-----------+---------+-------------+
The fserver workload for an 8-node system is dominated by the
contention in the read/write lock. Mutex contention also plays a
role. With the first patch only, mutex contention is down (as shown by
the __mutex_lock_slowpath figure) which help a little bit. We saw only
a few percents improvement with that.
By applying patch 2 as well, the single mutex_spin_on_owner figure is
now split out into an additional mspin_lock figure. The time increases
from 3.42% to 11.23%. It shows a great reduction in contention among
the spinners leading to a 30% improvement. The time ratio 9.9/2.33=4.3
indicates that there are on average 4+ spinners waiting in the spin_lock
loop for each spinner in the mutex_spin_on_owner loop. Contention in
other locking functions also go down by quite a lot.
The table below shows the performance change of both patches 1 & 2 over
patch 1 alone in other AIM7 workloads (at 8 nodes, hyperthreading off).
+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | 0.0% | -0.8% | +0.6% |
| five_sec | -0.3% | +0.8% | +0.8% |
| high_systime | +0.4% | +2.4% | +2.1% |
| new_fserver | +0.1% | +14.1% | +34.2% |
| shared | -0.5% | -0.3% | -0.4% |
| short | -1.7% | -9.8% | -8.3% |
+--------------+---------------+----------------+-----------------+
The short workload is the only one that shows a decline in performance
probably due to the spinner locking and queuing overhead.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Date Mon, 15 Apr 2013 10:37:57 -0400
In the __mutex_lock_common() function, an initial entry into
the lock slow path will cause two atomic_xchg instructions to be
issued. Together with the atomic decrement in the fast path, a total
of three atomic read-modify-write instructions will be issued in
rapid succession. This can cause a lot of cache bouncing when many
tasks are trying to acquire the mutex at the same time.
This patch will reduce the number of atomic_xchg instructions used by
checking the counter value first before issuing the instruction. The
atomic_read() function is just a simple memory read. The atomic_xchg()
function, on the other hand, can be up to 2 order of magnitude or even
more in cost when compared with atomic_read(). By using atomic_read()
to check the value first before calling atomic_xchg(), we can avoid a
lot of unnecessary cache coherency traffic. The only downside with this
change is that a task on the slow path will have a tiny bit
less chance of getting the mutex when competing with another task
in the fast path.
The same is true for the atomic_cmpxchg() function in the
mutex-spin-on-owner loop. So an atomic_read() is also performed before
calling atomic_cmpxchg().
The mutex locking and unlocking code for the x86 architecture can allow
any negative number to be used in the mutex count to indicate that some
tasks are waiting for the mutex. I am not so sure if that is the case
for the other architectures. So the default is to avoid atomic_xchg()
if the count has already been set to -1. For x86, the check is modified
to include all negative numbers to cover a larger case.
The following table shows the jobs per minutes (JPM) scalability data
on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl
command is used to restrict the running of the high_systime workloads
to 1/2/4/8 nodes with hyperthreading on and off.
+-----------------+-----------+------------+----------+
| Configuration | Mean JPM | Mean JPM | % Change |
| | w/o patch | with patch | |
+-----------------+-----------------------------------+
| | User Range 1100 - 2000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 36980 | 148590 | +301.8% |
| 8 nodes, HT off | 42799 | 145011 | +238.8% |
| 4 nodes, HT on | 61318 | 118445 | +51.1% |
| 4 nodes, HT off | 158481 | 158592 | +0.1% |
| 2 nodes, HT on | 180602 | 173967 | -3.7% |
| 2 nodes, HT off | 198409 | 198073 | -0.2% |
| 1 node , HT on | 149042 | 147671 | -0.9% |
| 1 node , HT off | 126036 | 126533 | +0.4% |
+-----------------+-----------------------------------+
| | User Range 200 - 1000 |
+-----------------+-----------------------------------+
| 8 nodes, HT on | 41525 | 122349 | +194.6% |
| 8 nodes, HT off | 49866 | 124032 | +148.7% |
| 4 nodes, HT on | 66409 | 106984 | +61.1% |
| 4 nodes, HT off | 119880 | 130508 | +8.9% |
| 2 nodes, HT on | 138003 | 133948 | -2.9% |
| 2 nodes, HT off | 132792 | 131997 | -0.6% |
| 1 node , HT on | 116593 | 115859 | -0.6% |
| 1 node , HT off | 104499 | 104597 | +0.1% |
+-----------------+------------+-----------+----------+
At low user range 10-100, the JPM differences were within +/-1%. So
they are not that interesting.
AIM7 benchmark run has a pretty large run-to-run variance due to random
nature of the subtests executed. So a difference of less than +-5%
may not be really significant.
This patch improves high_systime workload performance at 4 nodes
and up by maintaining transaction rates without significant drop-off
at high node count. The patch has practically no impact on 1 and 2
nodes system.
The table below shows the percentage time (as reported by perf
record -a -s -g) spent on the __mutex_lock_slowpath() function by
the high_systime workload at 1500 users for 2/4/8-node configurations
with hyperthreading off.
+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
| 8 nodes | 65.34% | 0.69% | -99% |
| 4 nodes | 8.70% | 1.02% | -88% |
| 2 nodes | 0.41% | 0.32% | -22% |
+---------------+-----------------+------------------+---------+
It is obvious that the dramatic performance improvement at 8
nodes was due to the drastic cut in the time spent within the
__mutex_lock_slowpath() function.
The table below show the improvements in other AIM7 workloads (at 8
nodes, hyperthreading off).
+--------------+---------------+----------------+-----------------+
| Workload | mean % change | mean % change | mean % change |
| | 10-100 users | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests | +0.6% | +104.2% | +185.9% |
| five_sec | +1.9% | +0.9% | +0.9% |
| fserver | +1.4% | -7.7% | +5.1% |
| new_fserver | -0.5% | +3.2% | +3.1% |
| shared | +13.1% | +146.1% | +181.5% |
| short | +7.4% | +5.0% | +4.2% |
+--------------+---------------+----------------+-----------------+
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Re-compute time-average nr_running when it is read. This would
prevent reading stalled average value if there were no run-queue
changes for a long time. New average value is returned to the reader,
but not stored to avoid concurrent writes. Light-weight sequential
counter synchronization is used to assure data consistency for
re-computing average.
Change-Id: I8e4ea1b28ea00b3ddaf6ef7cdcd27866f87d360b
Signed-off-by: Alex Frid <afrid@nvidia.com>
(cherry picked from commit 527a759d9b40bf57958eb002edd2bb82014dab99)
Reviewed-on: http://git-master/r/111637
Reviewed-by: Sai Gurrappadi <sgurrappadi@nvidia.com>
Tested-by: Sai Gurrappadi <sgurrappadi@nvidia.com>
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: Peter Boonstoppel <pboonstoppel@nvidia.com>
Reviewed-by: Yu-Huan Hsu <yhsu@nvidia.com>
forward ported to Linux 3.4 for use on Mako
Signed-off-by: faux123 <reioux@gmail.com>
Conflicts:
kernel/sched/sched.h
|
| | | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Compute the time-average number of running tasks per run-queue for a
trailing window of a fixed time period. The detla add/sub to the
average value is weighted by the amount of time per nr_running value
relative to the total measurement period.
Change-Id: I076e24ff4ed65bed3b8dd8d2b279a503318071ff
Signed-off-by: Diwakar Tundlam <dtundlam@nvidia.com>
(cherry picked from commit 3a12d7499cee352e8a46eaf700259ba3c733f0e3)
Reviewed-on: http://git-master/r/111635
Reviewed-by: Automatic_Commit_Validation_User
Reviewed-by: Sai Gurrappadi <sgurrappadi@nvidia.com>
Tested-by: Sai Gurrappadi <sgurrappadi@nvidia.com>
Reviewed-by: Peter Boonstoppel <pboonstoppel@nvidia.com>
Reviewed-by: Yu-Huan Hsu <yhsu@nvidia.com>
forward ported to Linux 3.4 for use on S-800
Signed-off-by: Paul Reioux <reioux@gmail.com>
|
| |/ /
| |
| |
| |
| |
| |
| |
| |
| | |
This is the 3.4.100 stable release
Conflicts:
Makefile
Change-Id: Iff7496cbcc27ba14ab19c6f4ed57ea6e6484fdb6
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
commit 27e35715df54cbc4f2d044f681802ae30479e7fb upstream.
When the rtmutex fast path is enabled the slow unlock function can
create the following situation:
spin_lock(foo->m->wait_lock);
foo->m->owner = NULL;
rt_mutex_lock(foo->m); <-- fast path
free = atomic_dec_and_test(foo->refcnt);
rt_mutex_unlock(foo->m); <-- fast path
if (free)
kfree(foo);
spin_unlock(foo->m->wait_lock); <--- Use after free.
Plug the race by changing the slow unlock to the following scheme:
while (!rt_mutex_has_waiters(m)) {
/* Clear the waiters bit in m->owner */
clear_rt_mutex_waiters(m);
owner = rt_mutex_owner(m);
spin_unlock(m->wait_lock);
if (cmpxchg(m->owner, owner, 0) == owner)
return;
spin_lock(m->wait_lock);
}
So in case of a new waiter incoming while the owner tries the slow
path unlock we have two situations:
unlock(wait_lock);
lock(wait_lock);
cmpxchg(p, owner, 0) == owner
mark_rt_mutex_waiters(lock);
acquire(lock);
Or:
unlock(wait_lock);
lock(wait_lock);
mark_rt_mutex_waiters(lock);
cmpxchg(p, owner, 0) != owner
enqueue_waiter();
unlock(wait_lock);
lock(wait_lock);
wakeup_next waiter();
unlock(wait_lock);
lock(wait_lock);
acquire(lock);
If the fast path is disabled, then the simple
m->owner = NULL;
unlock(m->wait_lock);
is sufficient as all access to m->owner is serialized via
m->wait_lock;
Also document and clarify the wakeup_next_waiter function as suggested
by Oleg Nesterov.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
commit 3d5c9340d1949733eb37616abd15db36aef9a57c upstream.
Even in the case when deadlock detection is not requested by the
caller, we can detect deadlocks. Right now the code stops the lock
chain walk and keeps the waiter enqueued, even on itself. Silly not to
yell when such a scenario is detected and to keep the waiter enqueued.
Return -EDEADLK unconditionally and handle it at the call sites.
The futex calls return -EDEADLK. The non futex ones dequeue the
waiter, throw a warning and put the task into a schedule loop.
Tagged for stable as it makes the code more robust.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brad Mouring <bmouring@ni.com>
Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
commit 82084984383babe728e6e3c9a8e5c46278091315 upstream.
When we walk the lock chain, we drop all locks after each step. So the
lock chain can change under us before we reacquire the locks. That's
harmless in principle as we just follow the wrong lock path. But it
can lead to a false positive in the dead lock detection logic:
T0 holds L0
T0 blocks on L1 held by T1
T1 blocks on L2 held by T2
T2 blocks on L3 held by T3
T4 blocks on L4 held by T4
Now we walk the chain
lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
lock T2 -> adjust T2 -> drop locks
T2 times out and blocks on L0
Now we continue:
lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.
Brad tried to work around that in the deadlock detection logic itself,
but the more I looked at it the less I liked it, because it's crystal
ball magic after the fact.
We actually can detect a chain change very simple:
lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
next_lock = T2->pi_blocked_on->lock;
drop locks
T2 times out and blocks on L0
Now we continue:
lock T2 ->
if (next_lock != T2->pi_blocked_on->lock)
return;
So if we detect that T2 is now blocked on a different lock we stop the
chain walk. That's also correct in the following scenario:
lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
next_lock = T2->pi_blocked_on->lock;
drop locks
T3 times out and drops L3
T2 acquires L3 and blocks on L4 now
Now we continue:
lock T2 ->
if (next_lock != T2->pi_blocked_on->lock)
return;
We don't have to follow up the chain at that point, because T2
propagated our priority up to T4 already.
[ Folded a cleanup patch from peterz ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Brad Mouring <bmouring@ni.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
commit 397335f004f41e5fcf7a795e94eb3ab83411a17c upstream.
The current deadlock detection logic does not work reliably due to the
following early exit path:
/*
* Drop out, when the task has no waiters. Note,
* top_waiter can be NULL, when we are in the deboosting
* mode!
*/
if (top_waiter && (!task_has_pi_waiters(task) ||
top_waiter != task_top_pi_waiter(task)))
goto out_unlock_pi;
So this not only exits when the task has no waiters, it also exits
unconditionally when the current waiter is not the top priority waiter
of the task.
So in a nested locking scenario, it might abort the lock chain walk
and therefor miss a potential deadlock.
Simple fix: Continue the chain walk, when deadlock detection is
enabled.
We also avoid the whole enqueue, if we detect the deadlock right away
(A-A). It's an optimization, but also prevents that another waiter who
comes in after the detection and before the task has undone the damage
observes the situation and detects the deadlock and returns
-EDEADLOCK, which is wrong as the other task is not in a deadlock
situation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
commit 099ed151675cd1d2dbeae1dac697975f6a68716d upstream.
Disabling reading and writing to the trace file should not be able to
disable all function tracing callbacks. There's other users today
(like kprobes and perf). Reading a trace file should not stop those
from happening.
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.
When runing with the kernel(3.15-rc7+), the follow bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66
[ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130
[ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
[ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380
[ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
[ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90
[ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
The cause is that cpuset_mems_allowed() try to take
mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(). So that we can avoid this bug.
This patch is a temporary solution that just addresses the bug
mentioned above, can not fix the long-standing issue about cpuset.mems
rebinding on fork():
"When the forker's task_struct is duplicated (which includes
->mems_allowed) and it races with an update to cpuset_being_rebound
in update_tasks_nodemask() then the task's mems_allowed doesn't get
updated. And the child task's mems_allowed can be wrong if the
cpuset's nodemask changes before the child has been added to the
cgroup's tasklist."
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
| |\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Conflicts:
kernel/events/core.c
net/ipv4/ping.c
Change-Id: Ic359877769a851a4693579e5f0df7555fcfd1461
|
| | |\ \ |
|
| | | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Events with a duplicate constraint are set to state=OFF when detected,
so that their duplicate counts are not read. However, they were not being
cleaned up because the core code only cleaned up ACTIVE events.
This resulted in counters not being freed and eventually running out
of resources.
Clean up the events with state==OFF that were marked that way because of
constraint duplication.
Ensure counts are not updated for OFF events.
Change-Id: If532801c79e6ad6809869eb0a3063774f00c92c3
Signed-off-by: Neil Leeder <nleeder@codeaurora.org>
|
| | |\| | |
|
| | | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When CPUs are hotplugged off and on, let the various
sw events continue after the hotplug.
Change-Id: Id1aaf30c459c9cf7c9c38967f9ccad56d4062fd3
Signed-off-by: Neil Leeder <nleeder@codeaurora.org>
|
| | |\| | |
|
| | | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Keep event list alive across a CPU hotplug so that perf
can resume when the CPU comes back online. Bring a CPU
online when exiting a perf session so it can be cleaned
up properly.
Change-Id: Ie0e4a43f751beb77afdc84e9d52b21780f279d80
Signed-off-by: Neil Leeder <nleeder@codeaurora.org>
|
| | | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We could re-read rq->rt_avg after we validated it was smaller than
total, invalidating the check and resulting in an unintended negative.
Change-Id: I8543974aad539107768e9e513ca3a8c4cb79b2ff
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1337688268.9698.29.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CRs-Fixed: 497236
Git-commit: b654f7de41b0e3903ee2b51d3b8db77fe52ce728
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
|