path: root/kernel
Commit message | Author | Age | Files | Lines
* Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm (flar2, 2017-07-21, 6 files, -517/+0)
|\
| * time: Remove CONFIG_TIMER_STATS (Kees Cook, 2017-07-02, 6 files, -517/+0)
    Currently CONFIG_TIMER_STATS exposes process information across
    namespaces:

    kernel/time/timer_list.c print_timer():

        SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);

    /proc/timer_list:

        #11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570

    Given that the tracer can give the same information, this patch
    entirely removes CONFIG_TIMER_STATS.

    Change-Id: I46f71dd592c2d241aacb1bfe7165c07254bc4298
    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Acked-by: John Stultz <john.stultz@linaro.org>
    Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
    Cc: linux-doc@vger.kernel.org
    Cc: Lai Jiangshan <jiangshanlai@gmail.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Xing Gao <xgao01@email.wm.edu>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Jessica Frazelle <me@jessfraz.com>
    Cc: kernel-hardening@lists.openwall.com
    Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: Petr Mladek <pmladek@suse.com>
    Cc: Richard Cochran <richardcochran@gmail.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Michal Marek <mmarek@suse.com>
    Cc: Josh Poimboeuf <jpoimboe@redhat.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Olof Johansson <olof@lixom.net>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: linux-api@vger.kernel.org
    Cc: Arjan van de Ven <arjan@linux.intel.com>
    Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

    haggertk: Backported to 3.4
    Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm (flar2, 2017-07-03, 2 files, -6/+6)
|\|
| * ring-buffer: Prevent overflow of size in ring_buffer_resize() (Steven Rostedt (Red Hat), 2017-06-26, 1 file, -5/+5)
    If the size passed to ring_buffer_resize() is greater than MAX_LONG -
    BUF_PAGE_SIZE then the DIV_ROUND_UP() will return zero.

    Here's the details:

      # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb

    tracing_entries_write() processes this and converts kb to bytes:

      18014398509481980 << 10 = 18446744073709547520

    and this is passed to ring_buffer_resize() as unsigned long size.

      size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);

    where DIV_ROUND_UP(a, b) is (a + b - 1)/b. BUF_PAGE_SIZE is 4080
    and here

      18446744073709547520 + 4080 - 1 = 18446744073709551599

    where 18446744073709551599 is still smaller than 2^64:

      2^64 - 18446744073709551599 = 17

    But now

      18446744073709551599 / 4080 = 4521260802379792

    and

      size = size * 4080 = 18446744073709551360

    This is checked to make sure it's still greater than 2 * 4080, which
    it is. Then we convert to the number of buffer pages needed:

      nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE)

    but this time size is 18446744073709551360 and

      2^64 - (18446744073709551360 + 4080 - 1) = -3823

    Thus it overflows and the resulting number is less than 4080, which
    makes

      3823 / 4080 = 0

    and nr_pages is set to this. As we already checked against the minimum
    that nr_pages may be, this causes the logic to fail as well, and we
    crash the kernel.

    There's no reason to have the two DIV_ROUND_UP() (that's just the
    result of historical code changes); clean up the code and fix this bug.

    Change-Id: I7744dfdd1c3be9676f767139002b5f57c41d87b2
    Cc: stable@vger.kernel.org # 3.5+
    Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic")
    Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
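    For illustration, the overflow walked through above in a standalone
    userspace sketch (assumes a 64-bit unsigned long; DIV_ROUND_UP copied
    from the kernel macro):

        #include <stdio.h>

        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
        #define BUF_PAGE_SIZE 4080UL

        int main(void)
        {
                /* 18014398509481980 KB << 10, as in the report above */
                unsigned long size = 18014398509481980UL << 10;
                unsigned long pages, nr_pages;

                pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);    /* 4521260802379792 */
                size = pages * BUF_PAGE_SIZE;                 /* 18446744073709551360 */

                /* size + 4079 wraps past 2^64 to 3823, so the quotient is 0 */
                nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
                printf("pages=%lu nr_pages=%lu\n", pages, nr_pages);
                return 0;
        }

    On a 64-bit build this prints nr_pages=0, matching the walkthrough.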
| * cgroup: prefer %pK to %p (Nick Desaulniers, 2017-06-26, 1 file, -1/+1)
    Prevents leaking kernel pointers when using kptr_restrict.

    Bug: 30149174
    Change-Id: I0fa3cd8d4a0d9ea76d085bba6020f1eda073c09b
    Git-repo: https://android.googlesource.com/kernel/msm.git
    Git-commit: 505e48f32f1321ed7cf80d49dd5f31b16da445a8
    Signed-off-by: Srinivasa Rao Kuppala <srkupp@codeaurora.org>
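    A minimal module-style sketch of what the %p to %pK change buys
    (hypothetical demo module, not the patched cgroup code): with
    kptr_restrict set, %pK prints zeros to unprivileged readers while %p
    leaks the raw address.

        #include <linux/module.h>
        #include <linux/kernel.h>

        static int demo_value;

        static int __init pk_demo_init(void)
        {
                /* %p prints the raw kernel address; %pK honours
                 * kptr_restrict and shows all zeros without CAP_SYSLOG */
                pr_info("raw: %p  restricted: %pK\n", &demo_value, &demo_value);
                return 0;
        }

        static void __exit pk_demo_exit(void)
        {
        }

        module_init(pk_demo_init);
        module_exit(pk_demo_exit);
        MODULE_LICENSE("GPL");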
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm (flar2, 2017-06-12, 1 file, -0/+7)
|\|
| * perf: don't leave group_entry on sibling list (use-after-free) (John Dias, 2017-06-07, 1 file, -0/+7)
    When perf_group_detach is called on a group leader, it should empty
    its sibling list. Otherwise, when a sibling is later deallocated,
    list_del_event() removes the sibling's group_entry from its current
    list, which can be the now-deallocated group leader's sibling list
    (use-after-free bug).

    Bug: 32402548
    Change-Id: I99f6bc97c8518df1cb0035814368012ba72ab1f1
    Signed-off-by: John Dias <joaodias@google.com>
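    A hedged sketch of the kind of fix described, modelled on perf's
    sibling-promotion loop (simplified, not the verbatim patch): each
    sibling is taken off the dying leader's list rather than left with a
    group_entry pointing into freed memory.

        list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
                if (list)
                        list_move_tail(&sibling->group_entry, list);
                else
                        list_del_init(&sibling->group_entry);  /* never leave it dangling */
                sibling->group_leader = sibling;
        }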
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm (flar2, 2017-06-05, 5 files, -3/+53)
|\|
| * kernel: Only expose su when daemon is running (Tom Marshall, 2017-05-19, 3 files, -0/+39)
    Note: this is for the 3.4 kernel

    It has been claimed that the PG implementation of 'su' has security
    vulnerabilities even when disabled. Unfortunately, the people that
    find these vulnerabilities often like to keep them private so they can
    profit from exploits while leaving users exposed to malicious hackers.

    In order to reduce the attack surface for vulnerabilities, it is
    therefore necessary to make 'su' completely inaccessible when it is
    not in use (except by the root and system users).

    Change-Id: Ia7d50ba46c3d932c2b0ca5fc8e9ec69ec9045f85
| * trace: resolve stack corruption due to string copy (Amey Telawane, 2017-05-01, 1 file, -1/+1)
    Strcpy has no limit on the string being copied, which causes stack
    corruption leading to kernel panic. Use strlcpy to resolve the issue
    by providing the length of the string to be copied.

    CRs-fixed: 1048480
    CAF-Change-Id: Ib290b25f7e0ff96927b8530e5c078869441d409f
    Signed-off-by: Amey Telawane <ameyt@codeaurora.org>

    CVE-2017-0605
    Change-Id: I300bf476a38a15d515a2e1d795a53650b209a701
    (cherry picked from commit 2161ae9a70b12cf18ac8e5952a20161ffbccb477)
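    The shape of the fix, as a sketch (identifiers assumed, not the
    verbatim patch):

        char comm[TASK_COMM_LEN];

        /* before: strcpy(comm, src) can overrun comm[] when src is longer
         * than the destination; strlcpy bounds the copy and always
         * NUL-terminates */
        strlcpy(comm, src, TASK_COMM_LEN);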
| * perf: Tighten (and fix) the grouping condition (Peter Zijlstra, 2017-05-01, 1 file, -2/+13)
    The fix from 9fc81d87420d ("perf: Fix events installation during
    moving group") was incomplete in that it failed to recognise that
    creating a group with events for different CPUs is semantically
    broken -- they cannot be co-scheduled.

    Furthermore, it leads to real breakage where, when we create an event
    for CPU Y and then migrate it to form a group on CPU X, the code gets
    confused where the counter is programmed -- triggered in practice as
    well by me via the perf fuzzer.

    Fix this by tightening the rules for creating groups. Only allow
    grouping of counters that can be co-scheduled in the same context.
    This means for the same task and/or the same cpu.

    Fixes: 9fc81d87420d ("perf: Fix events installation during moving group")
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/20150123125834.090683288@infradead.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    CVE-2015-9004
    Change-Id: I8775ff596bbf97f6eeaaa679c319618fb7a9c639
    (cherry picked from commit c3c87e770458aa004bd7ed3f29945ff436fd6511)
* | Adjust Makefiles (flar2, 2016-11-22, 2 files, -10/+3)
* | mm: remove swap token code (Rik van Riel, 2016-11-22, 1 file, -9/+0)
    The swap token code no longer fits in with the current VM model. It
    does not play well with cgroups or the better NUMA placement code in
    development, since we have only one swap token globally.

    It also has the potential to mess with scalability of the system, by
    increasing the number of non-reclaimable pages on the active and
    inactive anon LRU lists.

    Last but not least, the swap token code has been broken for a year
    without complaints, as reported by Konstantin Khlebnikov. This
    suggests we no longer have much use for it.

    The days of sub-1G memory systems with heavy use of swap are over. If
    we ever need thrashing reducing code in the future, we will have to
    implement something that does scale.

    Change-Id: I6d287cfc3c3206ca24da2de0c1392e5fdfcfabe8
    Signed-off-by: Rik van Riel <riel@redhat.com>
    Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Cc: Hugh Dickins <hughd@google.com>
    Acked-by: Bob Picco <bpicco@meloft.net>
    Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Git-commit: e709ffd6169ccd259eb5874e853303e91e94e829
    Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
    Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
    Signed-off-by: flar2 <asegaert@gmail.com>
* | Implement kexec-hardboot (Vojtech Bocek, 2016-11-22, 1 file, -0/+4)
    "Allows hard booting (i.e., with a full hardware reboot) to a kernel
    previously loaded in memory by kexec. This works around the problem of
    soft-booted kernel hangs due to improper device shutdown and/or
    reinitialization." More info in /arch/arm/Kconfig.

    Original author: Mike Kasick <mike@kasick.org>

    Vojtech Bocek <vbocek@gmail.com>:
      I've ported it to flo, it is based off my grouper port, which is
      based off Asus TF201 patches ported by Jens Andersen
      <jens.andersen@gmail.com>.

      I've moved atags copying from guest to the host kernel, which means
      there is no need to patch the guest kernel, assuming the --mem-min
      in the kexec call is within the first 256MB of System RAM, otherwise
      it will take a long time to load. I've also fixed the /proc/atags
      entry, which would give the kexec-tools userspace binary only the
      first 1024 bytes of atags; see arch/arm/kernel/atags.c for more
      details.

      Other than that, memory-reservation code for the hardboot page and
      some assembler to do the watchdog reset on MSM chip are new for this
      device.

    ayysir <dresadd09691@gmail.com>:
      kexec: use mem_text_write_kernel_word to set reboot_code_buffer args
      in order to avoid protection faults (writes to read-only kernel
      memory) when CONFIG_STRICT_MEMORY_RWX is enabled.

    Signed-off-by: Vojtech Bocek <vbocek@gmail.com>
    Signed-off-by: flar2 <asegaert@gmail.com>
* | update ARM topology and cpu_power driver (flar2, 2016-11-22, 1 file, -1/+1)
|/
    Signed-off-by: flar2 <asegaert@gmail.com>
* BACKPORT: audit: fix a double fetch in audit_log_single_execve_arg() (Paul Moore, 2016-11-11, 1 file, -170/+167)
    (cherry picked from commit 43761473c254b45883a64441dd0bc85a42f3645c)

    There is a double fetch problem in audit_log_single_execve_arg() where
    we first check the execve(2) arguments for any "bad" characters which
    would require hex encoding and then re-fetch the arguments for logging
    in the audit record[1]. Of course this leaves a window of opportunity
    for an unsavory application to munge with the data.

    This patch reworks things by only fetching the argument data once[2]
    into a buffer where it is scanned and logged into the audit record(s).
    In addition to fixing the double fetch, this patch improves on the
    original code in a few other ways: better handling of large arguments
    which require encoding, stricter record length checking, and some
    performance improvements (completely unverified, but we got rid of
    some strlen() calls, that's got to be a good thing).

    As part of the development of this patch, I've also created a basic
    regression test for the audit-testsuite; the test can be tracked on
    GitHub at the following link:

      * https://github.com/linux-audit/audit-testsuite/issues/25

    [1] If you pay careful attention, there is actually a triple fetch
    problem due to a strnlen_user() call at the top of the function.

    [2] This is a tiny white lie, we do make a call to strnlen_user()
    prior to fetching the argument data. I don't like it, but due to the
    way the audit record is structured we really have no choice unless we
    copy the entire argument at once (which would require a rather
    wasteful allocation). The good news is that with this patch the kernel
    no longer relies on this strnlen_user() value for anything beyond
    recording it in the log; we also update it with a trustworthy value
    whenever possible.

    Reported-by: Pengfei Wang <wpengfeinudt@gmail.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Paul Moore <paul@paul-moore.com>
    Change-Id: I10e979e94605e3cf8d461e3e521f8f9837228aa5
    Bug: 30956807
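    The generic double-fetch shape being removed, sketched with an assumed
    helper (not the audit code itself): user memory is read twice, and a
    racing thread can change it between the two reads.

        /* fetch #1: scan the argument to decide whether hex encoding is
         * needed (scan_for_bad_chars is a hypothetical helper) */
        if (copy_from_user(buf, p, len))
                return -EFAULT;
        need_hex = scan_for_bad_chars(buf, len);

        /* fetch #2: re-read the same user memory for logging; by now a
         * racing thread may have rewritten it, bypassing the scan above */
        if (copy_from_user(buf, p, len))
                return -EFAULT;

    Fetching once into a kernel buffer, as this patch does, makes the
    scanned bytes and the logged bytes the same bytes.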
* perf: Fix race in swevent hash (Peter Zijlstra, 2016-11-11, 1 file, -7/+0)
    There's a race on CPU unplug where we free the swevent hash array
    while it can still have events on. This will result in a
    use-after-free which is BAD.

    Simply do not free the hash array on unplug. This leaves the thing
    around and no use-after-free takes place. When the last swevent dies,
    we do a for_each_possible_cpu() iteration anyway to clean these up, at
    which time we'll free it, so no leakage will occur.

    Change-Id: I751faf3215bbdaa6b6358f3a752bdd24126cfa0b
    Reported-by: Sasha Levin <sasha.levin@oracle.com>
    Tested-by: Sasha Levin <sasha.levin@oracle.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Frederic Weisbecker <fweisbec@gmail.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
* FROMLIST: mm: mmap: Add new /proc tunable for mmap_base ASLR. (dcashman, 2016-10-29, 1 file, -0/+22)
    (cherry picked from commit https://lkml.org/lkml/2015/12/21/337)

    ASLR only uses as few as 8 bits to generate the random offset for the
    mmap base address on 32 bit architectures. This value was chosen to
    prevent a poorly chosen value from dividing the address space in such
    a way as to prevent large allocations. This may not be an issue on all
    platforms. Allow the specification of a minimum number of bits so that
    platforms desiring greater ASLR protection may determine where to
    place the trade-off.

    Bug: 24047224
    Signed-off-by: Daniel Cashman <dcashman@android.com>
    Signed-off-by: Daniel Cashman <dcashman@google.com>
    Change-Id: Ic74424e07710cd9ccb4a02871a829d14ef0cc4bc
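    A hedged sketch of how such a tunable is typically wired into sysctl
    (field values assumed, not the verbatim patch); extra1/extra2 clamp
    the writable range so the offset can't be set degenerately low or
    high:

        static struct ctl_table demo_mmap_table[] = {
                {
                        .procname     = "mmap_rnd_bits",
                        .data         = &mmap_rnd_bits,
                        .maxlen       = sizeof(mmap_rnd_bits),
                        .mode         = 0600,
                        .proc_handler = proc_dointvec_minmax,
                        .extra1       = &mmap_rnd_bits_min,  /* clamp range */
                        .extra2       = &mmap_rnd_bits_max,
                },
                { }
        };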
* cpu: Handle smpboot_unpark_threads() uniformly (Paul E. McKenney, 2016-10-29, 1 file, -1/+1)
    Commit 00df35f99191 (cpu: Defer smpboot kthread unparking until CPU
    known to scheduler) put the online path's call to
    smpboot_unpark_threads() into a CPU-hotplug notifier. This commit
    places the offline-failure path's call into the same notifier for the
    sake of uniformity.

    Note that it is not currently possible to place the offline path's
    call to smpboot_park_threads() into an existing notifier because the
    CPU_DYING notifiers run in a restricted environment, and the
    CPU_UP_PREPARE notifiers run too soon.

    Change-Id: I27fff8de4ec3f0193c1c9cf9ccb5701d08faa1ca
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
* cpu: Defer smpboot kthread unparking until CPU known to scheduler (Paul E. McKenney, 2016-10-29, 1 file, -3/+31)
    Currently, smpboot_unpark_threads() is invoked before the incoming CPU
    has been added to the scheduler's runqueue structures. This might
    potentially cause the unparked kthread to run on the wrong CPU, since
    the correct CPU isn't fully set up yet.

    That causes a sporadic, hard to debug boot crash triggering on some
    systems, reported by Borislav Petkov, and bisected down to:

      2a442c9c6453 ("x86: Use common outgoing-CPU-notification code")

    This patch places smpboot_unpark_threads() in a CPU hotplug notifier
    with priority set so that these kthreads are unparked just after the
    CPU has been added to the runqueues.

    Change-Id: I8921987de9c2a2f475cc63dc82662d6ebf6e8725
    Reported-and-tested-by: Borislav Petkov <bp@suse.de>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Git-commit: 00df35f991914db6b8bde8cf09808e19a9cffc3d
    Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
* smpboot: use kmemleak_not_leak for smpboot_thread_data (Vignesh Radhakrishnan, 2016-10-29, 1 file, -0/+3)
    Kmemleak reports the following memory leak:

      [<ffffffc0002faef8>] create_object+0x140/0x274
      [<ffffffc000cc3598>] kmemleak_alloc+0x80/0xbc
      [<ffffffc0002f707c>] kmem_cache_alloc_trace+0x148/0x1d8
      [<ffffffc00024504c>] __smpboot_create_thread.part.2+0x2c/0xec
      [<ffffffc0002452b4>] smpboot_register_percpu_thread+0x90/0x118
      [<ffffffc0016067c0>] spawn_ksoftirqd+0x1c/0x30
      [<ffffffc000200824>] do_one_initcall+0xb0/0x14c
      [<ffffffc001600820>] kernel_init_freeable+0x84/0x1e0
      [<ffffffc000cc273c>] kernel_init+0x10/0xcc
      [<ffffffc000203bbc>] ret_from_fork+0xc/0x50

    The memory allocated here points to smpboot_thread_data, which is used
    as an argument for the kthread and will be used when smpboot_thread_fn
    runs. Therefore, it is not a leak.

    Call kmemleak_not_leak for the smpboot_thread_data pointer to ensure
    that kmemleak doesn't report it as a memory leak.

    Change-Id: I02b0a7debea3907b606856e069d63d7991b67cd9
    Signed-off-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org>
    Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
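    The fix is essentially one annotation at the allocation site (sketch):

        td = kzalloc(sizeof(*td), GFP_KERNEL);
        if (!td)
                return -ENOMEM;
        /* td is handed to the kthread as its argument, so the reference
         * is intentional; stop kmemleak from flagging a false positive */
        kmemleak_not_leak(td);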
* smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread() (Lai Jiangshan, 2016-10-29, 1 file, -0/+2)
    The following race exists in the smpboot percpu threads management:

      CPU0                          CPU1
      cpu_up(2)
        get_online_cpus();
        smpboot_create_threads(2);
                                    smpboot_register_percpu_thread();
                                    for_each_online_cpu();
                                      __smpboot_create_thread();
        __cpu_up(2);

    This results in a missing per cpu thread for the newly onlined cpu2
    and in a NULL pointer dereference on a consecutive offline of that
    cpu.

    Protect smpboot_register_percpu_thread() with get_online_cpus() to
    prevent that.

    [ tglx: Massaged changelog and removed the change in
      smpboot_unregister_percpu_thread() because that's an optimization
      and therefor not stable material. ]

    Change-Id: I8c92a64bf35c3e77c8dd81761e9c8f71b2f94817
    Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1406777421-12830-1-git-send-email-laijs@cn.fujitsu.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
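    The shape of the fix (sketch): hold the hotplug read lock across the
    walk so no CPU can come online between for_each_online_cpu() and
    thread creation.

        get_online_cpus();
        for_each_online_cpu(cpu) {
                ret = __smpboot_create_thread(plug_thread, cpu);
                if (ret)
                        break;
                smpboot_unpark_thread(plug_thread, cpu);
        }
        put_online_cpus();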
* kthread: Fix the race condition when kthread is parked (Subbaraman Narayanamurthy, 2016-10-29, 1 file, -0/+2)
    While stressing the CPU hotplug path, sometimes we hit a problem as
    shown below.

      [57056.416774] ------------[ cut here ]------------
      [57056.489232] ksoftirqd/1 (14): undefined instruction: pc=c01931e8
      [57056.489245] Code: e594a000 eb085236 e15a0000 0a000000 (e7f001f2)
      [57056.489259] ------------[ cut here ]------------
      [57056.492840] kernel BUG at kernel/kernel/smpboot.c:134!
      [57056.513236] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
      [57056.519055] Modules linked in: wlan(O) mhi(O)
      [57056.523394] CPU: 0 PID: 14 Comm: ksoftirqd/1 Tainted: G W O 3.10.0-g3677c61-00008-g180c060 #1
      [57056.532595] task: f0c8b000 ti: f0e78000 task.ti: f0e78000
      [57056.537991] PC is at smpboot_thread_fn+0x124/0x218
      [57056.542750] LR is at smpboot_thread_fn+0x11c/0x218
      [57056.547528] pc : [<c01931e8>] lr : [<c01931e0>] psr: 200f0013
      [57056.547528] sp : f0e79f30 ip : 00000000 fp : 00000000
      [57056.558983] r10: 00000001 r9 : 00000000 r8 : f0e78000
      [57056.564192] r7 : 00000001 r6 : c1195758 r5 : f0e78000 r4 : f0e5fd00
      [57056.570701] r3 : 00000001 r2 : f0e79f20 r1 : 00000000 r0 : 00000000

    This issue was always seen in the context of "ksoftirqd". It seems to
    be happening because of a potential race condition in __kthread_parkme
    where, just after completing the parked completion and before the
    ksoftirqd task has been scheduled again, it can go into the running
    state.

    Fix this by waiting for the task state to become parked after waiting
    for the parked completion.

    CRs-Fixed: 659674
    Change-Id: If3f0e9b706eeb5d30d5a32f84378d35bb03fe794
    Signed-off-by: Subbaraman Narayanamurthy <subbaram@codeaurora.org>
* kthread: Prevent unpark race which puts threads on the wrong cpu (Thomas Gleixner, 2016-10-29, 2 files, -26/+40)
    The smpboot threads rely on the park/unpark mechanism which binds per
    cpu threads on a particular core. Though the functionality is racy:

      CPU0                    CPU1                                CPU2
      unpark(T)
        wake_up_process(T)
                              clear(SHOULD_PARK)
      T runs
                              leave parkme() due to !SHOULD_PARK
        bind_to(CPU2)
                              BUG_ON(wrong CPU)

    We cannot let the tasks move themself to the target CPU as one of
    those tasks is actually the migration thread itself, which requires
    that it starts running on the target cpu right away.

    The solution to this problem is to prevent wakeups in park mode which
    are not from unpark(). That way we can guarantee that the association
    of the task to the target cpu is working correctly.

    Add a new task state (TASK_PARKED) which prevents other wakeups and
    use this state explicitly for the unpark wakeup.

    Peter noticed: Also, since the task state is visible to userspace and
    all the parked tasks are still in the PID space, its a good hint in ps
    and friends that these tasks aren't really there for the moment.

    The migration thread has another related issue.

      CPU0                    CPU1
      Bring up CPU2
      create_thread(T)
      park(T)
        wait_for_completion()
                              parkme()
                                complete()
      sched_set_stop_task()
                              schedule(TASK_PARKED)

    The sched_set_stop_task() call is issued while the task is on the
    runqueue of CPU1 and that confuses the hell out of the stop_task class
    on that cpu. So we need the same synchronization before
    sched_set_stop_task().

    Change-Id: I9ad6fbe65992ad5b5cb9a252470a56ec51a4ff4f
    Reported-by: Dave Jones <davej@redhat.com>
    Reported-and-tested-by: Dave Hansen <dave@sr71.net>
    Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
    Acked-by: Peter Ziljstra <peterz@infradead.org>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: dhillf@gmail.com
    Cc: Ingo Molnar <mingo@kernel.org>
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* stop_machine: Mark per cpu stopper enabled early (Thomas Gleixner, 2016-10-29, 2 files, -1/+3)
    commit 14e568e78 (stop_machine: Use smpboot threads) introduced the
    following regression:

    Before this commit the stopper enabled bit was set in the online
    notifier.

      CPU0                          CPU1
      cpu_up
                                    cpu online
      hotplug_notifier(ONLINE)
        stopper(CPU1)->enabled = true;
      ...
      stop_machine()

    The conversion to smpboot threads moved the enablement to the wakeup
    path of the parked thread. The majority of users seem to have the
    following working order:

      CPU0                          CPU1
      cpu_up
                                    cpu online
      unpark_threads()
        wakeup(stopper[CPU1])
                                    ....
                                    stopper thread runs
                                      stopper(CPU1)->enabled = true;
      stop_machine()

    But Konrad and Sander have observed:

      CPU0                          CPU1
      cpu_up
                                    cpu online
      unpark_threads()
        wakeup(stopper[CPU1])
      ....
      stop_machine()
                                    stopper thread runs
                                      stopper(CPU1)->enabled = true;

    Now the stop machinery kicks CPU0 into the stop loop, where it gets
    stuck forever because the queue code saw stopper(CPU1)->enabled ==
    false, so CPU0 waits for CPU1 to enter stomp_machine, but the CPU1
    stopper work got discarded due to enabled == false.

    Add a pre_unpark function to the smpboot thread descriptor and call it
    before waking the thread.

    This fixes the problem at hand, but the stop_machine code should be
    more robust. The stopper->enabled flag smells fishy at best.

    Thanks to Konrad for going through a loop of debug patches and
    providing the information to decode this issue.

    Change-Id: I636875cf71ea5c5315eb0eb8599a8ebb9eadabf8
    Reported-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1302261843240.22263@ionos
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* stop_machine: Use smpboot threads (Thomas Gleixner, 2016-10-29, 2 files, -86/+52)
    Use the smpboot thread infrastructure. Mark the stopper thread
    selfparking and park it after it has finished the take_cpu_down()
    work.

    Change-Id: If478c40c32a6a1c41c6f73da422d0f2401ee8d17
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Arjan van de Veen <arjan@infradead.org>
    Cc: Paul Turner <pjt@google.com>
    Cc: Richard Weinberger <rw@linutronix.de>
    Cc: Magnus Damm <magnus.damm@gmail.com>
    Link: http://lkml.kernel.org/r/20130131120741.686315164@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* stop_machine: Store task reference in a separate per cpu variable (Thomas Gleixner, 2016-10-29, 1 file, -16/+16)
    To allow the stopper thread to be managed by the smpboot thread
    infrastructure, separate out the task storage from the stopper data
    structure.

    Change-Id: I5270d00c93125618ddde65c1d753c9623740d184
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Arjan van de Veen <arjan@infradead.org>
    Cc: Paul Turner <pjt@google.com>
    Cc: Richard Weinberger <rw@linutronix.de>
    Cc: Magnus Damm <magnus.damm@gmail.com>
    Link: http://lkml.kernel.org/r/20130131120741.626690384@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* smpboot: Allow selfparking per cpu threads (Thomas Gleixner, 2016-10-29, 1 file, -2/+3)
    The stop machine threads are still killed when a cpu goes offline. The
    reason is that the thread is used to bring the cpu down, so it can't
    be parked along with the other per cpu threads.

    Allow a per cpu thread to be excluded from automatic parking, so it
    can park itself once it's done. Add a create callback function as
    well.

    Change-Id: I6c7496b9da7984cfd513d2e7ee681f0df3206c26
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Arjan van de Veen <arjan@infradead.org>
    Cc: Paul Turner <pjt@google.com>
    Cc: Richard Weinberger <rw@linutronix.de>
    Cc: Magnus Damm <magnus.damm@gmail.com>
    Link: http://lkml.kernel.org/r/20130131120741.553993267@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* softirq: Use hotplug thread infrastructure (Thomas Gleixner, 2016-10-29, 1 file, -84/+27)
    [ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]

    Change-Id: I87f9f3a3fb1856f0d5c5712be39b4c2c6ecd468f
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Link: http://lkml.kernel.org/r/20120716103948.456416747@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* rcu: Use smp_hotplug_thread facility for RCUs per-CPU kthread (Paul E. McKenney, 2016-10-29, 4 files, -177/+41)
    Bring RCU into the new-age CPU-hotplug fold by modifying RCU's per-CPU
    kthread code to use the new smp_hotplug_thread facility.

    [ tglx: Adapted it to use callbacks and to the simplified rcu yield ]

    Change-Id: I55d4c5448eb4ea05debb75c3442316ce45728647
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Link: http://lkml.kernel.org/r/20120716103948.673354828@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* hotplug: Fix UP bug in smpboot hotplug code (Paul E. McKenney, 2016-10-29, 2 files, -2/+5)
    Because kernel subsystems need their per-CPU kthreads on UP systems as
    well as on SMP systems, the smpboot hotplug kthread functions must be
    provided in UP builds as well as in SMP builds. This commit therefore
    adds smpboot.c to UP builds and excludes irrelevant code via #ifdef.

    Change-Id: Idaaa4943bd35d389ad6e9e4bd807ae2c067c1931
    Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* smpboot: Provide infrastructure for percpu hotplug threads (Thomas Gleixner, 2016-10-29, 3 files, -1/+242)
    Provide a generic interface for setting up and tearing down percpu
    threads. On registration the threads for already online cpus are
    created and started. On deregistration (modules) the threads are
    stopped. During hotplug operations the threads are created, started,
    parked and unparked.

    The datastructure for registration provides a pointer to percpu
    storage space and optional setup, cleanup, park, unpark functions.
    These functions are called when the thread state changes. Each
    implementation has to provide a function which is queried and returns
    whether the thread should run and the thread function itself.

    The core code handles all state transitions and avoids duplicated code
    in the call sites. A registration sketch follows below.

    [ paulmck: Preemption leak fix ]

    Change-Id: Ibd0993d9e7f95c47aee75836632b2cb950aa777c
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Link: http://lkml.kernel.org/r/20120716103948.352501068@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
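    Registration then looks roughly like this (sketch modelled on the
    ksoftirqd conversion later in this series): the descriptor points at
    per-cpu task storage plus the two mandatory callbacks, and a single
    call wires the thread up for every online CPU and all future hotplug
    transitions.

        static struct smp_hotplug_thread softirq_threads = {
                .store             = &ksoftirqd,          /* per-cpu task pointer */
                .thread_should_run = ksoftirqd_should_run,
                .thread_fn         = run_ksoftirqd,
                .thread_comm       = "ksoftirqd/%u",
        };

        static __init int spawn_ksoftirqd(void)
        {
                return smpboot_register_percpu_thread(&softirq_threads);
        }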
* kthread: Implement park/unpark facility (Thomas Gleixner, 2016-10-29, 1 file, -19/+166)
    To avoid the full teardown/setup of per cpu kthreads in the case of
    cpu hot(un)plug, provide a facility which allows to put the kthread
    into a park position and unpark it when the cpu comes online again.

    Change-Id: Id76d714f0ca35744665f74bce4a8e9310abdb9ac
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Namhyung Kim <namhyung@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Link: http://lkml.kernel.org/r/20120716103948.236618824@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
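    A hedged usage sketch of the new facility (thread_fn, data and cpu are
    assumptions, not part of the patch):

        struct task_struct *t;

        t = kthread_create(thread_fn, data, "demo/%u", cpu);
        if (IS_ERR(t))
                return PTR_ERR(t);

        kthread_park(t);   /* returns once the thread sits in the parked state */
        /* ... cpu goes offline and comes back online ... */
        kthread_unpark(t); /* same thread runs again; no teardown/setup cycle */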
* rcu: Yield simpler (Thomas Gleixner, 2016-10-29, 3 files, -184/+41)
    The rcu_yield() code is amazing. It's there to avoid starvation of the
    system when lots of (boosting) work is to be done.

    Now looking at the code its functionality is:

      Make the thread SCHED_OTHER and very nice, i.e. get it out of the way
      Arm a timer with 2 ticks
      schedule()

    Now if the system goes idle the rcu task returns, regains SCHED_FIFO
    and plugs on. If the system stays busy the timer fires and wakes a per
    node kthread which in turn makes the per cpu thread SCHED_FIFO and
    brings it back on the cpu.

    For the boosting thread the "make it FIFO" bit is missing and it just
    runs some magic boost checks. Now this is a lot of code with extra
    threads and complexity.

    It's way simpler to let the tasks, when they detect overload, schedule
    away for 2 ticks and defer the normal wakeup as long as they are in
    yielded state and the cpu is not idle. That solves the same problem
    and the only difference is that when the cpu goes idle it's not
    guaranteed that the thread returns right away, but it won't be longer
    out than two ticks, so no harm is done. If that's an issue then it is
    way simpler just to wake the task from idle as RCU has callbacks there
    anyway.

    Change-Id: I82e134c4dd765b6b01eeb099cdabd9325e2e049a
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Namhyung Kim <namhyung@kernel.org>
    Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Link: http://lkml.kernel.org/r/20120716103948.131256723@linutronix.de
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* smpboot: Remove leftover declaration (Thomas Gleixner, 2016-10-29, 1 file, -2/+0)
    Change-Id: I53a1ca8e0dc23f546bb3886b9d456ca3fc195693
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* smpboot, idle: Fix comment mismatch over idle_threads_init() (Srivatsa S. Bhat, 2016-10-29, 1 file, -4/+7)
    The comment over idle_threads_init() really talks about the
    functionality of idle_init(). Move that comment to idle_init(), and
    add a suitable comment over idle_threads_init().

    Change-Id: Ib4fa008f4a6154f7234728efda18bf61ced93206
    Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: suresh.b.siddha@intel.com
    Cc: venki@google.com
    Cc: nikunj@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/20120524151100.2549.66501.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init() (Srivatsa S. Bhat, 2016-10-29, 1 file, -2/+4)
    While trying to initialize idle threads for all cpus,
    idle_threads_init() calls smp_processor_id() in a loop, which is
    unnecessary. The intent is to initialize idle threads for all non-boot
    cpus. So just use a variable to note the boot cpu and use it in the
    loop.

    Change-Id: If7afecebd4c714329d1b48a803980eec927532c8
    Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: suresh.b.siddha@intel.com
    Cc: venki@google.com
    Cc: nikunj@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/20120524151055.2549.64309.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
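    The resulting loop, per the description above (sketch):

        void __init idle_threads_init(void)
        {
                unsigned int cpu, boot_cpu;

                boot_cpu = smp_processor_id();  /* read once, not per iteration */

                for_each_possible_cpu(cpu) {
                        if (cpu != boot_cpu)
                                idle_init(cpu);
                }
        }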
* smp, idle: Allocate idle thread for each possible cpu during boot (Suresh Siddha, 2016-10-29, 4 files, -56/+31)
    percpu areas are already allocated during boot for each possible cpu.
    percpu idle threads can be considered as an extension of the percpu
    areas, and allocate them for each possible cpu during boot.

    This will eliminate the need for workqueue based idle thread
    allocation. In future we can move the idle thread area into the percpu
    area too.

    [ tglx: Moved the loop into smpboot.c and added an error check when
      the init code failed to allocate an idle thread for a cpu which
      should be onlined ]

    Change-Id: Iff19a6a5eb339531336bee82aee04fe6b55c385b
    Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: venki@google.com
    Link: http://lkml.kernel.org/r/1334966930.28674.245.camel@sbsiddha-desk.sc.intel.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* smp: Provide generic idle thread allocation (Thomas Gleixner, 2016-10-29, 4 files, -2/+96)
    All SMP architectures have magic to fork the idle task and to store it
    for reusage when cpu hotplug is enabled. Provide a generic
    infrastructure for it.

    Create/reinit the idle thread for the cpu which is brought up in the
    generic code and hand the thread pointer to the architecture code via
    __cpu_up().

    Note, that fork_idle() is called via a workqueue, because this
    guarantees that the idle thread does not get a reference to a user
    space VM. This can happen when the boot process did not bring up all
    possible cpus and a later cpu_up() is initiated via the sysfs
    interface. In that case fork_idle() would be called in the context of
    the user space task and take a reference on the user space VM.

    Change-Id: Ie46c038970876da1f5e31c77533a2743cad31f43
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Russell King <linux@arm.linux.org.uk>
    Cc: Mike Frysinger <vapier@gentoo.org>
    Cc: Jesper Nilsson <jesper.nilsson@axis.com>
    Cc: Richard Kuo <rkuo@codeaurora.org>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Hirokazu Takata <takata@linux-m32r.org>
    Cc: Ralf Baechle <ralf@linux-mips.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: James E.J. Bottomley <jejb@parisc-linux.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Cc: Paul Mundt <lethal@linux-sh.org>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Chris Metcalf <cmetcalf@tilera.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: x86@kernel.org
    Acked-by: Venkatesh Pallipadi <venki@google.com>
    Link: http://lkml.kernel.org/r/20120420124557.102478630@linutronix.de
* smp: Add generic smpboot facility (Thomas Gleixner, 2016-10-29, 4 files, -0/+29)
    Start a new file, which will hold SMP and CPU hotplug related generic
    infrastructure.

    Change-Id: I7eb92936558bfd48298b6546dc1d19d1542daac6
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Russell King <linux@arm.linux.org.uk>
    Cc: Mike Frysinger <vapier@gentoo.org>
    Cc: Jesper Nilsson <jesper.nilsson@axis.com>
    Cc: Richard Kuo <rkuo@codeaurora.org>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Hirokazu Takata <takata@linux-m32r.org>
    Cc: Ralf Baechle <ralf@linux-mips.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: James E.J. Bottomley <jejb@parisc-linux.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Cc: Paul Mundt <lethal@linux-sh.org>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Chris Metcalf <cmetcalf@tilera.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: x86@kernel.org
    Link: http://lkml.kernel.org/r/20120420124557.035417523@linutronix.de
* smp: Add task_struct argument to __cpu_up() (Thomas Gleixner, 2016-10-29, 1 file, -1/+1)
    Preparatory patch to make the idle thread allocation for secondary
    cpus generic.

    Change-Id: I93b918d42eceb5bd3c8281fb48504f34352f382d
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rusty Russell <rusty@rustcorp.com.au>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Russell King <linux@arm.linux.org.uk>
    Cc: Mike Frysinger <vapier@gentoo.org>
    Cc: Jesper Nilsson <jesper.nilsson@axis.com>
    Cc: Richard Kuo <rkuo@codeaurora.org>
    Cc: Tony Luck <tony.luck@intel.com>
    Cc: Hirokazu Takata <takata@linux-m32r.org>
    Cc: Ralf Baechle <ralf@linux-mips.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: James E.J. Bottomley <jejb@parisc-linux.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
    Cc: Paul Mundt <lethal@linux-sh.org>
    Cc: David S. Miller <davem@davemloft.net>
    Cc: Chris Metcalf <cmetcalf@tilera.com>
    Cc: Richard Weinberger <richard@nod.at>
    Cc: x86@kernel.org
    Link: http://lkml.kernel.org/r/20120420124556.964170564@linutronix.de
* pipe: limit the per-user amount of pages allocated in pipes (Willy Tarreau, 2016-10-29, 1 file, -0/+14)
    On not-so-small systems, it is possible for a single process to cause
    an OOM condition by filling large pipes with data that are never read.
    A typical process filling 4000 pipes with 1 MB of data will use 4 GB
    of memory. On small systems it may be tricky to set the pipe max size
    to prevent this from happening.

    This patch makes it possible to enforce a per-user soft limit above
    which new pipes will be limited to a single page, effectively limiting
    them to 4 kB each, as well as a hard limit above which no new pipes
    may be created for this user. This has the effect of protecting the
    system against memory abuse without hurting other users, and still
    allowing pipes to work correctly though with less data at once.

    The limits are controlled by two new sysctls: pipe-user-pages-soft,
    and pipe-user-pages-hard. Both may be disabled by setting them to
    zero. The default soft limit allows the default number of FDs per
    process (1024) to create pipes of the default size (64kB), thus
    reaching a limit of 64MB before starting to create only smaller pipes.
    With 256 processes limited to 1024 FDs each, this results in
    1024*64kB + (256*1024 - 1024) * 4kB = 1084 MB of memory allocated for
    a user. The hard limit is disabled by default to avoid breaking
    existing applications that make intensive use of pipes (eg: for
    splicing).

    Reported-by: socketpair@gmail.com
    Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Mitigates: CVE-2013-4312 (Linux 2.0+)
    Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Willy Tarreau <w@1wt.eu>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

    Conflicts:
        Documentation/sysctl/fs.txt
        fs/pipe.c
        include/linux/sched.h

    Change-Id: Ic7c678af18129943e16715fdaa64a97a7f0854be
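    A small runnable userspace check of the two new knobs (paths as named
    in the patch; 0 means the limit is disabled):

        #include <stdio.h>

        static unsigned long read_knob(const char *path)
        {
                unsigned long v = 0;
                FILE *f = fopen(path, "r");

                if (f) {
                        if (fscanf(f, "%lu", &v) != 1)
                                v = 0;
                        fclose(f);
                }
                return v;
        }

        int main(void)
        {
                printf("soft: %lu pages\n",
                       read_knob("/proc/sys/fs/pipe-user-pages-soft"));
                printf("hard: %lu pages\n",
                       read_knob("/proc/sys/fs/pipe-user-pages-hard"));
                return 0;
        }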
* hrtimer: Prevent remote enqueue of leftmost timers (Leon Ma, 2016-10-29, 1 file, -0/+5)
    commit 012a45e3f4af68e86d85cce060c6c2fed56498b2 upstream.

    If a cpu is idle and starts an hrtimer which is not pinned on that
    same cpu, the nohz code might target the timer to a different cpu.

    In the case that we switch the cpu base of the timer we already have a
    sanity check in place, which determines whether the timer is earlier
    than the current leftmost timer on the target cpu. In that case we
    enqueue the timer on the current cpu because we cannot reprogram the
    clock event device on the target.

    If the timer's base is already the target CPU we do not have this
    sanity check in place so we enqueue the timer as the leftmost timer in
    the target cpu's rb tree, but we cannot reprogram the clock event
    device on the target cpu. So the timer expires late and subsequently
    prevents the reprogramming of the target cpu clock event device until
    the previously programmed event fires or a timer with an earlier
    expiry time gets enqueued on the target cpu itself.

    Add the same target check as we have for the switch base case and
    start the timer on the current cpu if it would become the leftmost
    timer on the target.

    [ tglx: Rewrote subject and changelog ]

    Change-Id: I18754822387ae503ec460f95203567c5b25597c0
    Signed-off-by: Leon Ma <xindong.ma@intel.com>
    Link: http://lkml.kernel.org/r/1398847391-5994-1-git-send-email-xindong.ma@intel.com
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* rcu: Fix batch-limit size problem (Eric Dumazet, 2016-10-29, 1 file, -7/+8)
    commit 878d7439d0f45a95869e417576774673d1fa243f upstream.

    Commit 29c00b4a1d9e27 (rcu: Add event-tracing for RCU callback
    invocation) added a regression in rcu_do_batch().

    Under stress, RCU is supposed to allow processing of all items in the
    queue, instead of a batch of 10 items (blimit), but an integer
    overflow makes the effective limit 1. So, unless there are frequent
    idle periods (during which RCU ignores batch limits), RCU can be
    forced into a state where it cannot keep up with the
    callback-generation rate, eventually resulting in OOM.

    This commit therefore converts a few variables in rcu_do_batch() from
    int to long to fix this problem, along with the module parameters
    controlling the batch limits.

    Change-Id: I6d6dcb94d2a31f0cf00d3121a7b92e49aa7ab108
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
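    The failure mode in a standalone sketch (assumes LP64: 64-bit long,
    32-bit int; the truncation result is implementation-defined but is -1
    on common compilers):

        #include <limits.h>
        #include <stdio.h>

        int main(void)
        {
                long blimit = LONG_MAX; /* "no batch limit" under stress */
                int bl = (int)blimit;   /* low 32 bits of LONG_MAX: all ones, i.e. -1 */

                /* a loop of the form "if (++count >= bl) break;" now stops
                 * after a single callback instead of draining the queue */
                printf("bl as int = %d\n", bl);
                return 0;
        }

    Widening the batch variables to long removes the truncation entirely.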
* f2fs: avoid hungtask problem caused by losing wake_up (Yunlei He, 2016-10-29, 1 file, -0/+1)
    The D state of wait_on_all_pages_writeback should be woken by
    f2fs_write_end_io when all writeback pages have been successfully
    written to the device. It's possible that the wake_up comes between
    get_pages() and io_schedule(). In that case the wake_up is lost and
    the task stays in D state even though all pages have been written
    back, and finally the whole system ends up in the hungtask state.

        if (!get_pages(sbi, F2FS_WRITEBACK))
                break;
                                <--------- wake_up
        io_schedule();

    Signed-off-by: Yunlei He <heyunlei@huawei.com>
    Signed-off-by: Biao He <hebiao6@huawei.com>
    Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
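    The standard cure for this lost-wakeup shape is to queue on the
    waitqueue before re-checking the condition; a hedged sketch using the
    f2fs names from the text above (cp_wait is assumed to be the relevant
    waitqueue):

        DEFINE_WAIT(wait);

        for (;;) {
                prepare_to_wait(&sbi->cp_wait, &wait, TASK_UNINTERRUPTIBLE);
                if (!get_pages(sbi, F2FS_WRITEBACK))
                        break;  /* checked after queueing: no lost-wakeup window */
                io_schedule();
        }
        finish_wait(&sbi->cp_wait, &wait);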
* prctl: make PR_SET_TIMERSLACK_PID pid namespace aware (Micha Kalfon, 2016-10-29, 1 file, -2/+2)
    Make PR_SET_TIMERSLACK_PID consider the pid namespace and resolve the
    target pid in the caller's namespace. Otherwise, calls from pid
    namespaces other than init would fail or affect the wrong task.

    Change-Id: I1da15196abc4096536713ce03714e99d2e63820a
    Signed-off-by: Micha Kalfon <micha@cellrox.com>
    Acked-by: Oren Laadan <orenl@cellrox.com>
* __ptrace_may_access() should not deny sub-threads (Mark Grondona, 2016-10-29, 1 file, -1/+1)
    (cherry pick from commit 73af963f9f3036dffed55c3a2898598186db1045)

    __ptrace_may_access() checks get_dumpable/ptrace_has_cap/etc if
    task != current; this can lead to surprising results.

    For example, a sub-thread can't readlink("/proc/self/exe") if the
    executable is not readable. setup_new_exec()->would_dump() notices
    that inode_permission(MAY_READ) fails and then it does
    set_dumpable(suid_dumpable). After that get_dumpable() fails.

    (It is not clear why proc_pid_readlink() checks get_dumpable(),
    perhaps we could add PTRACE_MODE_NODUMPABLE)

    Change __ptrace_may_access() to use same_thread_group() instead of
    "task == current". Any security check is pointless when the tasks
    share the same ->mm.

    Signed-off-by: Mark Grondona <mgrondona@llnl.gov>
    Signed-off-by: Ben Woodard <woodard@redhat.com>
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Bug: 26016905
    Change-Id: If9e2a0eb3339d26d50a9d84671a189fe405f36a3
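    The change itself is essentially one line (per the upstream commit
    cited above):

        /* was: if (task == current) return 0; */
        if (same_thread_group(task, current))
                return 0;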
* sched: Remove redundant update_runtime notifier (Neil Zhang, 2016-10-29, 3 files, -44/+0)
    migration_call() will do all the things that update_runtime() does.
    So let's remove it.

    Furthermore, there is a potential risk that the current code will hit
    the BUG_ON at line 689 of rt.c when doing cpu hotplug while there are
    realtime threads running, because runtime is enabled twice while
    rt_runtime may already have changed.

    Change-Id: If2d953316d93c6b7e32f94bd49f2c10e64de6ed8
    Signed-off-by: Neil Zhang <zhangwm@marvell.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/1365685499-26515-1-git-send-email-zhangwm@marvell.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Git-commit: c5405a495e88d93cf9b4f4cc91507c7f4afcb901
    Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    [mattw@codeaurora.org: resolved trivial header file context conflict]
    Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
* sched/rt: Reduce rq lock contention by eliminating locking of non-feasible target (Tim Chen, 2016-10-29, 1 file, -1/+16)
    commit 80e3d87b2c5582db0ab5e39610ce3707d97ba409 upstream.

    This patch adds checks that prevent futile attempts to move rt tasks
    to a CPU with active tasks of equal or higher priority. This reduces
    run queue lock contention and improves the performance of a well known
    OLTP benchmark by 0.7%.

    Change-Id: I82d8e3a8523a98c060b30945fdff13d7af908220
    Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Shawn Bohrer <sbohrer@rgmadvisors.com>
    Cc: Suruchi Kadu <suruchi.a.kadu@intel.com>
    Cc: Doug Nelson <doug.nelson@intel.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Zefan Li <lizefan@huawei.com>
* softirq: reduce latencies (Eric Dumazet, 2016-10-29, 1 file, -8/+9)
    commit c10d73671ad30f54692f7f69f0e09e75d3a8926a upstream.

    In various network workloads, __do_softirq() latencies can be up to
    20 ms if HZ=1000, and 200 ms if HZ=100.

    This is because we iterate 10 times in the softirq dispatcher, and
    some actions can consume a lot of cycles.

    This patch changes the fallback to ksoftirqd condition to:

    - A time limit of 2 ms.
    - need_resched() being set on current task

    When one of these conditions is met, we wake up ksoftirqd for further
    softirq processing if we still have pending softirqs.

    Using need_resched() as the only condition can trigger RCU stalls, as
    we can keep BH disabled for too long.

    I ran several benchmarks and got no significant difference in
    throughput, but a very significant reduction of latencies (one order
    of magnitude):

    In the following bench, 200 antagonist "netperf -t TCP_RR" are started
    in background, using all available cpus. Then we start one
    "netperf -t TCP_RR", bound to the cpu handling the NIC IRQ (hard+soft)

    Before patch:

      RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
      RT_LATENCY=550110.424
      MIN_LATENCY=146858
      MAX_LATENCY=997109
      P50_LATENCY=305000
      P90_LATENCY=550000
      P99_LATENCY=710000
      MEAN_LATENCY=376989.12
      STDDEV_LATENCY=184046.92

    After patch:

      RT_LATENCY,MIN_LATENCY,MAX_LATENCY,P50_LATENCY,P90_LATENCY,P99_LATENCY,MEAN_LATENCY,STDDEV_LATENCY
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : first burst 0 : cpu bind
      RT_LATENCY=40545.492
      MIN_LATENCY=9834
      MAX_LATENCY=78366
      P50_LATENCY=33583
      P90_LATENCY=59000
      P99_LATENCY=69000
      MEAN_LATENCY=38364.67
      STDDEV_LATENCY=12865.26

    Change-Id: I94f96a9040a018644d3e2150f54acfd9a080992d
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: David Miller <davem@davemloft.net>
    Cc: Tom Herbert <therbert@google.com>
    Cc: Ben Hutchings <bhutchings@solarflare.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    [xr: Backported to 3.4: Adjust context]
    Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
    Signed-off-by: Zefan Li <lizefan@huawei.com>
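    The new fallback condition, sketched after the upstream change (2 ms
    limit per the text; fragment, not a complete function):

        #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)

        unsigned long end = jiffies + MAX_SOFTIRQ_TIME;

        /* ... process pending softirq vectors ... */

        pending = local_softirq_pending();
        if (pending) {
                if (time_before(jiffies, end) && !need_resched())
                        goto restart;   /* keep iterating in __do_softirq() */
                wakeup_softirqd();      /* hand the rest to ksoftirqd */
        }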