path: root/kernel
Commit message · Author · Age · Files · Lines
* __ptrace_may_access() should not deny sub-threads (Mark Grondona, 2017-05-25, 1 file, -1/+1)
  (cherry picked from commit 73af963f9f3036dffed55c3a2898598186db1045) __ptrace_may_access() checks get_dumpable/ptrace_has_cap/etc if task != current; this can lead to surprising results.
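  The cherry-picked fix is a one-liner; a sketch of the relevant hunk in kernel/ptrace.c (context abbreviated, not the verbatim patch):

	static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
	{
		/* Old code used `if (task == current) return 0;`, so sibling
		 * threads fell through to the dumpable/capability checks
		 * below and could be denied.  Allow the whole thread group. */
		if (same_thread_group(task, current))
			return 0;

		/* ... get_dumpable()/ptrace_has_cap() checks for truly
		 * foreign tasks follow here ... */
	}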
* perf: Tighten (and fix) the grouping condition (Peter Zijlstra, 2017-05-03, 1 file, -2/+13)
  The fix from 9fc81d87420d ("perf: Fix events installation during moving group") was incomplete in that it failed to recognise that creating a group with events for different CPUs is semantically broken -- they cannot be co-scheduled.

  Furthermore, it leads to real breakage where, when we create an event for CPU Y and then migrate it to form a group on CPU X, the code gets confused where the counter is programmed -- triggered in practice as well by me via the perf fuzzer.

  Fix this by tightening the rules for creating groups. Only allow grouping of counters that can be co-scheduled in the same context. This means for the same task and/or the same cpu.
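  A sketch of the tightened checks in sys_perf_event_open() (kernel/events/core.c), abridged from the description above; the error-path label is illustrative:

	if (group_leader) {
		/* Events on different CPUs can never be co-scheduled,
		 * so refuse to group them. */
		if (group_leader->cpu != event->cpu)
			goto err_context;

		/* Likewise, a per-task event cannot be grouped with an
		 * event bound to a different task (or to none). */
		if (group_leader->ctx->task != ctx->task)
			goto err_context;
	}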
* [PATCH] sched: Remove stale power aware scheduling remnants and dysfunctional knobs (franciscofranco, 2017-03-22, 2 files, -225/+7)
  Upstream: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8e7fbcbc22c12414bcc9dfdd683637f58fb32759
  Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
* revert some adjustments (AndDiSa, 2017-03-17, 1 file, -1/+1)
* suspend: shorten time to enter sleep state (AndDiSa, 2017-03-15, 1 file, -1/+1)
* fix merge (AndDiSa, 2017-02-18, 1 file, -3/+9)
* hrtimer: Allow concurrent hrtimer_start() for self restarting timers (Peter Zijlstra (Intel), 2017-02-18, 1 file, -3/+9)
* [PATCH] Fix lockup related to stop_machine being stuck (AndDiSa, 2017-02-18, 1 file, -3/+6)
* [PATCH] workqueue: cond_resched() after processing each work item (Tejun Heo, 2017-02-18, 1 file, -0/+9)
  commit b22ce2785d97423846206cceec4efee0c4afd980 upstream.

  This becomes dangerous when a self-requeueing work item which is waiting for something to happen races against stop_machine. Such a self-requeueing work item would requeue itself indefinitely, hogging the kworker and the CPU it's running on, while stop_machine would wait for that CPU to enter stop_machine while preventing anything else from happening on all other CPUs. The two would deadlock.

  Jamie Liu reports that this deadlock scenario exists around scsi_requeue_run_queue() and libata port multiplier support, where one port may exclude command processing from other ports. With the right timing, scsi_requeue_run_queue() can end up requeueing itself trying to execute an IO which is asked to be retried while another device has exclusive access, which in turn can't make forward progress due to stop_machine.

  Fix it by invoking cond_resched() after executing each work item.

  Change-Id: I90b68266fc75edada8a75fd4e1e943018f4c3221
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reported-by: Jamie Liu <jamieliu@google.com>
  References: http://thread.gmane.org/gmane.linux.kernel/1552567
  [bwh: Backported to 3.2: adjust context]
  Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
  Cc: Qiang Huang <h.huangqiang@huawei.com>
  Cc: Li Zefan <lizefan@huawei.com>
  Cc: Jianguo Wu <wujianguo@huawei.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
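  The fix itself is tiny; a sketch of the hunk in process_one_work() (kernel/workqueue.c), with surrounding code elided:

	worker->current_func(work);	/* run one work item */

	/*
	 * On !PREEMPT kernels a self-requeueing work item could otherwise
	 * hog this kworker forever and deadlock against stop_machine.
	 * Give the scheduler a chance to run something else in between.
	 */
	cond_resched();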
* audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE (Oleg Nesterov, 2017-02-18, 1 file, -1/+1)
* [PATCH] hrtimer: Fix ktime_add_ns() overflow on 32bit architectures (David Engraf, 2017-02-18, 1 file, -0/+4)
  commit 51fd36f3fad8447c487137ae26b9d0b3ce77bb25 upstream.

  One can trigger an overflow when using ktime_add_ns() on a 32bit architecture not supporting CONFIG_KTIME_SCALAR. When passing a very high value for u64 nsec, e.g. 7881299347898368000, the do_div() function converts this value to seconds (7881299347), which is still too high to pass to the ktime_set() function as a long. The result is a negative value.

  The problem on my system occurs in tick-sched.c, tick_nohz_stop_sched_tick(), when time_delta is set to timekeeping_max_deferment(). The check for time_delta < KTIME_MAX is valid, thus ktime_add_ns() is called with a too large value, resulting in a negative expire value. This leads to an endless loop in the ticker code:

  time_delta: 7881299347898368000
  expires = ktime_add_ns(last_update, time_delta)
  expires: negative value

  This fix caps the value to KTIME_MAX. This error doesn't occur on 64bit or on architectures supporting CONFIG_KTIME_SCALAR (e.g. ARM, x86-32).

  Change-Id: I59a43fa8f68eb7a50046fff221969a60275f3852
  Signed-off-by: David Engraf <david.engraf@sysgo.com>
  [jstultz: Minor tweaks to commit message & header]
  Signed-off-by: John Stultz <john.stultz@linaro.org>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
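  A sketch of the capped ktime_add_ns() on !CONFIG_KTIME_SCALAR kernels (kernel/hrtimer.c), following the hunk described above:

	ktime_t ktime_add_ns(const ktime_t kt, u64 nsec)
	{
		ktime_t tmp;

		if (likely(nsec < NSEC_PER_SEC)) {
			tmp.tv64 = nsec;
		} else {
			unsigned long rem = do_div(nsec, NSEC_PER_SEC);

			/* nsec now holds whole seconds; make sure it still
			 * fits into the long that ktime_set() takes,
			 * instead of silently going negative. */
			if (unlikely(nsec > KTIME_SEC_MAX))
				return (ktime_t){ .tv64 = KTIME_MAX };

			tmp = ktime_set((long)nsec, rem);
		}
		return ktime_add(kt, tmp);
	}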
* [PATCH] irq: Enable all irqs unconditionally in irq_resume (Laxman Dewangan, 2017-02-18, 1 file, -1/+1)
  commit ac01810c9d2814238f08a227062e66a35a0e1ea2 upstream.

  When the system enters suspend, it disables all interrupts in suspend_device_irqs(), including the interrupts marked EARLY_RESUME. On the resume side things are different. The EARLY_RESUME interrupts are reenabled in sys_core_ops->resume and the non EARLY_RESUME interrupts are reenabled in the normal system resume path.

  When suspend_noirq() fails or suspend is aborted for any other reason, we might omit the resume side call to sys_core_ops->resume(), and therefore the interrupts marked EARLY_RESUME are not reenabled and stay disabled forever.

  To solve this, enable all irqs unconditionally in irq_resume() regardless of whether interrupts marked EARLY_RESUME have already been enabled or not. This might try to reenable already enabled interrupts in the non-failure case, but the only affected platform is XEN and it has been confirmed that it does not cause any side effects.

  [ tglx: Massaged changelog. ]

  Change-Id: Id4527228f1bb44de5e36b569ac5b864b2b63fc7d
  Signed-off-by: Laxman Dewangan <ldewangan@nvidia.com>
  Acked-by-and-tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  Acked-by: Heiko Stuebner <heiko@sntech.de>
  Reviewed-by: Pavel Machek <pavel@ucw.cz>
  Cc: <ian.campbell@citrix.com>
  Cc: <rjw@rjwysocki.net>
  Cc: <len.brown@intel.com>
  Cc: <gregkh@linuxfoundation.org>
  Link: http://lkml.kernel.org/r/1385388587-16442-1-git-send-email-ldewangan@nvidia.com
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
* [PATCH] kernel/audit_tree.c: tree will leak memory when failure occurs in audit_trim_trees() (Chen Gang, 2017-02-18, 1 file, -1/+1)
  commit 12b2f117f3bf738c1a00a6f64393f1953a740bd4 upstream.

  audit_trim_trees() calls get_tree(). If a failure occurs we must call put_tree().

  [akpm@linux-foundation.org: run put_tree() before mutex_lock() for small scalability improvement]
  Signed-off-by: Chen Gang <gang.chen@asianux.com>
  Cc: Al Viro <viro@zeniv.linux.org.uk>
  Cc: Eric Paris <eparis@redhat.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Change-Id: If4bb4e14b74e260461c2eb5c26f292cb132315a0
  Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
* UPSTREAM: ring-buffer: Prevent overflow of size in ring_buffer_resize() (Steven Rostedt, 2017-02-18, 1 file, -6/+5)
* [PATCH] add extra free kbytes tunable (Colin Cross, 2017-02-17, 1 file, -0/+9)
  Add a userspace visible knob to tell the VM to keep an extra amount of memory free, by increasing the gap between each zone's min and low watermarks.

  This is useful for realtime applications that call system calls and have a bound on the number of allocations that happen in any short time period. In this application, extra_free_kbytes would be left at an amount equal to or larger than the maximum number of allocations that happen in any burst.

  It may also be useful to reduce the memory use of virtual machines (temporarily?), in a way that does not cause memory fragmentation like ballooning does.

  [ccross] Revived for use on old kernels where no other solution exists. The tunable will be removed on kernels that do better at avoiding direct reclaim.

  Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
  Signed-off-by: Rik van Riel <riel@redhat.com>
  Signed-off-by: Colin Cross <ccross@android.com>
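  A minimal sketch of how such a tunable plugs in, assuming the names used in the Android patch (extra_free_kbytes, reusing the min_free_kbytes sysctl handler); the exact hunks may differ:

	/* kernel/sysctl.c -- expose /proc/sys/vm/extra_free_kbytes */
	{
		.procname	= "extra_free_kbytes",
		.data		= &extra_free_kbytes,
		.maxlen		= sizeof(extra_free_kbytes),
		.mode		= 0644,
		.proc_handler	= min_free_kbytes_sysctl_handler,
		.extra1		= &zero,	/* reject negative values */
	},

	/* mm/page_alloc.c -- __setup_per_zone_wmarks(): widen the gap
	 * between the min and low watermarks by the extra amount. */
	unsigned long min = min_free_kbytes   >> (PAGE_SHIFT - 10); /* kB -> pages */
	unsigned long low = extra_free_kbytes >> (PAGE_SHIFT - 10);
	...
	zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + low + (min >> 2);
	zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + low + (min >> 1);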
* [PATCH] Add Joe's RCU. Original patch from Joe Korty. (Francisco Franco, 2017-02-13, 3 files, -2/+814)
* [PATCH] zram swap optimizations: added zram swap optimizations for Linux 3.0/3.1 (Paul Reioux, 2017-02-10, 1 file, -0/+13)
  Code cherry-picked from Samsung Galaxy Note 2 kernels.
  Signed-off-by: Paul Reioux <reioux@gmail.com>
* [PATCH] mutex: dynamically disable mutex spinning at high load (faux123, 2017-02-09, 1 file, -0/+22)
  Date: Thu, 4 Apr 2013 10:54:18 -0400

  The Linux mutex code has a MUTEX_SPIN_ON_OWNER configuration option that was enabled by default in major distributions like Red Hat. Allowing threads waiting on a mutex to spin while the mutex owner is running will theoretically reduce latency on the acquisition of the mutex at the expense of energy efficiency, as the spinning threads are doing no useful work. This is not a problem on a lightly loaded system where the CPU may be idle anyway. On a highly loaded system, the spinning tasks may be blocking other tasks from running even if they have higher priority, because the spinning is done with preemption disabled.

  This patch will disable mutex spinning if the current load is high enough. The load is considered high if there are 2 or more active tasks waiting to run on the current CPU. If there is only one task waiting, it will check the average load of the past minute (calc_load_tasks). If it is more than double the number of active CPUs, the load is considered high too. This is a rather simple metric that does not incur much additional overhead.

  The AIM7 benchmarks were run on 3.7.10-derived kernels to show the performance changes with this patch on an 8-socket 80-core system with hyperthreading off. The table below shows the mean % change in performance over a range of users for some AIM7 workloads with just the less-atomic-operation patch (patch 1) vs the less-atomic-operation patch plus this one (patches 1+3).

  +--------------+-----------------+-----------------+-----------------+
  |   Workload   |  mean % change  |  mean % change  |  mean % change  |
  |              |  10-100 users   | 200-1000 users  | 1100-2000 users |
  +--------------+-----------------+-----------------+-----------------+
  | alltests     |       0.0%      |      -0.1%      |      +5.0%      |
  | five_sec     |      +1.5%      |      +1.3%      |      +1.3%      |
  | fserver      |      +1.5%      |     +25.4%      |      +9.6%      |
  | high_systime |      +0.1%      |       0.0%      |      +0.8%      |
  | new_fserver  |      +0.2%      |     +11.9%      |     +14.1%      |
  | shared       |      -1.2%      |      +0.3%      |      +1.8%      |
  | short        |      +6.4%      |      +2.5%      |      +3.0%      |
  +--------------+-----------------+-----------------+-----------------+

  It can be seen that this patch provides a big performance improvement for the fserver and new_fserver workloads while still being generally positive for the other AIM7 workloads.

  Signed-off-by: Waiman Long <Waiman.Long@hp.com>
  Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
  modified for grouper kernel from LKML
  Signed-off-by: Paul Reioux <reioux@gmail.com>
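  A sketch of the heuristic as described above; the helper name and the scheduler accessors are assumptions, not a stable kernel API:

	/* kernel/mutex.c -- gate optimistic spinning on CPU load */
	static inline bool mutex_spin_allowed(void)
	{
		/* 2+ runnable tasks queued on this CPU: spinning (with
		 * preemption disabled) would starve them. */
		if (this_rq()->nr_running > 1)
			return false;

		/* Otherwise consult the 1-minute load: more than two
		 * runnable tasks per online CPU counts as high load. */
		if (atomic_long_read(&calc_load_tasks) >
		    2 * num_online_cpus())
			return false;

		return true;
	}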
* [PATCH] mutex: Make more scalable by doing less atomic operations (longman88, 2017-02-09, 2 files, -3/+14)
  Date: Thu, 4 Apr 2013 10:54:16 -0400

  In the __mutex_lock_common() function, an initial entry into the lock slow path will cause two atomic_xchg instructions to be issued. Together with the atomic decrement in the fast path, a total of three atomic read-modify-write instructions will be issued in rapid succession. This can cause a lot of cache bouncing when many tasks are trying to acquire the mutex at the same time.

  This patch will reduce the number of atomic_xchg instructions used by checking the counter value first before issuing the instruction. The atomic_read() function is just a simple memory read. The atomic_xchg() function, on the other hand, can be up to 2 orders of magnitude or even more in cost when compared with atomic_read(). By using atomic_read() to check the value first before calling atomic_xchg(), we can avoid a lot of unnecessary cache coherency traffic. The only downside with this change is that a task on the slow path will have a tiny bit less chance of getting the mutex when competing with another task in the fast path.

  The same is true for the atomic_cmpxchg() function in the mutex-spin-on-owner loop, so an atomic_read() is also performed before calling atomic_cmpxchg().

  The mutex locking and unlocking code for the x86 architecture can allow any negative number to be used in the mutex count to indicate that some tasks are waiting for the mutex. I am not so sure if that is the case for the other architectures. So the default is to avoid atomic_xchg() if the count has already been set to -1. For x86, the check is modified to include all negative numbers to cover a larger case.

  The following table shows the scalability data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl command is used to restrict the running of the high_systime workloads to 1/2/4/8 nodes with hyperthreading on and off.

  +-----------------+------------------+------------------+----------+
  |  Configuration  | Mean Transaction | Mean Transaction | % Change |
  |                 |  Rate w/o patch  | Rate with patch  |          |
  +-----------------+------------------+------------------+----------+
  |                      User Range 1100 - 2000                      |
  +-----------------+------------------+------------------+----------+
  | 8 nodes, HT on  |       36980      |      148590      |  +301.8% |
  | 8 nodes, HT off |       42799      |      145011      |  +238.8% |
  | 4 nodes, HT on  |       61318      |      118445      |   +51.1% |
  | 4 nodes, HT off |      158481      |      158592      |    +0.1% |
  | 2 nodes, HT on  |      180602      |      173967      |    -3.7% |
  | 2 nodes, HT off |      198409      |      198073      |    -0.2% |
  | 1 node , HT on  |      149042      |      147671      |    -0.9% |
  | 1 node , HT off |      126036      |      126533      |    +0.4% |
  +-----------------+------------------+------------------+----------+
  |                      User Range 200 - 1000                       |
  +-----------------+------------------+------------------+----------+
  | 8 nodes, HT on  |       41525      |      122349      |  +194.6% |
  | 8 nodes, HT off |       49866      |      124032      |  +148.7% |
  | 4 nodes, HT on  |       66409      |      106984      |   +61.1% |
  | 4 nodes, HT off |      119880      |      130508      |    +8.9% |
  | 2 nodes, HT on  |      138003      |      133948      |    -2.9% |
  | 2 nodes, HT off |      132792      |      131997      |    -0.6% |
  | 1 node , HT on  |      116593      |      115859      |    -0.6% |
  | 1 node , HT off |      104499      |      104597      |    +0.1% |
  +-----------------+------------------+------------------+----------+

  AIM7 benchmark runs have a pretty large run-to-run variance due to the random nature of the subtests executed. So a difference of less than +-5% may not be really significant.
  This patch improves high_systime workload performance at 4 nodes and up by maintaining transaction rates without significant drop-off at high node count. The patch has practically no impact on 1- and 2-node systems.

  The table below shows the percentage of time (as reported by perf record -a -s -g) spent on the __mutex_lock_slowpath() function by the high_systime workload at 1500 users for 2/4/8-node configurations with hyperthreading off.

  +---------------+-----------------+------------------+---------+
  | Configuration | %Time w/o patch | %Time with patch | %Change |
  +---------------+-----------------+------------------+---------+
  |    8 nodes    |      65.34%     |       0.69%      |   -99%  |
  |    4 nodes    |       8.70%     |       1.02%      |   -88%  |
  |    2 nodes    |       0.41%     |       0.32%      |   -22%  |
  +---------------+-----------------+------------------+---------+

  It is obvious that the dramatic performance improvement at 8 nodes was due to the drastic cut in the time spent within the __mutex_lock_slowpath() function.

  The table below shows the improvements in other AIM7 workloads (at 8 nodes, hyperthreading off).

  +--------------+---------------+----------------+-----------------+
  |   Workload   | mean % change | mean % change  |  mean % change  |
  |              | 10-100 users  | 200-1000 users | 1100-2000 users |
  +--------------+---------------+----------------+-----------------+
  | alltests     |     +0.6%     |    +104.2%     |     +185.9%     |
  | five_sec     |     +1.9%     |      +0.9%     |       +0.9%     |
  | fserver      |     +1.4%     |      -7.7%     |       +5.1%     |
  | new_fserver  |     -0.5%     |      +3.2%     |       +3.1%     |
  | shared       |    +13.1%     |    +146.1%     |     +181.5%     |
  | short        |     +7.4%     |      +5.0%     |       +4.2%     |
  +--------------+---------------+----------------+-----------------+

  Signed-off-by: Waiman Long <Waiman.Long@hp.com>
  Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
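  The core of the change is replacing an unconditional read-modify-write with a read-filtered one; a sketch of the slow-path entry (kernel/mutex.c), abridged:

	/*
	 * Old code issued the RMW unconditionally, bouncing the lock's
	 * cache line between all contending CPUs on every pass:
	 *
	 *	if (atomic_xchg(&lock->count, -1) == 1) ...
	 *
	 * New code filters with a plain read first, so the expensive
	 * atomic_xchg() only runs when the mutex actually looks free:
	 */
	if (atomic_read(&lock->count) == 1 &&
	    atomic_xchg(&lock->count, -1) == 1)
		goto acquired;	/* label illustrative */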
* Correct optimizations for Cortex-A9 (Sudokamikaze, 2017-02-06, 1 file, -1/+2)
* [PATCH] sched_fair: Reduce latencies (AndDiSa, 2017-01-21, 1 file, -3/+3)
  Change-Id: I908343f1c8ee70fe9c46f3d9d428ebdd54c9b13f
* correct merge error (AndDiSa, 2017-01-20, 1 file, -1/+1)
* [PATCH] perf: protect group_leader from races that cause ctx double-free (AndDiSa, 2017-01-20, 1 file, -0/+16)
  When moving a group_leader perf event from a software context to a hardware context, there's a race in checking and updating that context. The existing locking solution doesn't work; note that it tries to grab a lock inside the group_leader's context object, which you can only get at by going through a pointer that should be protected from these races.

  To avoid that problem, and to produce a simple solution, we can just use a lock per group_leader to protect all checks on the group_leader's context. The new lock is grabbed and released when no context locks are held.

  RM-290
  Bug: 30955111
  Bug: 31095224
  Change-Id: If37124c100ca6f4aa962559fba3bd5dbbec8e052
* [PATCH] BACKPORT: perf: Fix event->ctx locking (AndDiSa, 2017-01-20, 1 file, -34/+205)
  There have been a few reported issues wrt. the lack of locking around changing event->ctx. This patch tries to address those.

  It avoids the whole rwsem thing; and while it appears to work, please give it some thought in review.

  What I did fail at is sensible runtime checks on the use of event->ctx; the RCU use makes it very hard.

  Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  Cc: Jiri Olsa <jolsa@redhat.com>
  Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Link: http://lkml.kernel.org/r/20150123125834.209535886@infradead.org
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
  (cherry picked from commit f63a8daa5812afef4f06c962351687e1ff9ccb2b)
  Bug: 30955111
  Bug: 31095224
  Change-Id: I5bab713034e960fad467637e98e914440de5666d
* [PATCH] BACKPORT: perf: Introduce perf_pmu_migrate_context() (AndDiSa, 2017-01-20, 1 file, -0/+36)
  Originally from Peter Zijlstra. The helper migrates perf events from one cpu to another cpu.

  Conflicts (perf: Fix race in removing an event): kernel/events/core.c

  Change-Id: I7885fe36c9e2803b10477d556163197085be3d19
  Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Link: http://lkml.kernel.org/r/1339741902-8449-5-git-send-email-zheng.z.yan@intel.com
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
* [PATCH] BACKPORT: perf: Allow the PMU driver to choose the CPU on which to install events (AndDiSa, 2017-01-20, 1 file, -4/+4)
  Allow the pmu->event_init callback to change event->cpu, so the PMU driver can choose the CPU on which to install events.

  Change-Id: I0f8b4310d306f4c87bc961f0359c2bdf65c129b6
  Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
  Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Link: http://lkml.kernel.org/r/1339741902-8449-4-git-send-email-zheng.z.yan@intel.com
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
* fix compile problems on 16.04 (AndDiSa, 2016-09-08, 1 file, -1/+1)
* some minor changes ... (AndDiSa, 2015-11-20, 1 file, -0/+4)
* fix (AndDiSa, 2015-11-19, 1 file, -1/+1)
* Revert "revert me ..." (AndDiSa, 2015-11-19, 1 file, -2/+2)
  This reverts commit 4e28b15bcc5474a323c222572f222fd8be577cb0.
* revert me ... (AndDiSa, 2015-11-19, 1 file, -2/+2)
* initial merge (AndDiSa, 2015-11-19, 2 files, -0/+8)
* M for grouper (Dmitry Grinberg, 2015-10-15, 7 files, -13/+56)
* cgroup: remove synchronize_rcu() from cgroup_attach_{task|proc}() (Devin Kim, 2014-12-10, 1 file, -1/+0)
  These 2 synchronize_rcu()s make attaching a task to a cgroup quite slow, and it can't be ignored in some situations.

  A real case from Colin Cross: Android uses cgroups heavily to manage thread priorities, putting threads in a background group with reduced cpu.shares when they are not visible to the user, and in a foreground group when they are. Some RPCs from foreground threads to background threads will temporarily move the background thread into the foreground group for the duration of the RPC. This results in many calls to cgroup_attach_task.

  In cgroup_attach_task() it's task->cgroups that is protected by RCU, and put_css_set() calls kfree_rcu() to free it. If we remove this synchronize_rcu(), there can be threads in RCU read sections accessing their old cgroup via current->cgroups with a concurrent rmdir operation, but this is safe.

  # time for ((i=0; i<50; i++)) { echo $$ > /mnt/sub/tasks; echo $$ > /mnt/tasks; }
  real 0m2.524s
  user 0m0.008s
  sys 0m0.004s

  With this patch:
  real 0m0.004s
  user 0m0.004s
  sys 0m0.000s

  tj: These synchronize_rcu()s are utterly confused. synchronize_rcu() necessarily has to come between two operations to guarantee that the changes made by the former operation are visible to all rcu readers before proceeding to the latter operation. Here, the synchronize_rcu()s are at the end of attach operations with nothing beyond them. Their only effect would be delaying completion of write(2) to sysfs tasks/procs files until all rcu readers see the change, which doesn't mean anything.

  cherry-picked from: https://android.googlesource.com/kernel/common/+/5d65bc0ca1bceb73204dab943922ba3c83276a8c
  Bug: 17709419
  Change-Id: I98dacd6c13da27cb3496fe4a24a24084e46bdd9c
  Signed-off-by: Li Zefan <lizefan@huawei.com>
  Signed-off-by: Tejun Heo <tj@kernel.org>
  Reported-by: Colin Cross <ccross@google.com>
  Signed-off-by: Devin Kim <dojip.kim@lge.com>
* oom: remove oom_disable_count (David Rientjes, 2014-11-26, 2 files, -11/+1)
  This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow.

  The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed, since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since:

  - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and
  - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled.

  Signed-off-by: David Rientjes <rientjes@google.com>
  Reported-by: Oleg Nesterov <oleg@redhat.com>
  Reviewed-by: Oleg Nesterov <oleg@redhat.com>
  Cc: Ying Han <yinghan@google.com>
  Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Change-Id: If5667b6eb784b0503179f2f4e9aef0b5ecbdb368
* cgroup: Fix use after free of cgrp (cgrp->css_sets) (Hans de Goede, 2014-07-29, 1 file, -1/+5)
  Running a 3.4 kernel + Fedora-18 (systemd) userland on my Allwinner A10 (arm cortex a8), I'm seeing repeated, reproducible list_del list corruption errors when built with CONFIG_DEBUG_LIST, and the backtrace always shows free_css_set_work as the function making the problematic list_del call.

  I've tracked this down to a use after free of the cgrp struct, specifically of the cgrp->css_sets list_head, which gets cleared by free_css_set_work. Since free_css_set_work runs from a workqueue, it is possible for it to not be done with clearing the list when the cgrp gets freed.

  To avoid this the code adding the links increases cgrp->count, and the freeing code running from the workqueue decreases cgrp->count *after* doing list_del, and then, if the count goes to 0, calls cgroup_wakeup_rmdir_waiter(). However cgroup_rmdir() is missing a check for cgrp->count != 0, causing it to still continue with the rmdir (which leads to the freeing of the cgrp) before free_css_set_work is done. Sometimes the freed memory is re-used before free_css_set_work gets around to unlinking link->cgrp_link_list, triggering the list_del list corruption messages.

  This patch fixes this by properly checking for cgrp->count != 0 and waiting for the cgroup_rmdir_waitq in that case.

  Change-Id: I9dbc02a0a75d5dffa1b65d67456e00139dea57c3
  Signed-off-by: Hans de Goede <hdegoede@redhat.com>
* cgroup: Take css_set_lock from cgroup_css_sets_empty() (Hans de Goede, 2014-07-29, 1 file, -5/+9)
  As indicated in the comment above cgroup_css_sets_empty, it needs the css_set_lock. But neither of the 2 call points have it, so rather than fixing the callers, just take the lock inside cgroup_css_sets_empty().

  Signed-off-by: Hans de Goede <hdegoede@redhat.com>
  Change-Id: If7aea71824f6d0e3f2cc6c1ce236c3ae6be2037b
* prctl: adds the capable(CAP_SYS_NICE) check to PR_SET_TIMERSLACK_PID (Ruchi Kandoi, 2014-06-16, 1 file, -0/+3)
  Adds a capable() check to make sure that arbitrary apps do not change the timer slack for other apps.

  Bug: 15000427
  Change-Id: I558a2551a0e3579c7f7e7aae54b28aa9d982b209
  Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
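  A sketch of the added check in the prctl handler (kernel/sys.c); the self-exemption detail is an assumption, not confirmed by the changelog:

	case PR_SET_TIMERSLACK_PID:
		/* Retuning another task's slack now requires privilege;
		 * a task may still adjust its own. */
		if (current->pid != (pid_t)arg3 && !capable(CAP_SYS_NICE))
			return -EPERM;
		/* ... existing lookup and assignment follow ... */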
* futex: Make lookup_pi_state more robust (Thomas Gleixner, 2014-06-11, 1 file, -17/+106)
  The current implementation of lookup_pi_state has ambiguous handling of the TID value 0 in the user space futex. We can get into the kernel even if the TID value is 0, because either there is a stale waiters bit, or the owner died bit is set, or we are called from the requeue_pi path, or from user space just for fun.

  The current code avoids an explicit sanity check for pid = 0 in case that kernel internal state (waiters) is found for the user space address. This can lead to state leakage and worse under some circumstances.

  Handle the cases explicitly:

        Waiter | pi_state | pi->owner | uTID      | uODIED | ?

  [1]   NULL   | ---      | ---       | 0         | 0/1    | Valid
  [2]   NULL   | ---      | ---       | >0        | 0/1    | Valid
  [3]   Found  | NULL     | --        | Any       | 0/1    | Invalid
  [4]   Found  | Found    | NULL      | 0         | 1      | Valid
  [5]   Found  | Found    | NULL      | >0        | 1      | Invalid
  [6]   Found  | Found    | task      | 0         | 1      | Valid
  [7]   Found  | Found    | NULL      | Any       | 0      | Invalid
  [8]   Found  | Found    | task      | ==taskTID | 0/1    | Valid
  [9]   Found  | Found    | task      | 0         | 0      | Invalid
  [10]  Found  | Found    | task      | !=taskTID | 0/1    | Invalid

  [1]  Indicates that the kernel can acquire the futex atomically. We came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
  [2]  Valid, if TID does not belong to a kernel thread. If no matching thread is found then it indicates that the owner TID has died.
  [3]  Invalid. The waiter is queued on a non PI futex.
  [4]  Valid state after exit_robust_list(), which sets the user space value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
  [5]  The user space value got manipulated between exit_robust_list() and exit_pi_state_list().
  [6]  Valid state after exit_pi_state_list(), which sets the new owner in the pi_state but cannot access the user space value.
  [7]  pi_state->owner can only be NULL when the OWNER_DIED bit is set.
  [8]  Owner and user space value match.
  [9]  There is no transient state which sets the user space TID to 0 except exit_robust_list(), but this is indicated by the FUTEX_OWNER_DIED bit. See [4].
  [10] There is no transient state which leaves owner and user space TID out of sync.

  Backport to 3.13 conflicts: kernel/futex.c

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Signed-off-by: John Johansen <john.johansen@canonical.com>
  Cc: Kees Cook <keescook@chromium.org>
  Cc: Will Drewry <wad@chromium.org>
  Cc: Darren Hart <dvhart@linux.intel.com>
  Cc: stable@vger.kernel.org
* futex: Always cleanup owner tid in unlock_pi (Thomas Gleixner, 2014-06-11, 1 file, -22/+18)
  If the owner died bit is set at futex_unlock_pi, we currently do not clean up the user space futex. So the owner TID of the current owner (the unlocker) persists. That's observable inconsistent state, especially when the ownership of the pi state got transferred.

  Clean it up unconditionally.

  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: Kees Cook <keescook@chromium.org>
  Cc: Will Drewry <wad@chromium.org>
  Cc: Darren Hart <dvhart@linux.intel.com>
  Cc: stable@vger.kernel.org

  Conflicts: kernel/futex.c
* futex: Validate atomic acquisition in futex_lock_pi_atomic() (Thomas Gleixner, 2014-06-11, 1 file, -3/+11)
  We need to protect the atomic acquisition in the kernel against rogue user space which sets the user space futex to 0, so the kernel side acquisition succeeds while there is existing state in the kernel associated to the real owner.

  Verify whether the futex has waiters associated with kernel state. If it has, return -EINVAL. The state is corrupted already, so no point in cleaning it up. Subsequent calls will fail as well. Not our problem.

  [ tglx: Use futex_top_waiter() and explain why we do not need to try restoring the already corrupted user space state. ]

  Signed-off-by: Darren Hart <dvhart@linux.intel.com>
  Cc: Kees Cook <keescook@chromium.org>
  Cc: Will Drewry <wad@chromium.org>
  Cc: stable@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
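  A sketch of the validation in futex_lock_pi_atomic() (kernel/futex.c), following the changelog above:

	/* The cmpxchg acquired the futex because the word was 0, but we
	 * do not trust user space at all. */
	if (unlikely(!curval)) {
		/*
		 * If kernel state (a queued top waiter) already exists
		 * for this futex, the 0 -> TID transition was forged by
		 * rogue user space; fail instead of trying to repair it.
		 */
		return futex_top_waiter(hb, key) ? -EINVAL : 1;
	}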
* futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1) (Thomas Gleixner, 2014-06-11, 1 file, -0/+25)
  If uaddr == uaddr2, then we have broken the rule of only requeueing from a non-pi futex to a pi futex with this call. If we attempt this, then dangling pointers may be left for rt_waiter, resulting in an exploitable condition.

  This change brings futex_requeue() into line with futex_wait_requeue_pi(), which performs the same check as per commit 6f7b0a2a5 (futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi()).

  [ tglx: Compare the resulting keys as well, as uaddrs might be different depending on the mapping ]

  Fixes CVE-2014-3153.

  Reported-by: Pinkie Pie
  Signed-off-by: Will Drewry <wad@chromium.org>
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Cc: stable@vger.kernel.org
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
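  A sketch of the two checks added to futex_requeue() (kernel/futex.c); error paths abbreviated:

	if (requeue_pi) {
		/* Requeue PI only works from a non-pi futex to a pi
		 * futex: identical user addresses are forbidden. */
		if (uaddr1 == uaddr2)
			return -EINVAL;
	}
	...
	/* Different uaddrs can still map to the same futex (shared
	 * mappings), so compare the resolved keys as well. */
	if (requeue_pi && match_futex(&key1, &key2)) {
		ret = -EINVAL;
		goto out_put_keys;
	}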
* prctl: adds PR_SET_TIMERSLACK_PID for setting timer slack of an arbitrary thread (Ruchi Kandoi, 2014-04-24, 1 file, -0/+19)
  The second argument is similar to PR_SET_TIMERSLACK: if non-zero, the slack is set to that value, otherwise it is set to the default for the thread. Takes the PID of the thread as the third argument.

  This allows power/performance management software to set timer slack for other threads according to its policy for the thread (such as when the thread is designated foreground vs. background activity).

  Change-Id: I744d451ff4e60dae69f38f53948ff36c51c14a3f
  Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>

  Conflicts: include/linux/prctl.h, kernel/sys.c
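  A sketch of the new prctl case in kernel/sys.c, assuming the semantics described above (arg2 = slack in ns or 0 for the default, arg3 = target pid); the locking shown is illustrative:

	case PR_SET_TIMERSLACK_PID:
		rcu_read_lock();
		tsk = find_task_by_vpid((pid_t)arg3);
		if (!tsk) {
			rcu_read_unlock();
			return -EINVAL;
		}
		get_task_struct(tsk);
		rcu_read_unlock();

		if (arg2 <= 0)
			tsk->timer_slack_ns = tsk->default_timer_slack_ns;
		else
			tsk->timer_slack_ns = arg2;

		put_task_struct(tsk);
		error = 0;
		break;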
* Power: Changes the permission to read only for sysfs file /sys/kernel/wakeup_reasons/last_resume_reason (Ruchi Kandoi, 2014-04-24, 1 file, -3/+2)
  Change-Id: If25e8e416ee9726996518b58b6551a61dc1591e3
  Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
* POWER: fix compile warnings in log_wakeup_reason (Ruchi Kandoi, 2014-03-13, 1 file, -3/+4)
  Change I81addaf420f1338255c5d0638b0d244a99d777d1 introduced compile warnings; fix these.

  Change-Id: I05482a5335599ab96c0a088a7d175c8d4cf1cf69
  Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
* Power: add an API to log wakeup reasons (Ruchi Kandoi, 2014-03-13, 2 files, -0/+134)
  Add API log_wakeup_reason() and expose it to userspace via sysfs path /sys/kernel/wakeup_reasons/last_resume_reason.

  Change-Id: I81addaf420f1338255c5d0638b0d244a99d777d1
  Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
* perf: Treat attr.config as u64 in perf_swevent_init() (Tommi Rantala, 2013-07-03, 1 file, -1/+1)
  Trinity discovered that we fail to check all 64 bits of attr.config passed by user space, resulting in out-of-bounds access of the perf_swevent_enabled array in sw_perf_event_destroy().

  Introduced in commit b0a873ebb ("perf: Register PMU implementations").

  Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
  Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Cc: davej@redhat.com
  Cc: Paul Mackerras <paulus@samba.org>
  Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
  Link: http://lkml.kernel.org/r/1365882554-30259-1-git-send-email-tt.rantala@gmail.com
  Signed-off-by: Ingo Molnar <mingo@kernel.org>
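  The fix is a one-type change; a sketch of perf_swevent_init() (kernel/events/core.c), abridged:

	static int perf_swevent_init(struct perf_event *event)
	{
		/* Was `int event_id`: the high 32 bits of attr.config were
		 * dropped, and negative values slipped past the signed
		 * range check below, indexing perf_swevent_enabled out of
		 * bounds. */
		u64 event_id = event->attr.config;

		if (event->attr.type != PERF_TYPE_SOFTWARE)
			return -ENOENT;

		if (event_id >= PERF_COUNT_SW_MAX)
			return -ENOENT;
		/* ... */
	}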
* signal: always clear sa_restorer on execve (Kees Cook, 2013-04-09, 1 file, -0/+3)
  When the new signal handlers are set up, the location of sa_restorer is not cleared, leaking a parent process's address space location to children. This allows for a potential bypass of the parent's ASLR by examining the sa_restorer value returned when calling sigaction().

  Based on what should be considered "secret" about addresses, it only matters across the exec, not the fork (since the VMAs haven't changed until the exec). But since exec sets SIG_DFL and keeps sa_restorer, this is where it should be fixed.

  Given the few uses of sa_restorer, a "set" function was not written since this would be the only use. Instead, we use __ARCH_HAS_SA_RESTORER, as already done in other places.

  Example of the leak before applying this patch:

  $ cat /proc/$$/maps
  ...
  7fb9f3083000-7fb9f3238000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
  ...
  $ ./leak
  ...
  7f278bc74000-7f278be29000 r-xp 00000000 fd:01 404469 .../libc-2.15.so
  ...
  1 0 (nil) 0x7fb9f30b94a0
  2 4000000 (nil) 0x7f278bcaa4a0
  3 4000000 (nil) 0x7f278bcaa4a0
  4 0 (nil) 0x7fb9f30b94a0
  ...

  [akpm@linux-foundation.org: use SA_RESTORER for backportability]
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Reported-by: Emese Revfy <re.emese@gmail.com>
  Cc: Emese Revfy <re.emese@gmail.com>
  Cc: PaX Team <pageexec@freemail.hu>
  Cc: Al Viro <viro@zeniv.linux.org.uk>
  Cc: Oleg Nesterov <oleg@redhat.com>
  Cc: "Eric W. Biederman" <ebiederm@xmission.com>
  Cc: Serge Hallyn <serge.hallyn@canonical.com>
  Cc: Julien Tinnes <jln@google.com>
  Cc: <stable@vger.kernel.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Signed-off-by: Ed Tam <etam@google.com>
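  A sketch of the fix in flush_signal_handlers() (kernel/signal.c), which runs at execve:

	void flush_signal_handlers(struct task_struct *t, int force_default)
	{
		int i;
		struct k_sigaction *ka = &t->sighand->action[0];

		for (i = _NSIG; i != 0; i--) {
			if (force_default || ka->sa.sa_handler != SIG_IGN)
				ka->sa.sa_handler = SIG_DFL;
			ka->sa.sa_flags = 0;
	#ifdef __ARCH_HAS_SA_RESTORER
			/* The fix: drop the old image's trampoline address
			 * so it cannot leak the parent's layout across
			 * exec. */
			ka->sa.sa_restorer = NULL;
	#endif
			sigemptyset(&ka->sa.sa_mask);
			ka++;
		}
	}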
* wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED task (Oleg Nesterov, 2013-04-03, 1 file, -1/+2)
  wake_up_process() should never wakeup a TASK_STOPPED/TRACED task. Change it to use TASK_NORMAL and add the WARN_ON().

  TASK_ALL has no other users, probably can be killed.

  Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Michal Hocko <mhocko@suse.cz>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Signed-off-by: Ed Tam <etam@google.com>
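  The resulting wake_up_process() is short enough to quote as a sketch (kernel/sched/core.c):

	int wake_up_process(struct task_struct *p)
	{
		/* Stopped/traced tasks must only be resumed via SIGCONT or
		 * ptrace; warn if a caller tries to bypass that. */
		WARN_ON(task_is_stopped_or_traced(p));
		return try_to_wake_up(p, TASK_NORMAL, 0);
	}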
* ptrace: ensure arch_ptrace/ptrace_request can never race with SIGKILL (Oleg Nesterov, 2013-04-03, 2 files, -9/+55)
  putreg() assumes that the tracee is not running and pt_regs_access() can safely play with its stack. However a killed tracee can return from ptrace_stop() to the low-level asm code and do RESTORE_REST; this means that the debugger can actually read/modify the kernel stack until the tracee does SAVE_REST again.

  set_task_blockstep() can race with SIGKILL too, and in some sense this race is even worse: the very fact the tracee can be woken up breaks the logic.

  As Linus suggested, we can clear TASK_WAKEKILL around the arch_ptrace() call; this ensures that nobody can ever wake up the tracee while the debugger looks at it. Not only does this fix the mentioned problems, we can do some cleanups/simplifications in arch_ptrace() paths.

  Probably ptrace_unfreeze_traced() needs more callers; for example, it makes sense to make the tracee killable for oom-killer before access_process_vm().

  While at it, add the comment into may_ptrace_stop() to explain why ptrace_stop() still can't rely on SIGKILL and signal_pending_state().

  Reported-by: Salman Qazi <sqazi@google.com>
  Reported-by: Suleiman Souhlal <suleiman@google.com>
  Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
  Signed-off-by: Oleg Nesterov <oleg@redhat.com>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Signed-off-by: Michal Hocko <mhocko@suse.cz>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Signed-off-by: Ed Tam <etam@google.com>

  Conflicts: kernel/ptrace.c