path: root/fs
Commit message  (Author, Age, Files, Lines)
* Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm  (flar2, 2017-09-25, 1 file, -1/+5)
|\
| * fs/exec: fix use after free in execve  (Andrea Arcangeli, 2017-09-11, 1 file, -1/+5)

  "file" can already be freed if bprm->file is NULL after search_binary_handler() returns; binfmt_script will do exactly that, for example. If the VM reuses the file after fput() runs, this will result in a use after free. So obtain d_is_su before search_binary_handler() runs.

  This should explain this crash:

    [25333.009554] Unable to handle kernel NULL pointer dereference at virtual address 00000185
    [..]
    [25333.009918] [2: am:21861] PC is at do_execve+0x354/0x474

  Change-Id: I2a8a814d1c0aa75625be83cb30432cf13f1a0681
  Signed-off-by: Kevin F. Haggerty <haggertk@lineageos.org>
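  A minimal sketch of the ordering fix described above, assuming the d_is_su()/PF_SU helpers from the LineageOS su patch; the exact surrounding code in do_execve() may differ:

    /* Capture everything we need from bprm->file *before* calling
     * search_binary_handler(), because the handler may drop the last
     * reference and free the file (binfmt_script does this). */
    bool is_su = d_is_su(file->f_path.dentry);   /* read first              */
    retval = search_binary_handler(bprm);        /* may free bprm->file     */
    if (retval >= 0 && is_su && capable(CAP_SYS_ADMIN))
            current->flags |= PF_SU;             /* illustrative use only   */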
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm  (flar2, 2017-08-23, 2 files, -3/+39)
|\|
| * f2fs: sanity check checkpoint segno and blkoff  (Jin Qian, 2017-08-07, 1 file, -0/+18)

  Make sure segno and blkoff read from raw image are valid.

  Cc: stable@vger.kernel.org
  Signed-off-by: Jin Qian <jinqian@google.com>
  [Jaegeuk Kim: adjust minor coding style]
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  Change-Id: Ie2505c071233c1a9dec2729fe1ad467689a1b7a2
  (cherry picked from commit 15d3042a937c13f5d9244241c7a9c8416ff6e82a)
| * f2fs: sanity check segment count  (Jin Qian, 2017-08-07, 1 file, -0/+7)

  F2FS uses 4 bytes to represent a block address. As a result, the supported disk size is 16 TB, which equals 16 * 1024 * 1024 / 2 segments.

  Signed-off-by: Jin Qian <jinqian@google.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  Change-Id: I16b3cd6279bff1a221781a80b9b34744c9e7098f
  (cherry picked from commit b9dd46188edc2f0d1f37328637860bb65a771124)
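  The arithmetic behind that limit, sketched with the usual f2fs defaults of 4 KB blocks and 2 MB (512-block) segments; the macro names here are illustrative, not the driver's own:

    /* 32-bit block addresses and 4 KB blocks -> 2^32 * 4 KB = 16 TB of space.
     * One segment = 512 blocks = 2 MB, so:
     *   max segments = 16 TB / 2 MB = 16 * 1024 * 1024 / 2 = 8388608 */
    #define MAX_BLOCK_ADDRS   (1ULL << 32)
    #define BLOCKS_PER_SEG    512ULL
    #define MAX_SEGMENT_COUNT (MAX_BLOCK_ADDRS / BLOCKS_PER_SEG)   /* 8388608 */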
| * timerfd: Protect the might cancel mechanism proper  (Thomas Gleixner, 2017-08-07, 1 file, -3/+14)

  The handling of the might_cancel queueing is not properly protected, so parallel operations on the file descriptor can race with each other and lead to list corruptions or use after free.

  Protect the context for these operations with a separate lock. The wait queue lock cannot be reused for this because that would create a lock inversion scenario vs. the cancel lock. Replacing might_cancel with an atomic (atomic_t or atomic bit) does not help either because it still can race vs. the actual list operation.

  Reported-by: Dmitry Vyukov <dvyukov@google.com>
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Cc: "linux-fsdevel@vger.kernel.org"
  Cc: syzkaller <syzkaller@googlegroups.com>
  Cc: Al Viro <viro@zeniv.linux.org.uk>
  Cc: linux-fsdevel@vger.kernel.org
  Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  Change-Id: I1f2d38a919ceb1ca1c7c9471dece0c1126383912
  (cherry picked from commit 1e38da300e1e395a15048b0af1e5305bd91402f6)
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm  (flar2, 2017-07-21, 8 files, -29/+73)
|\|
| * udf: Check path length when reading symlink  (Jan Kara, 2017-07-10, 5 files, -20/+48)

  The symlink reading code does not check whether the resulting path fits into the page provided by the generic code. This isn't as easy as just checking the symlink size because of the various encoding conversions we perform on the path. So we have to check whether there is still enough space in the buffer on the fly.

  Change-Id: Id56d129029eaf2e651cf7236103fb73aa540ae1f
  CC: stable@vger.kernel.org
  Reported-by: Carl Henrik Lunde <chlunde@ping.uio.no>
  Signed-off-by: Jan Kara <jack@suse.cz>
| * fs/exec.c: account for argv/envp pointers  (Kees Cook, 2017-07-04, 1 file, -4/+24)

  commit 98da7d08850fb8bdeb395d6368ed15753304aa0c upstream.

  When limiting the argv/envp strings during exec to 1/4 of the stack limit, the storage of the pointers to the strings was not included. This means that an exec with huge numbers of tiny strings could eat 1/4 of the stack limit in strings and then additional space would be later used by the pointers to the strings.

  For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721 single-byte strings would consume less than 2MB of stack, the max (8MB / 4) amount allowed, but the pointers to the strings would consume the remaining additional stack space (1677721 * 4 == 6710884). The result (1677721 + 6710884 == 8388605) would exhaust stack space entirely. Controlling this stack exhaustion could result in pathological behavior in setuid binaries (CVE-2017-1000365).

  [akpm@linux-foundation.org: additional commenting from Kees]
  Fixes: b6a2fea39318 ("mm: variable length argument support")
  Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Acked-by: Rik van Riel <riel@redhat.com>
  Acked-by: Michal Hocko <mhocko@suse.com>
  Cc: Alexander Viro <viro@zeniv.linux.org.uk>
  Cc: Qualys Security Advisory <qsa@qualys.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Change-Id: I2e01d7be2d52415264ff48c632bfe307008c4e03
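  A rough sketch of the accounting idea the commit describes; the bookkeeping variable is hypothetical and this is not the exact upstream diff:

    /* Charge the argv/envp pointer array against the same 1/4-of-stack
     * budget as the string bytes themselves. */
    unsigned long limit    = _STK_LIM / 4;                          /* e.g. 8 MB / 4 */
    unsigned long ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);

    if (ptr_size >= limit)
            return -E2BIG;
    limit -= ptr_size;            /* what remains is the budget for string bytes */
    if (bytes_copied > limit)     /* bytes_copied: hypothetical running total    */
            return -E2BIG;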
| * mm: larger stack guard gap, between vmas  (Hugh Dickins, 2017-07-02, 2 files, -5/+1)

  commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

  Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs like gid_t buffer[NGROUPS_MAX] which is 256kB, or stack strings with MAX_ARG_STRLEN.

  This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunately.

  Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot.

  One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units).

  Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode.

  Instead of keeping the gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and the vma tree's subtree_gap support for that.

  Change-Id: I611023b0bfe1cab7b3e5da13e331a7baaaaf6eb0
  Original-patch-by: Oleg Nesterov <oleg@redhat.com>
  Original-patch-by: Michal Hocko <mhocko@suse.com>
  Signed-off-by: Hugh Dickins <hughd@google.com>
  [wt: backport to 4.11: adjust context]
  [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
  [wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes]
  [wt: backport to 3.18: adjust context ; no FOLL_POPULATE ; s390 uses generic arch_get_unmapped_area()]
  [wt: backport to 3.16: adjust context]
  [wt: backport to 3.10: adjust context ; code logic in PARISC's arch_get_unmapped_area() wasn't found ; code inserted into expand_upwards() and expand_downwards() runs under anon_vma lock; changes for gup.c:faultin_page go to memory.c:__get_user_pages(); included Hugh Dickins' fixes]
  Signed-off-by: Willy Tarreau <w@1wt.eu>
  Signed-off-by: Flex1911 <dedsa2002@gmail.com>
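  The shape of the vm_start_gap() helper the message refers to, sketched along the lines of the upstream change (details may differ in this backport):

    /* Treat the guard gap as living just below a downward-growing stack vma. */
    static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
    {
            unsigned long vm_start = vma->vm_start;

            if (vma->vm_flags & VM_GROWSDOWN) {
                    vm_start -= stack_guard_gap;    /* default 1 MB, tunable    */
                    if (vm_start > vma->vm_start)   /* guard against underflow  */
                            vm_start = 0;
            }
            return vm_start;
    }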
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm  (flar2, 2017-07-03, 17 files, -103/+99)
|\|
| * splice: introduce FMODE_SPLICE_READ and FMODE_SPLICE_WRITE  (Linus Torvalds, 2017-06-26, 2 files, -0/+10)

  Introduce FMODE_SPLICE_READ and FMODE_SPLICE_WRITE. These modes check whether it is legal to read or write a file using splice. Both get automatically set on regular files and are not checked when a 'struct file_operations' includes the splice_{read,write} methods.

  Change-Id: Icb601a7db12d4e07a62d790edaa8a9a5aed3ba2a
  Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
  Cc: Al Viro <viro@zeniv.linux.org.uk>
  Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
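  A sketch of how such a gate would look in the generic splice paths; simplified, not necessarily the exact patch:

    /* Hypothetical check in do_splice(): refuse files that were not flagged
     * as splice-capable when they were opened. */
    if (unlikely(!(in->f_mode & FMODE_SPLICE_READ)))
            return -EINVAL;
    if (unlikely(!(out->f_mode & FMODE_SPLICE_WRITE)))
            return -EINVAL;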
| * fs: fuse: Add replacement for CMA pages into the LRU cache  (Laura Abbott, 2017-06-26, 1 file, -0/+2)

  CMA pages are currently replaced in the FUSE file system since FUSE may hold on to CMA pages for a long time, preventing migration. The replacement page is added to the file cache but not the LRU cache. This may prevent the page from being properly aged and dropped, creating poor performance under tight memory conditions. Fix this by adding the new page to the LRU cache after creation.

  Change-Id: Ib349abf1024d48386b835335f3fbacae040b6241
  CRs-Fixed: 586855
  Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
| * BACKPORT: posix_acl: Clear SGID bit when setting file permissions  (Jan Kara, 2017-06-26, 14 files, -103/+87)

  [Partially applied during f2fs inclusion, changes now aligned to upstream]
  (cherry pick from commit 073931017b49d9458aa351605b43a7e34598caef)

  When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows the check in chmod(2) to be bypassed. Fix that.

  NB: the conflict resolution included extending the change to all visible users of the near-deprecated function posix_acl_equiv_mode, replaced with posix_acl_update_mode. We did not resolve the ACL leak in this CL; it requires additional upstream fixes.

  References: CVE-2016-7097
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Jeff Layton <jlayton@redhat.com>
  Signed-off-by: Jan Kara <jack@suse.cz>
  Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  Bug: 32458736
  [haggertk]: Backport to 3.4/msm8974
   * convert use of capable_wrt_inode_uidgid to capable
  Change-Id: I19591ad452cc825ac282b3cfd2daaa72aa9a1ac1
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm  (flar2, 2017-06-12, 1 file, -5/+5)
|\|
| * fs/proc/array.c: make safe access to group_leader  (Adrian Salido, 2017-06-07, 1 file, -5/+5)

  As mentioned in commit 52ee2dfdd4f51cf422ea6a96a0846dc94244aa37 ("pids: refactor vnr/nr_ns helpers to make them safe"), the *_nr_ns helpers used to be buggy. That commit addresses most of the helpers but is missing task_tgid_xxx(). Without this protection there is a possible use after free, reported by a kasan-instrumented kernel:

    ==================================================================
    BUG: KASAN: use-after-free in task_tgid_nr_ns+0x2c/0x44 at addr ***
    Read of size 8 by task cat/2472
    CPU: 1 PID: 2472 Comm: cat Tainted: ****
    Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
    Call trace:
    [<ffffffc00020ad2c>] dump_backtrace+0x0/0x17c
    [<ffffffc00020aec0>] show_stack+0x18/0x24
    [<ffffffc0011573d0>] dump_stack+0x94/0x100
    [<ffffffc0003c7dc0>] kasan_report+0x308/0x554
    [<ffffffc0003c7518>] __asan_load8+0x20/0x7c
    [<ffffffc00025a54c>] task_tgid_nr_ns+0x28/0x44
    [<ffffffc00046951c>] proc_pid_status+0x444/0x1080
    [<ffffffc000460f60>] proc_single_show+0x8c/0xdc
    [<ffffffc0004081b0>] seq_read+0x2e8/0x6f0
    [<ffffffc0003d1420>] vfs_read+0xd8/0x1e0
    [<ffffffc0003d1b98>] SyS_read+0x68/0xd4

  This race condition is addressed by accessing group_leader while holding the RCU lock and using the now-safe helpers introduced in the commit mentioned above.

  Signed-off-by: Adrian Salido <salidoa@google.com>
  Change-Id: I4315217922dda375a30a3581c0c1740dda7b531b
  Bug: 31495866
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm  (flar2, 2017-06-05, 3 files, -0/+35)
|\|
| * kernel: Fix potential refcount leak in su check  (Tom Marshall, 2017-05-19, 1 file, -1/+3)

  Change-Id: I3d241ae805ba708c18bccfd5e5d6cdcc8a5bc1c8
| * kernel: Only expose su when daemon is running  (Tom Marshall, 2017-05-19, 3 files, -0/+33)

  Note: this is for the 3.4 kernel.

  It has been claimed that the PG implementation of 'su' has security vulnerabilities even when disabled. Unfortunately, the people that find these vulnerabilities often like to keep them private so they can profit from exploits while leaving users exposed to malicious hackers. In order to reduce the attack surface for vulnerabilities, it is therefore necessary to make 'su' completely inaccessible when it is not in use (except by the root and system users).

  Change-Id: Ia7d50ba46c3d932c2b0ca5fc8e9ec69ec9045f85
* | Merge remote-tracking branch 'lineageos/cm-14.1' into ElementalX-6.00-cm  (flar2, 2017-04-25, 2 files, -4/+12)
|\|
| * splice: Apply generic position and size checks to each write  (Ben Hutchings, 2017-04-17, 2 files, -4/+12)

  We need to check the position and size of file writes against various limits, using generic_write_checks(). This was not being done for the splice write path. It was fixed upstream by commit 8d0207652cbe ("->splice_write() via ->write_iter()") but we can't apply that.

  CVE-2014-7822
  Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
  Change-Id: Ic7e7bd78d4594c993c9684d32a0ddeaf70165bce
  (cherry picked from commit 894c6350eaad7e613ae267504014a456e00a3e2a)
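  Roughly what that check looks like when applied in the splice write path; a sketch using the pre-4.1 generic_write_checks() signature, the backport's exact call site may differ:

    /* Apply the same position/size limits a normal write would get before
     * splicing data into the file. */
    static ssize_t splice_write_checked(struct file *out, loff_t *ppos, size_t len)
    {
            struct inode *inode = out->f_mapping->host;
            size_t count = len;
            int ret;

            ret = generic_write_checks(out, ppos, &count, S_ISBLK(inode->i_mode));
            if (ret)
                    return ret;
            return count;   /* possibly clamped length to actually splice */
    }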
| * ecryptfs: don't allow mmap when the lower fs doesn't support it  (Jeff Mahoney, 2017-03-03, 1 file, -0/+9)

  There are legitimate reasons to disallow mmap on certain files, notably in sysfs or procfs. We shouldn't emulate mmap support on file systems that don't offer support natively.

  CVE-2016-1583
  Signed-off-by: Jeff Mahoney <jeffm@suse.com>
  Cc: stable@vger.kernel.org
  [tyhicks: clean up f_op check by using ecryptfs_file_to_lower()]
  Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
  (adapted from commit f0fe970df3838c202ef6c07a4c2b36838ef0a88b)
  Change-Id: I3eb979e9476847834eeea0ecbaf07a53329a7219
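  The gist of the check, sketched after the description above (simplified from the adapted upstream patch):

    static int ecryptfs_mmap(struct file *file, struct vm_area_struct *vma)
    {
            struct file *lower_file = ecryptfs_file_to_lower(file);

            /* Don't emulate mmap if the lower filesystem can't do it natively. */
            if (!lower_file->f_op->mmap)
                    return -ENODEV;
            return generic_file_mmap(file, vma);
    }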
| * ext4: validate s_first_meta_bg at mount time  (Eryu Guan, 2017-03-03, 1 file, -0/+9)

  Ralf Spenneberg reported that he hit a kernel crash when mounting a modified ext4 image. And it turns out that kernel crashed when calculating fs overhead (ext4_calculate_overhead()), this is because the image has very large s_first_meta_bg (debug code shows it's 842150400), and ext4 overruns the memory in count_overhead() when setting bitmap buffer, which is PAGE_SIZE.

    ext4_calculate_overhead():
      buf = get_zeroed_page(GFP_NOFS);   <=== PAGE_SIZE buffer
      blks = count_overhead(sb, i, buf);

    count_overhead():
      for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) {  <=== j = 842150400
              ext4_set_bit(EXT4_B2C(sbi, s++), buf);    <=== buffer overrun
              count++;
      }

  This can be reproduced easily for me by this script:

    #!/bin/bash
    rm -f fs.img
    mkdir -p /mnt/ext4
    fallocate -l 16M fs.img
    mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
    debugfs -w -R "ssv first_meta_bg 842150400" fs.img
    mount -o loop fs.img /mnt/ext4

  Fix it by validating s_first_meta_bg first at mount time, and refusing to mount if its value exceeds the largest possible meta_bg number.

  Reported-by: Ralf Spenneberg <ralf@os-t.de>
  Signed-off-by: Eryu Guan <guaneryu@gmail.com>
  Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  Reviewed-by: Andreas Dilger <adilger@dilger.ca>
  (cherry picked from commit 3a4b77cd47bb837b8557595ec7425f281f2ca1fe)
  (minor backport adapted from cf851ad35fd1e9c7b8ed00741eca613bc1a9c8c8)
  Change-Id: If183ad4a873705c9a0312087577705298b3586fe
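  A sketch of the mount-time validation described above, close to but not necessarily identical to the cherry-picked patch (the older backport spells the feature test differently):

    /* In ext4_fill_super(), after db_count has been computed: reject a
     * superblock whose first_meta_bg points past the group descriptor blocks. */
    if (ext4_has_feature_meta_bg(sb) &&
        le32_to_cpu(es->s_first_meta_bg) > db_count) {
            ext4_msg(sb, KERN_WARNING,
                     "first meta block group too large: %u "
                     "(group descriptor block count %u)",
                     le32_to_cpu(es->s_first_meta_bg), db_count);
            goto failed_mount;
    }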
| * BACKPORT: aio: mark AIO pseudo-fs noexec  (Nick Desaulniers, 2017-03-03, 1 file, -0/+4)

  This ensures that do_mmap() won't implicitly make AIO memory mappings executable if the READ_IMPLIES_EXEC personality flag is set. Such behavior is problematic because the security_mmap_file LSM hook doesn't catch this case, potentially permitting an attacker to bypass a W^X policy enforced by SELinux.

  I have tested the patch on my machine. To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/personality.h>
    #include <linux/aio_abi.h>
    #include <err.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/syscall.h>

    int main(void) {
            personality(READ_IMPLIES_EXEC);
            aio_context_t ctx = 0;
            if (syscall(__NR_io_setup, 1, &ctx))
                    err(1, "io_setup");
            char cmd[1000];
            sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'", (int)getpid());
            system(cmd);
            return 0;
    }

  In the output, "rw-s" is good, "rwxs" is bad.

  Signed-off-by: Jann Horn <jann@thejh.net>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  (cherry picked from commit 22f6b4d34fcf039c63a94e7670e0da24f8575a5a)
  (cherry picked from googlesource commit bc02d1d9f5d0e0610504c24b05fef54726ba1a1b)
  Bug: 31711619
  Change-Id: I9f2872703bef240d6b82320c744529459bb076dc
| * isofs: Fix infinite looping over CE entries  (Jan Kara, 2017-03-03, 1 file, -0/+6)

  Rock Ridge extensions define so-called Continuation Entries (CE) which describe where further Rock Ridge data is located. A corrupted isofs image can contain an arbitrarily long chain of these, including one containing a loop, and thus cause the kernel to end up in an infinite loop when traversing these entries. Limit the traversal to 32 entries, which should be more than enough space to store all the Rock Ridge data.

  Reported-by: P J P <ppandit@redhat.com>
  CC: stable@vger.kernel.org
  Signed-off-by: Jan Kara <jack@suse.cz>
  (cherry picked from commit f54e18f1b831c92f6512d2eedb224cd63d607d3d)
  Change-Id: I62cd59b27ac11fbf0a04b0d02874df7f390338bb
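  The fix amounts to a bounded walk over the CE chain, roughly like this (the loop condition is illustrative, only the 32-entry cap comes from the commit):

    /* Bail out instead of following an unbounded, possibly looping, CE chain. */
    #define RR_MAX_CE_ENTRIES 32

    int ce_count = 0;
    while (have_continuation_entry) {          /* illustrative condition */
            if (++ce_count > RR_MAX_CE_ENTRIES)
                    return -EIO;               /* corrupted image: give up */
            /* ... read the next Continuation Entry and keep parsing ... */
    }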
* | ecryptfs: don't allow mmap when the lower fs doesn't support it  (Jeff Mahoney, 2017-03-12, 1 file, -0/+9)

  There are legitimate reasons to disallow mmap on certain files, notably in sysfs or procfs. We shouldn't emulate mmap support on file systems that don't offer support natively.

  CVE-2016-1583
  Signed-off-by: Jeff Mahoney <jeffm@suse.com>
  Cc: stable@vger.kernel.org
  [tyhicks: clean up f_op check by using ecryptfs_file_to_lower()]
  Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
  (adapted from commit f0fe970df3838c202ef6c07a4c2b36838ef0a88b)
  Change-Id: I3eb979e9476847834eeea0ecbaf07a53329a7219
* | ext4: validate s_first_meta_bg at mount time  (Eryu Guan, 2017-03-12, 1 file, -0/+9)

  Ralf Spenneberg reported that he hit a kernel crash when mounting a modified ext4 image. And it turns out that kernel crashed when calculating fs overhead (ext4_calculate_overhead()), this is because the image has very large s_first_meta_bg (debug code shows it's 842150400), and ext4 overruns the memory in count_overhead() when setting bitmap buffer, which is PAGE_SIZE.

    ext4_calculate_overhead():
      buf = get_zeroed_page(GFP_NOFS);   <=== PAGE_SIZE buffer
      blks = count_overhead(sb, i, buf);

    count_overhead():
      for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) {  <=== j = 842150400
              ext4_set_bit(EXT4_B2C(sbi, s++), buf);    <=== buffer overrun
              count++;
      }

  This can be reproduced easily for me by this script:

    #!/bin/bash
    rm -f fs.img
    mkdir -p /mnt/ext4
    fallocate -l 16M fs.img
    mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
    debugfs -w -R "ssv first_meta_bg 842150400" fs.img
    mount -o loop fs.img /mnt/ext4

  Fix it by validating s_first_meta_bg first at mount time, and refusing to mount if its value exceeds the largest possible meta_bg number.

  Reported-by: Ralf Spenneberg <ralf@os-t.de>
  Signed-off-by: Eryu Guan <guaneryu@gmail.com>
  Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  Reviewed-by: Andreas Dilger <adilger@dilger.ca>
  (cherry picked from commit 3a4b77cd47bb837b8557595ec7425f281f2ca1fe)
  (minor backport adapted from cf851ad35fd1e9c7b8ed00741eca613bc1a9c8c8)
  Change-Id: If183ad4a873705c9a0312087577705298b3586fe
* | BACKPORT: aio: mark AIO pseudo-fs noexec  (Nick Desaulniers, 2017-03-12, 1 file, -0/+4)

  This ensures that do_mmap() won't implicitly make AIO memory mappings executable if the READ_IMPLIES_EXEC personality flag is set. Such behavior is problematic because the security_mmap_file LSM hook doesn't catch this case, potentially permitting an attacker to bypass a W^X policy enforced by SELinux.

  I have tested the patch on my machine. To test the behavior, compile and run this:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/personality.h>
    #include <linux/aio_abi.h>
    #include <err.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/syscall.h>

    int main(void) {
            personality(READ_IMPLIES_EXEC);
            aio_context_t ctx = 0;
            if (syscall(__NR_io_setup, 1, &ctx))
                    err(1, "io_setup");
            char cmd[1000];
            sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'", (int)getpid());
            system(cmd);
            return 0;
    }

  In the output, "rw-s" is good, "rwxs" is bad.

  Signed-off-by: Jann Horn <jann@thejh.net>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  (cherry picked from commit 22f6b4d34fcf039c63a94e7670e0da24f8575a5a)
  (cherry picked from googlesource commit bc02d1d9f5d0e0610504c24b05fef54726ba1a1b)
  Bug: 31711619
  Change-Id: I9f2872703bef240d6b82320c744529459bb076dc
* | isofs: Fix infinite looping over CE entries  (Jan Kara, 2017-03-12, 1 file, -0/+6)

  Rock Ridge extensions define so-called Continuation Entries (CE) which describe where further Rock Ridge data is located. A corrupted isofs image can contain an arbitrarily long chain of these, including one containing a loop, and thus cause the kernel to end up in an infinite loop when traversing these entries. Limit the traversal to 32 entries, which should be more than enough space to store all the Rock Ridge data.

  Reported-by: P J P <ppandit@redhat.com>
  CC: stable@vger.kernel.org
  Signed-off-by: Jan Kara <jack@suse.cz>
  (cherry picked from commit f54e18f1b831c92f6512d2eedb224cd63d607d3d)
  Change-Id: I62cd59b27ac11fbf0a04b0d02874df7f390338bb
* | enable fsync by default  (flar2, 2016-11-22, 1 file, -1/+1)

  Signed-off-by: flar2 <asegaert@gmail.com>
* | switch do_fsync() to fget_light()  (Al Viro, 2016-11-22, 1 file, -2/+3)

  Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  Signed-off-by: flar2 <asegaert@gmail.com>
* | fs: sync: add missing return if fsync is disabled from userspace.  (Francisco Franco, 2016-11-22, 1 file, -0/+3)

  Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
  Signed-off-by: flar2 <asegaert@gmail.com>
* | Added fsync on/off support.  (franciscofranco, 2016-11-22, 1 file, -0/+25)

  Signed-off-by: Francisco Franco <franciscofranco.1990@gmail.com>
  Signed-off-by: flar2 <asegaert@gmail.com>
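  Neither of the two fsync-toggle commits above carries a description; the usual shape of this feature in franco/flar2 kernels is an early return in the sync syscalls, sketched here with an assumed toggle name:

    /* If the user has disabled fsync via the exposed toggle, report success
     * without doing any work (the "missing return" fixed above). */
    extern bool fsync_enabled;              /* assumed userspace-visible toggle */

    SYSCALL_DEFINE1(fsync, unsigned int, fd)
    {
            if (!fsync_enabled)
                    return 0;
            return do_fsync(fd, 0);
    }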
* | exFAT support  (flar2, 2016-11-22, 28 files, -0/+12322)

  Signed-off-by: flar2 <asegaert@gmail.com>
* | AIO: Don't plug the I/O queue in do_io_submit()  (flar2, 2016-11-22, 1 file, -4/+0)
|/
  Signed-off-by: flar2 <asegaert@gmail.com>
* block: fix use-after-free in sys_ioprio_get()  (Omar Sandoval, 2016-11-11, 1 file, -0/+2)

  get_task_ioprio() accesses the task->io_context without holding the task lock and thus can race with exit_io_context(), leading to a use-after-free. The reproducer below hits this within a few seconds on my 4-core QEMU VM:

    int main(int argc, char **argv)
    {
            pid_t pid, child;
            long nproc, i;

            /* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */
            syscall(SYS_ioprio_set, 1, 0, 0x6000);

            nproc = sysconf(_SC_NPROCESSORS_ONLN);

            for (i = 0; i < nproc; i++) {
                    pid = fork();
                    assert(pid != -1);
                    if (pid == 0) {
                            for (;;) {
                                    pid = fork();
                                    assert(pid != -1);
                                    if (pid == 0) {
                                            _exit(0);
                                    } else {
                                            child = wait(NULL);
                                            assert(child == pid);
                                    }
                            }
                    }

                    pid = fork();
                    assert(pid != -1);
                    if (pid == 0) {
                            for (;;) {
                                    /* ioprio_get(IOPRIO_WHO_PGRP, 0); */
                                    syscall(SYS_ioprio_get, 2, 0);
                            }
                    }
            }

            for (;;) {
                    /* ioprio_get(IOPRIO_WHO_PGRP, 0); */
                    syscall(SYS_ioprio_get, 2, 0);
            }

            return 0;
    }

  This gets us KASAN dumps like this:

    [ 35.526914] ==================================================================
    [ 35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c
    [ 35.530009] Read of size 2 by task ioprio-gpf/363
    [ 35.530009] =============================================================================
    [ 35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected
    [ 35.530009] -----------------------------------------------------------------------------
    [ 35.530009] Disabling lock debugging due to kernel taint
    [ 35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360
    [ 35.530009] ___slab_alloc+0x55d/0x5a0
    [ 35.530009] __slab_alloc.isra.20+0x2b/0x40
    [ 35.530009] kmem_cache_alloc_node+0x84/0x200
    [ 35.530009] create_task_io_context+0x2b/0x370
    [ 35.530009] get_task_io_context+0x92/0xb0
    [ 35.530009] copy_process.part.8+0x5029/0x5660
    [ 35.530009] _do_fork+0x155/0x7e0
    [ 35.530009] SyS_clone+0x19/0x20
    [ 35.530009] do_syscall_64+0x195/0x3a0
    [ 35.530009] return_from_SYSCALL_64+0x0/0x6a
    [ 35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060
    [ 35.530009] __slab_free+0x27b/0x3d0
    [ 35.530009] kmem_cache_free+0x1fb/0x220
    [ 35.530009] put_io_context+0xe7/0x120
    [ 35.530009] put_io_context_active+0x238/0x380
    [ 35.530009] exit_io_context+0x66/0x80
    [ 35.530009] do_exit+0x158e/0x2b90
    [ 35.530009] do_group_exit+0xe5/0x2b0
    [ 35.530009] SyS_exit_group+0x1d/0x20
    [ 35.530009] entry_SYSCALL_64_fastpath+0x1a/0xa4
    [ 35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080
    [ 35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001
    [ 35.530009] ==================================================================

  Fix it by grabbing the task lock while we poke at the io_context.

  Change-Id: I4261aaf076fab943a80a45b0a77e023aa4ecbbd8
  Cc: stable@vger.kernel.org
  Reported-by: Dmitry Vyukov <dvyukov@google.com>
  Signed-off-by: Omar Sandoval <osandov@fb.com>
  Signed-off-by: Jens Axboe <axboe@fb.com>
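  The fix the message describes is essentially the following, sketched from the upstream commit rather than verified against this particular tree:

    static int get_task_ioprio(struct task_struct *p)
    {
            int ret;

            ret = security_task_getioprio(p);
            if (ret)
                    goto out;
            ret = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, IOPRIO_NORM);
            task_lock(p);                   /* keep exit_io_context() away */
            if (p->io_context)
                    ret = p->io_context->ioprio;
            task_unlock(p);
    out:
            return ret;
    }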
* fs: ext4: disable support for fallocate FALLOC_FL_PUNCH_HOLE  (Nick Desaulniers, 2016-10-31, 1 file, -0/+7)

  Bug: 28760453
  Change-Id: I019c2de559db9e4b95860ab852211b456d78c4ca
  Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
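  The commit carries no description; disabling the flag presumably reduces to an early rejection in ext4_fallocate(), along these lines (a sketch, not the actual diff):

    long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
    {
            /* Refuse hole punching outright; other fallocate modes proceed as before. */
            if (mode & FALLOC_FL_PUNCH_HOLE)
                    return -EOPNOTSUPP;
            /* ... existing fallocate handling ... */
    }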
* mnt: Fail collect_mounts when applied to unmounted mounts  (Eric W. Biederman, 2016-10-31, 1 file, -2/+5)

  The only users of collect_mounts are in audit_tree.c.

  In audit_trim_trees and audit_add_tree_rule the path passed into collect_mounts is generated from kern_path passed an audit_tree pathname which is guaranteed to be an absolute path. In those cases collect_mounts is obviously intended to work on mounted paths, and if a race results in paths that are unmounted when collect_mounts is called, it is reasonable to fail early.

  The paths passed into audit_tag_tree don't have the absolute path check. But they are used to play with fsnotify and otherwise interact with the audit_trees, so again operating only on mounted paths appears reasonable.

  Avoid having to worry about what happens when we try to audit unmounted filesystems by restricting collect_mounts to mounts that appear in the mount tree.

  Change-Id: I2edfee6d6951a2179ce8f53785b65ddb1eb95629
  Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* kthread: Prevent unpark race which puts threads on the wrong cpu  (Thomas Gleixner, 2016-10-29, 1 file, -0/+1)

  The smpboot threads rely on the park/unpark mechanism which binds per cpu threads on a particular core. Though the functionality is racy:

    CPU0                      CPU1                                CPU2
    unpark(T)                                                     wake_up_process(T)
      clear(SHOULD_PARK)
                              T runs
                              leave parkme() due to !SHOULD_PARK
                              bind_to(CPU2)
                              BUG_ON(wrong CPU)

  We cannot let the tasks move themselves to the target CPU as one of those tasks is actually the migration thread itself, which requires that it starts running on the target cpu right away. The solution to this problem is to prevent wakeups in park mode which are not from unpark(). That way we can guarantee that the association of the task to the target cpu is working correctly. Add a new task state (TASK_PARKED) which prevents other wakeups and use this state explicitly for the unpark wakeup.

  Peter noticed: Also, since the task state is visible to userspace and all the parked tasks are still in the PID space, its a good hint in ps and friends that these tasks aren't really there for the moment.

  The migration thread has another related issue.

    CPU0                          CPU1
    Bring up CPU2
    create_thread(T)
    park(T)
      wait_for_completion()
                                  parkme()
                                  complete()
    sched_set_stop_task()
                                  schedule(TASK_PARKED)

  The sched_set_stop_task() call is issued while the task is on the runqueue of CPU1 and that confuses the hell out of the stop_task class on that cpu. So we need the same synchronization before sched_set_stop_task().

  Change-Id: I9ad6fbe65992ad5b5cb9a252470a56ec51a4ff4f
  Reported-by: Dave Jones <davej@redhat.com>
  Reported-and-tested-by: Dave Hansen <dave@sr71.net>
  Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
  Acked-by: Peter Ziljstra <peterz@infradead.org>
  Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
  Cc: dhillf@gmail.com
  Cc: Ingo Molnar <mingo@kernel.org>
  Cc: stable@vger.kernel.org
  Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304091635430.21884@ionos
  Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* f2fs: set fsync mark only for the last dnode  (Jaegeuk Kim, 2016-10-29, 4 files, -20/+113)

  In order to give atomic writes, we should consider power failure during sync_node_pages in fsync. So, this patch marks fsync flag only in the last dnode block.

  Change-Id: Ib44a91bf820f6631fe359a8ac430ede77ceda403
  Acked-by: Chao Yu <yuchao0@huawei.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: report unwritten status in fsync_node_pages  (Jaegeuk Kim, 2016-10-29, 2 files, -8/+9)

  The fsync_node_pages should return pass or failure so that the user can know whether fsync completed or not.

  Change-Id: I3d588c44ad7452e66d3d6a795f2060de75fd5d0f
  Acked-by: Chao Yu <yuchao0@huawei.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: flush dirty pages before starting atomic writes  (Jaegeuk Kim, 2016-10-29, 1 file, -1/+10)

  If somebody wrote some data before atomic writes, we should flush them in order to handle the atomic data in the right period.

  Change-Id: I35611d9016330ef837554cff263bcbb10b4cc810
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: unset atomic/volatile flag in f2fs_release_file  (Jaegeuk Kim, 2016-10-29, 2 files, -3/+4)

  The atomic/volatile operation should be done in a pair of start and commit ioctls. For example, if a killed process leaves an open-ended atomic operation, we should drop its flag as well as its atomic data. Otherwise, if sqlite initiates another operation which doesn't require atomic writes, it will lose all of its data, since f2fs still treats it as atomic writes; nobody will trigger its commit.

  Change-Id: Ic97f7d88a1158e2f21f4bd5447870ff578641fb3
  Reported-by: Miao Xie <miaoxie@huawei.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: fix dropping inmemory pages in a wrong time  (Jaegeuk Kim, 2016-10-29, 1 file, -0/+8)

  When one reader closes its file while the other writer is doing atomic writes, f2fs_release_file drops atomic data, resulting in an empty commit. This patch fixes this wrong commit problem by checking the openness of the file.

    Process0                 Process1
    open file
    start atomic write
    write data
                             read data
                             close file
                             f2fs_release_file()
                               clear atomic data
    commit atomic write

  Change-Id: I99b90b569a56cb53bccf8758f870e0f49849c6fd
  Reported-by: Miao Xie <miaoxie@huawei.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: split sync_node_pages with fsync_node_pages  (Jaegeuk Kim, 2016-10-29, 5 files, -33/+84)

  This patch splits the existing sync_node_pages into (f)sync_node_pages. The fsync_node_pages is used for f2fs_sync_file only.

  Change-Id: I207b087a54f1a0c2e994a78cd6ed475578d7044e
  Acked-by: Chao Yu <yuchao0@huawei.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: avoid writing 0'th page in volatile writes  (Jaegeuk Kim, 2016-10-29, 1 file, -2/+4)

  The first page of volatile writes usually contains a sort of header information which will be used for recovery (e.g., the journal header of sqlite). If this is written without the other journal data, the user needs to handle the stale journal information.

  Change-Id: I85f4cfe4cbef32ed43b0f52d7328b42d411dd2da
  Acked-by: Chao Yu <yuchao0@huawei.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: avoid needless lock for node pages when fsyncing a file  (Jaegeuk Kim, 2016-10-29, 1 file, -3/+7)

  When fsync is called, sync_node_pages finds the proper direct node pages to flush. But it locks unrelated direct node pages together unnecessarily.

  Change-Id: I6adc83f2e6592aea707851ee6e365afcc0e36f92
  Acked-by: Chao Yu <yuchao0@huawei.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: fix deadlock when flush inline data  (Chao Yu, 2016-10-29, 1 file, -1/+2)

  The backtrace info below was reported by Yunlei He:

    Call Trace:
    [<ffffffff817a9395>] schedule+0x35/0x80
    [<ffffffff817abb7d>] rwsem_down_read_failed+0xed/0x130
    [<ffffffff813c12a8>] call_rwsem_down_read_failed+0x18/0x
    [<ffffffff817ab1d0>] down_read+0x20/0x30
    [<ffffffffa02a1a12>] f2fs_evict_inode+0x242/0x3a0 [f2fs]
    [<ffffffff81217057>] evict+0xc7/0x1a0
    [<ffffffff81217cd6>] iput+0x196/0x200
    [<ffffffff812134f9>] __dentry_kill+0x179/0x1e0
    [<ffffffff812136f9>] dput+0x199/0x1f0
    [<ffffffff811fe77b>] __fput+0x18b/0x220
    [<ffffffff811fe84e>] ____fput+0xe/0x10
    [<ffffffff81097427>] task_work_run+0x77/0x90
    [<ffffffff81074d62>] exit_to_usermode_loop+0x73/0xa2
    [<ffffffff81003b7a>] do_syscall_64+0xfa/0x110
    [<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25

    Call Trace:
    [<ffffffff817a9395>] schedule+0x35/0x80
    [<ffffffff81216dc3>] __wait_on_freeing_inode+0xa3/0xd0
    [<ffffffff810bc300>] ? autoremove_wake_function+0x40/0x4
    [<ffffffff8121771d>] find_inode_fast+0x7d/0xb0
    [<ffffffff8121794a>] ilookup+0x6a/0xd0
    [<ffffffffa02bc740>] sync_node_pages+0x210/0x650 [f2fs]
    [<ffffffff8122e690>] ? do_fsync+0x70/0x70
    [<ffffffffa02b085e>] block_operations+0x9e/0xf0 [f2fs]
    [<ffffffff8137b795>] ? bio_endio+0x55/0x60
    [<ffffffffa02b0942>] write_checkpoint+0x92/0xba0 [f2fs]
    [<ffffffff8117da57>] ? mempool_free_slab+0x17/0x20
    [<ffffffff8117de8b>] ? mempool_free+0x2b/0x80
    [<ffffffff8122e690>] ? do_fsync+0x70/0x70
    [<ffffffffa02a53e3>] f2fs_sync_fs+0x63/0xd0 [f2fs]
    [<ffffffff8129630f>] ? ext4_sync_fs+0xbf/0x190
    [<ffffffff8122e6b0>] sync_fs_one_sb+0x20/0x30
    [<ffffffff812002e9>] iterate_supers+0xb9/0x110
    [<ffffffff8122e7b5>] sys_sync+0x55/0x90
    [<ffffffff81003ae9>] do_syscall_64+0x69/0x110
    [<ffffffff817acf65>] entry_SYSCALL64_slow_path+0x25/0x25

  With the following execution sequence, we will set inline_node in the inode page after the inode was unlinked, resulting in the deadlock described below:

    1. open file
    2. write file
    3. unlink file
    4. write file
    5. close file

    Thread A                                  Thread B
    - dput
      - iput_final
        - inode->i_state |= I_FREEING
        - evict
          - f2fs_evict_inode
                                              - f2fs_sync_fs
                                                - write_checkpoint
                                                  - block_operations
                                                    - f2fs_lock_all (down_write(cp_rwsem))
            - f2fs_lock_op (down_read(cp_rwsem))
                                                    - sync_node_pages
                                                      - ilookup
                                                        - find_inode_fast
                                                          - __wait_on_freeing_inode (wait on I_FREEING clear)

  Here, we change to set the inline_node flag only for linked inodes to fix this.

  Change-Id: Ibf4326ecb4ba68e45e4e964092e1d2955341bc56
  Reported-by: Yunlei He <heyunlei@huawei.com>
  Signed-off-by: Chao Yu <yuchao0@huawei.com>
  Tested-by: Jaegeuk Kim <jaegeuk@kernel.org>
  Cc: stable@vger.kernel.org # v4.6
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* f2fs: fix to update dirty page count correctly  (Chao Yu, 2016-10-29, 1 file, -3/+4)

  Once we fail to merge inline data into the inode page during flushing an inline inode, we will skip invoking inode_dec_dirty_pages, which makes the dirty page count incorrect, resulting in a panic in ->evict_inode. Fix it.

    ------------[ cut here ]------------
    kernel BUG at /home/yuchao/git/devf2fs/inode.c:336!
    invalid opcode: 0000 [#1] PREEMPT SMP
    CPU: 3 PID: 10004 Comm: umount Tainted: G O 4.6.0-rc5+ #17
    Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    task: f0c33000 ti: c5212000 task.ti: c5212000
    EIP: 0060:[<f89aacb5>] EFLAGS: 00010202 CPU: 3
    EIP is at f2fs_evict_inode+0x85/0x490 [f2fs]
    EAX: 00000001 EBX: c4529ea0 ECX: 00000001 EDX: 00000000
    ESI: c0131000 EDI: f89dd0a0
    EBP: c5213e9c ESP: c5213e78
    DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    CR0: 80050033 CR2: b75878c0 CR3: 1a36a700 CR4: 000406f0
    Stack:
     c4529ea0 c4529ef4 c5213e8c c176d45c c4529ef4 00000000 c4529ea0 c4529fac
     f89dd0a0 c5213eb0 c1204a68 c5213ed8 c452a2b4 c6680930 c5213ec0 c1204b64
     c6680d44 c6680620 c5213eec c120588d ee84b000 ee84b5c0 c5214000 ee84b5e0
    Call Trace:
    [<c176d45c>] ? _raw_spin_unlock+0x2c/0x50
    [<c1204a68>] evict+0xa8/0x170
    [<c1204b64>] dispose_list+0x34/0x50
    [<c120588d>] evict_inodes+0x10d/0x130
    [<c11ea941>] generic_shutdown_super+0x41/0xe0
    [<c1185190>] ? unregister_shrinker+0x40/0x50
    [<c1185190>] ? unregister_shrinker+0x40/0x50
    [<c11eac52>] kill_block_super+0x22/0x70
    [<f89af23e>] kill_f2fs_super+0x1e/0x20 [f2fs]
    [<c11eae1d>] deactivate_locked_super+0x3d/0x70
    [<c11eb383>] deactivate_super+0x43/0x60
    [<c1208ec9>] cleanup_mnt+0x39/0x80
    [<c1208f50>] __cleanup_mnt+0x10/0x20
    [<c107d091>] task_work_run+0x71/0x90
    [<c105725a>] exit_to_usermode_loop+0x72/0x9e
    [<c1001c7c>] do_fast_syscall_32+0x19c/0x1c0
    [<c176dd48>] sysenter_past_esp+0x45/0x74
    EIP: [<f89aacb5>] f2fs_evict_inode+0x85/0x490 [f2fs] SS:ESP 0068:c5213e78
    ---[ end trace d30536330b7fdc58 ]---

  Change-Id: I68907a13e6ac726e54f5c2bbe219bc2c8400a558
  Signed-off-by: Chao Yu <yuchao0@huawei.com>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
* ext4/fscrypto: avoid RCU lookup in d_revalidate  (Jaegeuk Kim, 2016-10-29, 1 file, -0/+4)

  As Al pointed out, d_revalidate should return on RCU lookup before using d_inode. This behavior was originally introduced by commit 34286d666230 ("fs: rcu-walk aware d_revalidate method").

  Reported-by: Al Viro <viro@zeniv.linux.org.uk>
  Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
  Cc: Theodore Ts'o <tytso@mit.edu>
  Cc: stable <stable@vger.kernel.org>

  Conflicts:
  	fs/ext4/crypto.c