eeff6c78a31b6a34bd357a0a8302c8635e7c1908
16044 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2497402c9a |
time: adjtimex: Validate the ADJ_FREQUENCY values
commit 5e5aeb4367b450a28f447f6d5ab57d8f2ab16a5f upstream. Verify that the frequency value from userspace is valid and makes sense. Unverified values can cause overflows later on. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> [jstultz: Fix up bug for negative values and drop redunent cap check] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
7336dcc213 |
time: settimeofday: Validate the values of tv from user
commit 6ada1fc0e1c4775de0e043e1bd3ae9d065491aa5 upstream. An unvalidated user input is multiplied by a constant, which can result in an undefined behaviour for large values. While this is validated later, we should avoid triggering undefined behaviour. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> [jstultz: include trivial milisecond->microsecond correction noticed by Andy] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
d525563b50 |
perf: Fix events installation during moving group
commit 9fc81d87420d0d3fd62d5e5529972c0ad9eab9cc upstream. We allow PMU driver to change the cpu on which the event should be installed to. This happened in patch: |
||
|
|
12279351a4 |
exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
commit 24c037ebf5723d4d9ab0996433cee4f96c292a4d upstream. alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it fails because disable_pid_allocation() was called by the exiting child_reaper. We could simply move get_pid_ns() down to successful return, but this fix tries to be as trivial as possible. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
afb5285388 |
userns: Allow setting gid_maps without privilege when setgroups is disabled
commit 66d2f338ee4c449396b6f99f5e75cd18eb6df272 upstream. Now that setgroups can be disabled and not reenabled, setting gid_map without privielge can now be enabled when setgroups is disabled. This restores most of the functionality that was lost when unprivileged setting of gid_map was removed. Applications that use this functionality will need to check to see if they use setgroups or init_groups, and if they don't they can be fixed by simply disabling setgroups before writing to gid_map. Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
1c587ee50e |
userns: Add a knob to disable setgroups on a per user namespace basis
commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream. - Expose the knob to user space through a proc file /proc/<pid>/setgroups A value of "deny" means the setgroups system call is disabled in the current processes user namespace and can not be enabled in the future in this user namespace. A value of "allow" means the segtoups system call is enabled. - Descendant user namespaces inherit the value of setgroups from their parents. - A proc file is used (instead of a sysctl) as sysctls currently do not allow checking the permissions at open time. - Writing to the proc file is restricted to before the gid_map for the user namespace is set. This ensures that disabling setgroups at a user namespace level will never remove the ability to call setgroups from a process that already has that ability. A process may opt in to the setgroups disable for itself by creating, entering and configuring a user namespace or by calling setns on an existing user namespace with setgroups disabled. Processes without privileges already can not call setgroups so this is a noop. Prodcess with privilege become processes without privilege when entering a user namespace and as with any other path to dropping privilege they would not have the ability to call setgroups. So this remains within the bounds of what is possible without a knob to disable setgroups permanently in a user namespace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
6eaab78729 |
userns: Rename id_map_mutex to userns_state_mutex
commit f0d62aec931e4ae3333c797d346dc4f188f454ba upstream. Generalize id_map_mutex so it can be used for more state of a user namespace. Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
a1821391e5 |
userns: Only allow the creator of the userns unprivileged mappings
commit f95d7918bd1e724675de4940039f2865e5eec5fe upstream. If you did not create the user namespace and are allowed to write to uid_map or gid_map you should already have the necessary privilege in the parent user namespace to establish any mapping you want so this will not affect userspace in practice. Limiting unprivileged uid mapping establishment to the creator of the user namespace makes it easier to verify all credentials obtained with the uid mapping can be obtained without the uid mapping without privilege. Limiting unprivileged gid mapping establishment (which is temporarily absent) to the creator of the user namespace also ensures that the combination of uid and gid can already be obtained without privilege. This is part of the fix for CVE-2014-8989. Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
f028f2d732 |
userns: Check euid no fsuid when establishing an unprivileged uid mapping
commit 80dd00a23784b384ccea049bfb3f259d3f973b9d upstream. setresuid allows the euid to be set to any of uid, euid, suid, and fsuid. Therefor it is safe to allow an unprivileged user to map their euid and use CAP_SETUID privileged with exactly that uid, as no new credentials can be obtained. I can not find a combination of existing system calls that allows setting uid, euid, suid, and fsuid from the fsuid making the previous use of fsuid for allowing unprivileged mappings a bug. This is part of a fix for CVE-2014-8989. Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
ba0922adbd |
userns: Don't allow unprivileged creation of gid mappings
commit be7c6dba2332cef0677fbabb606e279ae76652c3 upstream. As any gid mapping will allow and must allow for backwards compatibility dropping groups don't allow any gid mappings to be established without CAP_SETGID in the parent user namespace. For a small class of applications this change breaks userspace and removes useful functionality. This small class of applications includes tools/testing/selftests/mount/unprivilged-remount-test.c Most of the removed functionality will be added back with the addition of a one way knob to disable setgroups. Once setgroups is disabled setting the gid_map becomes as safe as setting the uid_map. For more common applications that set the uid_map and the gid_map with privilege this change will have no affect. This is part of a fix for CVE-2014-8989. Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
fc9b65e3d7 |
userns: Don't allow setgroups until a gid mapping has been setablished
commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream. setgroups is unique in not needing a valid mapping before it can be called, in the case of setgroups(0, NULL) which drops all supplemental groups. The design of the user namespace assumes that CAP_SETGID can not actually be used until a gid mapping is established. Therefore add a helper function to see if the user namespace gid mapping has been established and call that function in the setgroups permission check. This is part of the fix for CVE-2014-8989, being able to drop groups without privilege using user namespaces. Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
b8a0441b54 |
userns: Document what the invariant required for safe unprivileged mappings.
commit 0542f17bf2c1f2430d368f44c8fcf2f82ec9e53e upstream. The rule is simple. Don't allow anything that wouldn't be allowed without unprivileged mappings. It was previously overlooked that establishing gid mappings would allow dropping groups and potentially gaining permission to files and directories that had lesser permissions for a specific group than for all other users. This is the rule needed to fix CVE-2014-8989 and prevent any other security issues with new_idmap_permitted. The reason for this rule is that the unix permission model is old and there are programs out there somewhere that take advantage of every little corner of it. So allowing a uid or gid mapping to be established without privielge that would allow anything that would not be allowed without that mapping will result in expectations from some code somewhere being violated. Violated expectations about the behavior of the OS is a long way to say a security issue. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
4accc8c8e2 |
groups: Consolidate the setgroups permission checks
commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream. Today there are 3 instances of setgroups and due to an oversight their permission checking has diverged. Add a common function so that they may all share the same permission checking code. This corrects the current oversight in the current permission checks and adds a helper to avoid this in the future. A user namespace security fix will update this new helper, shortly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
0061b518b1 |
uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME
commit 82975bc6a6df743b9a01810fb32cb65d0ec5d60b upstream. x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but not on non-paranoid returns. I suspect that this is a mistake and that the code only works because int3 is paranoid. Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME from the uprobes code. Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
858879737b |
perf: Handle compat ioctl
commit b3f207855f57b9c8f43a547a801340bb5cbc59e5 upstream. When running a 32-bit userspace on a 64-bit kernel (eg. i386 application on x86_64 kernel or 32-bit arm userspace on arm64 kernel) some of the perf ioctls must be treated with special care, as they have a pointer size encoded in the command. For example, PERF_EVENT_IOC_ID in 32-bit world will be encoded as 0x80042407, but 64-bit kernel will expect 0x80082407. In result the ioctl will fail returning -ENOTTY. This patch solves the problem by adding code fixing up the size as compat_ioctl file operation. Reported-by: Drew Richardson <drew.richardson@arm.com> Signed-off-by: Pawel Moll <pawel.moll@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/1402671812-9078-1-git-send-email-pawel.moll@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: David Ahern <daahern@cisco.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
bd501a2eb2 |
audit: keep inode pinned
commit 799b601451b21ebe7af0e6e8f6e2ccd4683c5064 upstream.
Audit rules disappear when an inode they watch is evicted from the cache.
This is likely not what we want.
The guilty commit is "fsnotify: allow marks to not pin inodes in core",
which didn't take into account that audit_tree adds watches with a zero
mask.
Adding any mask should fix this.
Fixes:
|
||
|
|
f9b6264a0f |
posix-timers: Fix stack info leak in timer_create()
commit 6891c4509c792209c44ced55a60f13954cb50ef4 upstream.
If userland creates a timer without specifying a sigevent info, we'll
create one ourself, using a stack local variable. Particularly will we
use the timer ID as sival_int. But as sigev_value is a union containing
a pointer and an int, that assignment will only partially initialize
sigev_value on systems where the size of a pointer is bigger than the
size of an int. On such systems we'll copy the uninitialized stack bytes
from the timer_create() call to userland when the timer actually fires
and we're going to deliver the signal.
Initialize sigev_value with 0 to plug the stack info leak.
Found in the PaX patch, written by the PaX Team.
Fixes:
|
||
|
|
1eaaef74ad |
PM / Sleep: fix recovery during resuming from hibernation
commit 94fb823fcb4892614f57e59601bb9d4920f24711 upstream. If a device's dev_pm_ops::freeze callback fails during the QUIESCE phase, we don't rollback things correctly calling the thaw and complete callbacks. This could leave some devices in a suspended state in case of an error during resuming from hibernation. Signed-off-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
e033782a26 |
OOM, PM: OOM killed task shouldn't escape PM suspend
commit 5695be142e203167e3cb515ef86a88424f3524eb upstream.
PM freezer relies on having all tasks frozen by the time devices are
getting frozen so that no task will touch them while they are getting
frozen. But OOM killer is allowed to kill an already frozen task in
order to handle OOM situtation. In order to protect from late wake ups
OOM killer is disabled after all tasks are frozen. This, however, still
keeps a window open when a killed task didn't manage to die by the time
freeze_processes finishes.
Reduce the race window by checking all tasks after OOM killer has been
disabled. This is still not race free completely unfortunately because
oom_killer_disable cannot stop an already ongoing OOM killer so a task
might still wake up from the fridge and get killed without
freeze_processes noticing. Full synchronization of OOM and freezer is,
however, too heavy weight for this highly unlikely case.
Introduce and check oom_kills counter which gets incremented early when
the allocator enters __alloc_pages_may_oom path and only check all the
tasks if the counter changes during the freezing attempt. The counter
is updated so early to reduce the race window since allocator checked
oom_killer_disabled which is set by PM-freezing code. A false positive
will push the PM-freezer into a slow path but that is not a big deal.
Changes since v1
- push the re-check loop out of freeze_processes into
check_frozen_processes and invert the condition to make the code more
readable as per Rafael
Fixes:
|
||
|
|
f95ad6ed20 |
freezer: Do not freeze tasks killed by OOM killer
commit 51fae6da640edf9d266c94f36bc806c63c301991 upstream. Since |
||
|
|
5ade169530 |
modules, lock around setting of MODULE_STATE_UNFORMED
commit d3051b489aa81ca9ba62af366149ef42b8dae97c upstream. A panic was seen in the following sitation. There are two threads running on the system. The first thread is a system monitoring thread that is reading /proc/modules. The second thread is loading and unloading a module (in this example I'm using my simple dummy-module.ko). Note, in the "real world" this occurred with the qlogic driver module. When doing this, the following panic occurred: ------------[ cut here ]------------ kernel BUG at kernel/module.c:3739! invalid opcode: 0000 [#1] SMP Modules linked in: binfmt_misc sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw igb gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ptp sb_edac cryptd pps_core edac_core shpchp i2c_i801 pcspkr wmi lpc_ich ioatdma mfd_core dca ipmi_si nfsd ipmi_msghandler auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_module] CPU: 37 PID: 186343 Comm: cat Tainted: GF O-------------- 3.10.0+ #7 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013 task: ffff8807fd2d8000 ti: ffff88080fa7c000 task.ti: ffff88080fa7c000 RIP: 0010:[<ffffffff810d64c5>] [<ffffffff810d64c5>] module_flags+0xb5/0xc0 RSP: 0018:ffff88080fa7fe18 EFLAGS: 00010246 RAX: 0000000000000003 RBX: ffffffffa03b5200 RCX: 0000000000000000 RDX: 0000000000001000 RSI: ffff88080fa7fe38 RDI: ffffffffa03b5000 RBP: ffff88080fa7fe28 R08: 0000000000000010 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffffffffa03b5000 R13: ffffffffa03b5008 R14: ffffffffa03b5200 R15: ffffffffa03b5000 FS: 00007f6ae57ef740(0000) GS:ffff88101e7a0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000404f70 CR3: 0000000ffed48000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffffffffa03b5200 ffff8810101e4800 ffff88080fa7fe70 ffffffff810d666c ffff88081e807300 000000002e0f2fbf 0000000000000000 ffff88100f257b00 ffffffffa03b5008 ffff88080fa7ff48 ffff8810101e4800 ffff88080fa7fee0 Call Trace: [<ffffffff810d666c>] m_show+0x19c/0x1e0 [<ffffffff811e4d7e>] seq_read+0x16e/0x3b0 [<ffffffff812281ed>] proc_reg_read+0x3d/0x80 [<ffffffff811c0f2c>] vfs_read+0x9c/0x170 [<ffffffff811c1a58>] SyS_read+0x58/0xb0 [<ffffffff81605829>] system_call_fastpath+0x16/0x1b Code: 48 63 c2 83 c2 01 c6 04 03 29 48 63 d2 eb d9 0f 1f 80 00 00 00 00 48 63 d2 c6 04 13 2d 41 8b 0c 24 8d 50 02 83 f9 01 75 b2 eb cb <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 RIP [<ffffffff810d64c5>] module_flags+0xb5/0xc0 RSP <ffff88080fa7fe18> Consider the two processes running on the system. CPU 0 (/proc/modules reader) CPU 1 (loading/unloading module) CPU 0 opens /proc/modules, and starts displaying data for each module by traversing the modules list via fs/seq_file.c:seq_open() and fs/seq_file.c:seq_read(). For each module in the modules list, seq_read does op->start() <-- this is a pointer to m_start() op->show() <- this is a pointer to m_show() op->stop() <-- this is a pointer to m_stop() The m_start(), m_show(), and m_stop() module functions are defined in kernel/module.c. The m_start() and m_stop() functions acquire and release the module_mutex respectively. ie) When reading /proc/modules, the module_mutex is acquired and released for each module. m_show() is called with the module_mutex held. It accesses the module struct data and attempts to write out module data. It is in this code path that the above BUG_ON() warning is encountered, specifically m_show() calls static char *module_flags(struct module *mod, char *buf) { int bx = 0; BUG_ON(mod->state == MODULE_STATE_UNFORMED); ... The other thread, CPU 1, in unloading the module calls the syscall delete_module() defined in kernel/module.c. The module_mutex is acquired for a short time, and then released. free_module() is called without the module_mutex. free_module() then sets mod->state = MODULE_STATE_UNFORMED, also without the module_mutex. Some additional code is called and then the module_mutex is reacquired to remove the module from the modules list: /* Now we can delete it from the lists */ mutex_lock(&module_mutex); stop_machine(__unlink_module, mod, NULL); mutex_unlock(&module_mutex); This is the sequence of events that leads to the panic. CPU 1 is removing dummy_module via delete_module(). It acquires the module_mutex, and then releases it. CPU 1 has NOT set dummy_module->state to MODULE_STATE_UNFORMED yet. CPU 0, which is reading the /proc/modules, acquires the module_mutex and acquires a pointer to the dummy_module which is still in the modules list. CPU 0 calls m_show for dummy_module. The check in m_show() for MODULE_STATE_UNFORMED passed for dummy_module even though it is being torn down. Meanwhile CPU 1, which has been continuing to remove dummy_module without holding the module_mutex, now calls free_module() and sets dummy_module->state to MODULE_STATE_UNFORMED. CPU 0 now calls module_flags() with dummy_module and ... static char *module_flags(struct module *mod, char *buf) { int bx = 0; BUG_ON(mod->state == MODULE_STATE_UNFORMED); and BOOM. Acquire and release the module_mutex lock around the setting of MODULE_STATE_UNFORMED in the teardown path, which should resolve the problem. Testing: In the unpatched kernel I can panic the system within 1 minute by doing while (true) do insmod dummy_module.ko; rmmod dummy_module.ko; done and while (true) do cat /proc/modules; done in separate terminals. In the patched kernel I was able to run just over one hour without seeing any issues. I also verified the output of panic via sysrq-c and the output of /proc/modules looks correct for all three states for the dummy_module. dummy_module 12661 0 - Unloading 0xffffffffa03a5000 (OE-) dummy_module 12661 0 - Live 0xffffffffa03bb000 (OE) dummy_module 14015 1 - Loading 0xffffffffa03a5000 (OE+) Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
3ad3add775 |
tracing/syscalls: Ignore numbers outside NR_syscalls' range
commit 086ba77a6db00ed858ff07451bedee197df868c9 upstream.
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls. If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.
# trace-cmd record -e raw_syscalls:* true && trace-cmd report
...
true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
true-653 [000] 384.675988: sys_exit: NR 983045 = 0
...
# trace-cmd record -e syscalls:* true
[ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
[ 17.289590] pgd = 9e71c000
[ 17.289696] [aaaaaace] *pgd=00000000
[ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 17.290169] Modules linked in:
[ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
[ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
[ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
[ 17.290866] LR is at syscall_trace_enter+0x124/0x184
Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
Commit
|
||
|
|
00790d4526 |
jiffies: Fix timeval conversion to jiffies
commit d78c9300c51d6ceed9f6d078d4e9366f259de28c upstream.
timeval_to_jiffies tried to round a timeval up to an integral number
of jiffies, but the logic for doing so was incorrect: intervals
corresponding to exactly N jiffies would become N+1. This manifested
itself particularly repeatedly stopping/starting an itimer:
setitimer(ITIMER_PROF, &val, NULL);
setitimer(ITIMER_PROF, NULL, &val);
would add a full tick to val, _even if it was exactly representable in
terms of jiffies_ (say, the result of a previous rounding.) Doing
this repeatedly would cause unbounded growth in val. So fix the math.
Here's what was wrong with the conversion: we essentially computed
(eliding seconds)
jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)
by using scaling arithmetic, which took the best approximation of
NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC =
x/(2^USEC_JIFFIE_SC), and computed:
jiffies = (usec * x) >> USEC_JIFFIE_SC
and rounded this calculation up in the intermediate form (since we
can't necessarily exactly represent TICK_NSEC in usec.) But the
scaling arithmetic is a (very slight) *over*approximation of the true
value; that is, instead of dividing by (1 usec/ 1 jiffie), we
effectively divided by (1 usec/1 jiffie)-epsilon (rounding
down). This would normally be fine, but we want to round timeouts up,
and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this
would be fine if our division was exact, but dividing this by the
slightly smaller factor was equivalent to adding just _over_ 1 to the
final result (instead of just _under_ 1, as desired.)
In particular, with HZ=1000, we consistently computed that 10000 usec
was 11 jiffies; the same was true for any exact multiple of
TICK_NSEC.
We could possibly still round in the intermediate form, adding
something less than 2^USEC_JIFFIE_SC - 1, but easier still is to
convert usec->nsec, round in nanoseconds, and then convert using
time*spec*_to_jiffies. This adds one constant multiplication, and is
not observably slower in microbenchmarks on recent x86 hardware.
Tested: the following program:
int main() {
struct itimerval zero = {{0, 0}, {0, 0}};
/* Initially set to 10 ms. */
struct itimerval initial = zero;
initial.it_interval.tv_usec = 10000;
setitimer(ITIMER_PROF, &initial, NULL);
/* Save and restore several times. */
for (size_t i = 0; i < 10; ++i) {
struct itimerval prev;
setitimer(ITIMER_PROF, &zero, &prev);
/* on old kernels, this goes up by TICK_USEC every iteration */
printf("previous value: %ld %ld %ld %ld\n",
prev.it_interval.tv_sec, prev.it_interval.tv_usec,
prev.it_value.tv_sec, prev.it_value.tv_usec);
setitimer(ITIMER_PROF, &prev, NULL);
}
return 0;
}
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paul Turner <pjt@google.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Reviewed-by: Paul Turner <pjt@google.com>
Reported-by: Aaron Jacobs <jacobsa@google.com>
Signed-off-by: Andrew Hunter <ahh@google.com>
[jstultz: Tweaked to apply to 3.17-rc]
Signed-off-by: John Stultz <john.stultz@linaro.org>
[bwh: Backported to 3.16: adjust filename]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
78a3db11cb |
ring-buffer: Fix infinite spin in reading buffer
commit 24607f114fd14f2f37e3e0cb3d47bce96e81e848 upstream. Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page" fixed one bug but in the process caused another one. The reset is to update the header page, but that fix also changed the way the cached reads were updated. The cache reads are used to test if an iterator needs to be updated or not. A ring buffer iterator, when created, disables writes to the ring buffer but does not stop other readers or consuming reads from happening. Although all readers are synchronized via a lock, they are only synchronized when in the ring buffer functions. Those functions may be called by any number of readers. The iterator continues down when its not interrupted by a consuming reader. If a consuming read occurs, the iterator starts from the beginning of the buffer. The way the iterator sees that a consuming read has happened since its last read is by checking the reader "cache". The cache holds the last counts of the read and the reader page itself. Commit 651e22f2701b changed what was saved by the cache_read when the rb_iter_reset() occurred, making the iterator never match the cache. Then if the iterator calls rb_iter_reset(), it will go into an infinite loop by checking if the cache doesn't match, doing the reset and retrying, just to see that the cache still doesn't match! Which should never happen as the reset is suppose to set the cache to the current value and there's locks that keep a consuming reader from having access to the data. Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
bee870fc1a |
perf: fix perf bug in fork()
commit 6c72e3501d0d62fc064d3680e5234f3463ec5a86 upstream. Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by calling perf_event_free_task() when failing sched_fork() we will not yet have done the memset() on ->perf_event_ctxp[] and will therefore try and 'free' the inherited contexts, which are still in use by the parent process. This is bad.. Suggested-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
641bc58deb |
introduce for_each_thread() to replace the buggy while_each_thread()
commit 0c740d0afc3bff0a097ad03a1c8df92757516f5c upstream. while_each_thread() and next_thread() should die, almost every lockless usage is wrong. 1. Unless g == current, the lockless while_each_thread() is not safe. while_each_thread(g, t) can loop forever if g exits, next_thread() can't reach the unhashed thread in this case. Note that this can happen even if g is the group leader, it can exec. 2. Even if while_each_thread() itself was correct, people often use it wrongly. It was never safe to just take rcu_read_lock() and loop unless you verify that pid_alive(g) == T, even the first next_thread() can point to the already freed/reused memory. This patch adds signal_struct->thread_head and task->thread_node to create the normal rcu-safe list with the stable head. The new for_each_thread(g, t) helper is always safe under rcu_read_lock() as long as this task_struct can't go away. Note: of course it is ugly to have both task_struct->thread_node and the old task_struct->thread_group, we will kill it later, after we change the users of while_each_thread() to use for_each_thread(). Perhaps we can kill it even before we convert all users, we can reimplement next_thread(t) using the new thread_head/thread_node. But we can't do this right now because this will lead to subtle behavioural changes. For example, do/while_each_thread() always sees at least one task, while for_each_thread() can do nothing if the whole thread group has died. Or thread_group_empty(), currently its semantics is not clear unless thread_group_leader(p) and we need to audit the callers before we can change it. So this patch adds the new interface which has to coexist with the old one for some time, hopefully the next changes will be more or less straightforward and the old one will go away soon. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Sergey Dyasly <dserrg@gmail.com> Tested-by: Sergey Dyasly <dserrg@gmail.com> Reviewed-by: Sameer Nanda <snanda@chromium.org> Acked-by: David Rientjes <rientjes@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: "Ma, Xindong" <xindong.ma@intel.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: "Tu, Xiaobing" <xiaobing.tu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
beed61068f |
kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code
commit 80628ca06c5d42929de6bc22c0a41589a834d151 upstream. Cleanup and preparation for the next changes. Move the "if (clone_flags & CLONE_THREAD)" code down under "if (likely(p->pid))" and turn it into into the "else" branch. This makes the process/thread initialization more symmetrical and removes one check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
5c95fef642 |
PM / sleep: Use valid_state() for platform-dependent sleep states only
commit 43e8317b0bba1d6eb85f38a4a233d82d7c20d732 upstream. Use the observation that, for platform-dependent sleep states (PM_SUSPEND_STANDBY, PM_SUSPEND_MEM), a given state is either always supported or always unsupported and store that information in pm_states[] instead of calling valid_state() every time we need to check it. Also do not use valid_state() for PM_SUSPEND_FREEZE, which is always valid, and move the pm_test_level validity check for PM_SUSPEND_FREEZE directly into enter_state(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
e89547b87a |
PM / sleep: Add state field to pm_states[] entries
commit 27ddcc6596e50cb8f03d2e83248897667811d8f6 upstream. To allow sleep states corresponding to the "mem", "standby" and "freeze" lables to be different from the pm_states[] indexes of those strings, introduce struct pm_sleep_state, consisting of a string label and a state number, and turn pm_states[] into an array of objects of that type. This modification should not lead to any functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Brian Norris <computersforpeace@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
a63607b953 |
perf: Fix a race condition in perf_remove_from_context()
commit 3577af70a2ce4853d58e57d832e687d739281479 upstream. We saw a kernel soft lockup in perf_remove_from_context(), it looks like the `perf` process, when exiting, could not go out of the retry loop. Meanwhile, the target process was forking a child. So either the target process should execute the smp function call to deactive the event (if it was running) or it should do a context switch which deactives the event. It seems we optimize out a context switch in perf_event_context_sched_out(), and what's more important, we still test an obsolete task pointer when retrying, so no one actually would deactive that event in this situation. Fix it directly by reloading the task pointer in perf_remove_from_context(). This should cure the above soft lockup. Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409696840-843-1-git-send-email-xiyou.wangcong@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
3c47864204 |
alarmtimer: Lock k_itimer during timer callback
commit 474e941bed9262f5fa2394f9a4a67e24499e5926 upstream. Locks the k_itimer's it_lock member when handling the alarm timer's expiry callback. The regular posix timers defined in posix-timers.c have this lock held during timout processing because their callbacks are routed through posix_timer_fn(). The alarm timers follow a different path, so they ought to grab the lock somewhere else. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by: Richard Larocque <rlarocque@google.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
5cebda5d05 |
alarmtimer: Do not signal SIGEV_NONE timers
commit 265b81d23a46c39df0a735a3af4238954b41a4c2 upstream. Avoids sending a signal to alarm timers created with sigev_notify set to SIGEV_NONE by checking for that special case in the timeout callback. The regular posix timers avoid sending signals to SIGEV_NONE timers by not scheduling any callbacks for them in the first place. Although it would be possible to do something similar for alarm timers, it's simpler to handle this as a special case in the timeout. Prior to this patch, the alarm timer would ignore the sigev_notify value and try to deliver signals to the process anyway. Even worse, the sanity check for the value of sigev_signo is skipped when SIGEV_NONE was specified, so the signal number could be bogus. If sigev_signo was an unitialized value (as it often would be if SIGEV_NONE is used), then it's hard to predict which signal will be sent. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by: Richard Larocque <rlarocque@google.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
2b952aeed3 |
kcmp: fix standard comparison bug
commit acbbe6fbb240a927ee1f5994f04d31267d422215 upstream.
The C operator <= defines a perfectly fine total ordering on the set of
values representable in a long. However, unlike its namesake in the
integers, it is not translation invariant, meaning that we do not have
"b <= c" iff "a+b <= a+c" for all a,b,c.
This means that it is always wrong to try to boil down the relationship
between two longs to a question about the sign of their difference,
because the resulting relation [a LEQ b iff a-b <= 0] is neither
anti-symmetric or transitive. The former is due to -LONG_MIN==LONG_MIN
(take any two a,b with a-b = LONG_MIN; then a LEQ b and b LEQ a, but a !=
b). The latter can either be seen observing that x LEQ x+1 for all x,
implying x LEQ x+1 LEQ x+2 ... LEQ x-1 LEQ x; or more directly with the
simple example a=LONG_MIN, b=0, c=1, for which a-b < 0, b-c < 0, but a-c >
0.
Note that it makes absolutely no difference that a transmogrying bijection
has been applied before the comparison is done. In fact, had the
obfuscation not been done, one could probably not observe the bug
(assuming all values being compared always lie in one half of the address
space, the mathematical value of a-b is always representable in a long).
As it stands, one can easily obtain three file descriptors exhibiting the
non-transitivity of kcmp().
Side note 1: I can't see that ensuring the MSB of the multiplier is
set serves any purpose other than obfuscating the obfuscating code.
Side note 2:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <sys/syscall.h>
enum kcmp_type {
KCMP_FILE,
KCMP_VM,
KCMP_FILES,
KCMP_FS,
KCMP_SIGHAND,
KCMP_IO,
KCMP_SYSVSEM,
KCMP_TYPES,
};
pid_t pid;
int kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2)
{
return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
}
int cmp_fd(int fd1, int fd2)
{
int c = kcmp(pid, pid, KCMP_FILE, fd1, fd2);
if (c < 0) {
perror("kcmp");
exit(1);
}
assert(0 <= c && c < 3);
return c;
}
int cmp_fdp(const void *a, const void *b)
{
static const int normalize[] = {0, -1, 1};
return normalize[cmp_fd(*(int*)a, *(int*)b)];
}
#define MAX 100 /* This is plenty; I've seen it trigger for MAX==3 */
int main(int argc, char *argv[])
{
int r, s, count = 0;
int REL[3] = {0,0,0};
int fd[MAX];
pid = getpid();
while (count < MAX) {
r = open("/dev/null", O_RDONLY);
if (r < 0)
break;
fd[count++] = r;
}
printf("opened %d file descriptors\n", count);
for (r = 0; r < count; ++r) {
for (s = r+1; s < count; ++s) {
REL[cmp_fd(fd[r], fd[s])]++;
}
}
printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
qsort(fd, count, sizeof(fd[0]), cmp_fdp);
memset(REL, 0, sizeof(REL));
for (r = 0; r < count; ++r) {
for (s = r+1; s < count; ++s) {
REL[cmp_fd(fd[r], fd[s])]++;
}
}
printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
return (REL[0] + REL[2] != 0);
}
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
8c1bf5dcfe |
trace: Fix epoll hang when we race with new entries
commit 4ce97dbf50245227add17c83d87dc838e7ca79d0 upstream. Epoll on trace_pipe can sometimes hang in a weird case. If the ring buffer is empty when we set waiters_pending but an event shows up exactly at that moment we can miss being woken up by the ring buffers irq work. Since ring_buffer_empty() is inherently racey we will sometimes think that the buffer is not empty. So we don't get woken up and we don't think there are any events even though there were some ready when we added the watch, which makes us hang. This patch fixes this by making sure that we are actually on the wait list before we set waiters_pending, and add a memory barrier to make sure ring_buffer_empty() is going to be correct. Link: http://lkml.kernel.org/p/1408989581-23727-1-git-send-email-jbacik@fb.com Cc: Martin Lau <kafai@fb.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
7f70b62ea0 |
ring-buffer: Up rb_iter_peek() loop count to 3
commit 021de3d904b88b1771a3a2cfc5b75023c391e646 upstream.
After writting a test to try to trigger the bug that caused the
ring buffer iterator to become corrupted, I hit another bug:
WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
Modules linked in: ipt_MASQUERADE sunrpc [...]
CPU: 1 PID: 5281 Comm: grep Tainted: G W 3.16.0-rc3-test+ #143
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
Call Trace:
[<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
[<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
[<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
[<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
[<ffffffff810c74a3>] ? s_start+0xd7/0x17b
[<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
[<ffffffff8114cf94>] ? seq_read+0x148/0x361
[<ffffffff81132d98>] ? vfs_read+0x93/0xf1
[<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
[<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
Debugging this bug, which triggers when the rb_iter_peek() loops too
many times (more than 2 times), I discovered there's a case that can
cause that function to legitimately loop 3 times!
rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
only deals with the reader page (it's for consuming reads). The
rb_iter_peek() is for traversing the buffer without consuming it, and as
such, it can loop for one more reason. That is, if we hit the end of
the reader page or any page, it will go to the next page and try again.
That is, we have this:
1. iter->head > iter->head_page->page->commit
(rb_inc_iter() which moves the iter to the next page)
try again
2. event = rb_iter_head_event()
event->type_len == RINGBUF_TYPE_TIME_EXTEND
rb_advance_iter()
try again
3. read the event.
But we never get to 3, because the count is greater than 2 and we
cause the WARNING and return NULL.
Up the counter to 3.
Fixes:
|
||
|
|
814aa5addf |
ring-buffer: Always reset iterator to reader page
commit 651e22f2701b4113989237c3048d17337dd2185c upstream.
When performing a consuming read, the ring buffer swaps out a
page from the ring buffer with a empty page and this page that
was swapped out becomes the new reader page. The reader page
is owned by the reader and since it was swapped out of the ring
buffer, writers do not have access to it (there's an exception
to that rule, but it's out of scope for this commit).
When reading the "trace" file, it is a non consuming read, which
means that the data in the ring buffer will not be modified.
When the trace file is opened, a ring buffer iterator is allocated
and writes to the ring buffer are disabled, such that the iterator
will not have issues iterating over the data.
Although the ring buffer disabled writes, it does not disable other
reads, or even consuming reads. If a consuming read happens, then
the iterator is reset and starts reading from the beginning again.
My tests would sometimes trigger this bug on my i386 box:
WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
Modules linked in:
CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
Call Trace:
[<c18796b3>] dump_stack+0x4b/0x75
[<c103a0e3>] warn_slowpath_common+0x7e/0x95
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
[<c103a185>] warn_slowpath_fmt+0x33/0x35
[<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
[<c10bed04>] trace_find_cmdline+0x40/0x64
[<c10c3c16>] trace_print_context+0x27/0xec
[<c10c4360>] ? trace_seq_printf+0x37/0x5b
[<c10c0b15>] print_trace_line+0x319/0x39b
[<c10ba3fb>] ? ring_buffer_read+0x47/0x50
[<c10c13b1>] s_show+0x192/0x1ab
[<c10bfd9a>] ? s_next+0x5a/0x7c
[<c112e76e>] seq_read+0x267/0x34c
[<c1115a25>] vfs_read+0x8c/0xef
[<c112e507>] ? seq_lseek+0x154/0x154
[<c1115ba2>] SyS_read+0x54/0x7f
[<c188488e>] syscall_call+0x7/0xb
---[ end trace 3f507febd6b4cc83 ]---
>>>> ##### CPU 1 buffer started ####
Which was the __trace_find_cmdline() function complaining about the pid
in the event record being negative.
After adding more test cases, this would trigger more often. Strangely
enough, it would never trigger on a single test, but instead would trigger
only when running all the tests. I believe that was the case because it
required one of the tests to be shutting down via delayed instances while
a new test started up.
After spending several days debugging this, I found that it was caused by
the iterator becoming corrupted. Debugging further, I found out why
the iterator became corrupted. It happened with the rb_iter_reset().
As consuming reads may not read the full reader page, and only part
of it, there's a "read" field to know where the last read took place.
The iterator, must also start at the read position. In the rb_iter_reset()
code, if the reader page was disconnected from the ring buffer, the iterator
would start at the head page within the ring buffer (where writes still
happen). But the mistake there was that it still used the "read" field
to start the iterator on the head page, where it should always start
at zero because readers never read from within the ring buffer where
writes occur.
I originally wrote a patch to have it set the iter->head to 0 instead
of iter->head_page->read, but then I questioned why it wasn't always
setting the iter to point to the reader page, as the reader page is
still valid. The list_empty(reader_page->list) just means that it was
successful in swapping out. But the reader_page may still have data.
There was a bug report a long time ago that was not reproducible that
had something about trace_pipe (consuming read) not matching trace
(iterator read). This may explain why that happened.
Anyway, the correct answer to this bug is to always use the reader page
an not reset the iterator to inside the writable ring buffer.
Fixes:
|
||
|
|
a8fd51944e |
kernel/smp.c:on_each_cpu_cond(): fix warning in fallback path
commit 618fde872163e782183ce574c77f1123e2be8887 upstream. The rarely-executed memry-allocation-failed callback path generates a WARN_ON_ONCE() when smp_call_function_single() succeeds. Presumably it's supposed to warn on failures. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@gentwo.org> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <htejun@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
76f01555c7 |
CAPABILITIES: remove undefined caps from all processes
commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream.
This is effectively a revert of
|
||
|
|
562eebeb9c |
timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
commit 504d58745c9ca28d33572e2d8a9990b43e06075d upstream. clockevents_increase_min_delta() calls printk() from under hrtimer_bases.lock. That causes lock inversion on scheduler locks because printk() can call into the scheduler. Lockdep puts it as: ====================================================== [ INFO: possible circular locking dependency detected ] 3.15.0-rc8-06195-g939f04b #2 Not tainted ------------------------------------------------------- trinity-main/74 is trying to acquire lock: (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c but task is already holding lock: (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (hrtimer_bases.lock){-.-...}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197 [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85 [<81080792>] task_clock_event_start+0x3a/0x3f [<810807a4>] task_clock_event_add+0xd/0x14 [<8108259a>] event_sched_in+0xb6/0x17a [<810826a2>] group_sched_in+0x44/0x122 [<81082885>] ctx_sched_in.isra.67+0x105/0x11f [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b [<81082bf6>] __perf_install_in_context+0x8b/0xa3 [<8107eb8e>] remote_function+0x12/0x2a [<8105f5af>] smp_call_function_single+0x2d/0x53 [<8107e17d>] task_function_call+0x30/0x36 [<8107fb82>] perf_install_in_context+0x87/0xbb [<810852c9>] SYSC_perf_event_open+0x5c6/0x701 [<810856f9>] SyS_perf_event_open+0x17/0x19 [<8142f8ee>] syscall_call+0x7/0xb -> #4 (&ctx->lock){......}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f04c>] _raw_spin_lock+0x21/0x30 [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f [<8142cacc>] __schedule+0x4c6/0x4cb [<8142cae0>] schedule+0xf/0x11 [<8142f9a6>] work_resched+0x5/0x30 -> #3 (&rq->lock){-.-.-.}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f04c>] _raw_spin_lock+0x21/0x30 [<81040873>] __task_rq_lock+0x33/0x3a [<8104184c>] wake_up_new_task+0x25/0xc2 [<8102474b>] do_fork+0x15c/0x2a0 [<810248a9>] kernel_thread+0x1a/0x1f [<814232a2>] rest_init+0x1a/0x10e [<817af949>] start_kernel+0x303/0x308 [<817af2ab>] i386_start_kernel+0x79/0x7d -> #2 (&p->pi_lock){-.-...}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<810413dd>] try_to_wake_up+0x1d/0xd6 [<810414cd>] default_wake_function+0xb/0xd [<810461f3>] __wake_up_common+0x39/0x59 [<81046346>] __wake_up+0x29/0x3b [<811b8733>] tty_wakeup+0x49/0x51 [<811c3568>] uart_write_wakeup+0x17/0x19 [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb [<811c5f28>] serial8250_handle_irq+0x54/0x6a [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c [<811c56d8>] serial8250_interrupt+0x38/0x9e [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2 [<81051296>] handle_irq_event+0x2c/0x43 [<81052cee>] handle_level_irq+0x57/0x80 [<81002a72>] handle_irq+0x46/0x5c [<810027df>] do_IRQ+0x32/0x89 [<8143036e>] common_interrupt+0x2e/0x33 [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49 [<811c25a4>] uart_start+0x2d/0x32 [<811c2c04>] uart_write+0xc7/0xd6 [<811bc6f6>] n_tty_write+0xb8/0x35e [<811b9beb>] tty_write+0x163/0x1e4 [<811b9cd9>] redirected_tty_write+0x6d/0x75 [<810b6ed6>] vfs_write+0x75/0xb0 [<810b7265>] SyS_write+0x44/0x77 [<8142f8ee>] syscall_call+0x7/0xb -> #1 (&tty->write_wait){-.....}: [<8104a942>] lock_acquire+0x92/0x101 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<81046332>] __wake_up+0x15/0x3b [<811b8733>] tty_wakeup+0x49/0x51 [<811c3568>] uart_write_wakeup+0x17/0x19 [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb [<811c5f28>] serial8250_handle_irq+0x54/0x6a [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c [<811c56d8>] serial8250_interrupt+0x38/0x9e [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2 [<81051296>] handle_irq_event+0x2c/0x43 [<81052cee>] handle_level_irq+0x57/0x80 [<81002a72>] handle_irq+0x46/0x5c [<810027df>] do_IRQ+0x32/0x89 [<8143036e>] common_interrupt+0x2e/0x33 [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49 [<811c25a4>] uart_start+0x2d/0x32 [<811c2c04>] uart_write+0xc7/0xd6 [<811bc6f6>] n_tty_write+0xb8/0x35e [<811b9beb>] tty_write+0x163/0x1e4 [<811b9cd9>] redirected_tty_write+0x6d/0x75 [<810b6ed6>] vfs_write+0x75/0xb0 [<810b7265>] SyS_write+0x44/0x77 [<8142f8ee>] syscall_call+0x7/0xb -> #0 (&port_lock_key){-.....}: [<8104a62d>] __lock_acquire+0x9ea/0xc6d [<8104a942>] lock_acquire+0x92/0x101 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<811c60be>] serial8250_console_write+0x8c/0x10c [<8104e402>] call_console_drivers.constprop.31+0x87/0x118 [<8104f5d5>] console_unlock+0x1d7/0x398 [<8104fb70>] vprintk_emit+0x3da/0x3e4 [<81425f76>] printk+0x17/0x19 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116 [<8105c548>] clockevents_program_event+0xe7/0xf3 [<8105cc1c>] tick_program_event+0x1e/0x23 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f [<8103c49e>] __remove_hrtimer+0x5b/0x79 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66 [<8103cb4b>] hrtimer_cancel+0xd/0x18 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30 [<81080705>] task_clock_event_stop+0x20/0x64 [<81080756>] task_clock_event_del+0xd/0xf [<81081350>] event_sched_out+0xab/0x11e [<810813e0>] group_sched_out+0x1d/0x66 [<81081682>] ctx_sched_out+0xaf/0xbf [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f [<8142cacc>] __schedule+0x4c6/0x4cb [<8142cae0>] schedule+0xf/0x11 [<8142f9a6>] work_resched+0x5/0x30 other info that might help us debug this: Chain exists of: &port_lock_key --> &ctx->lock --> hrtimer_bases.lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(hrtimer_bases.lock); lock(&ctx->lock); lock(hrtimer_bases.lock); lock(&port_lock_key); *** DEADLOCK *** 4 locks held by trinity-main/74: #0: (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb #1: (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f #2: (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66 #3: (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4 stack backtrace: CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003 Call Trace: [<81426f69>] dump_stack+0x16/0x18 [<81425a99>] print_circular_bug+0x18f/0x19c [<8104a62d>] __lock_acquire+0x9ea/0xc6d [<8104a942>] lock_acquire+0x92/0x101 [<811c60be>] ? serial8250_console_write+0x8c/0x10c [<811c6032>] ? wait_for_xmitr+0x76/0x76 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e [<811c60be>] ? serial8250_console_write+0x8c/0x10c [<811c60be>] serial8250_console_write+0x8c/0x10c [<8104af87>] ? lock_release+0x191/0x223 [<811c6032>] ? wait_for_xmitr+0x76/0x76 [<8104e402>] call_console_drivers.constprop.31+0x87/0x118 [<8104f5d5>] console_unlock+0x1d7/0x398 [<8104fb70>] vprintk_emit+0x3da/0x3e4 [<81425f76>] printk+0x17/0x19 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116 [<8105cc1c>] tick_program_event+0x1e/0x23 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f [<8103c49e>] __remove_hrtimer+0x5b/0x79 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66 [<8103cb4b>] hrtimer_cancel+0xd/0x18 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30 [<81080705>] task_clock_event_stop+0x20/0x64 [<81080756>] task_clock_event_del+0xd/0xf [<81081350>] event_sched_out+0xab/0x11e [<810813e0>] group_sched_out+0x1d/0x66 [<81081682>] ctx_sched_out+0xaf/0xbf [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f [<8104416d>] ? __dequeue_entity+0x23/0x27 [<81044505>] ? pick_next_task_fair+0xb1/0x120 [<8142cacc>] __schedule+0x4c6/0x4cb [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108 [<810475b0>] ? trace_hardirqs_off+0xb/0xd [<81056346>] ? rcu_irq_exit+0x64/0x77 Fix the problem by using printk_deferred() which does not call into the scheduler. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
3984bb13c8 |
printk: rename printk_sched to printk_deferred
commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream. After learning we'll need some sort of deferred printk functionality in the timekeeping core, Peter suggested we rename the printk_sched function so it can be reused by needed subsystems. This only changes the function name. No logic changes. Signed-off-by: John Stultz <john.stultz@linaro.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
efd39f7786 |
tracing: Fix wraparound problems in "uptime" trace clock
commit 58d4e21e50ff3cc57910a8abc20d7e14375d2f61 upstream.
The "uptime" trace clock added in:
commit
|
||
|
|
4aba6e3634 |
sched: Fix possible divide by zero in avg_atom() calculation
commit b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream. proc_sched_show_task() does: if (nr_switches) do_div(avg_atom, nr_switches); nr_switches is unsigned long and do_div truncates it to 32 bits, which means it can test non-zero on e.g. x86-64 and be truncated to zero for division. Fix the problem by using div64_ul() instead. As a side effect calculations of avg_atom for big nr_switches are now correct. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
e6be7d3115 |
locking/mutex: Disable optimistic spinning on some architectures
commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream. The optimistic spin code assumes regular stores and cmpxchg() play nice; this is found to not be true for at least: parisc, sparc32, tile32, metag-lock1, arc-!llsc and hexagon. There is further wreckage, but this in particular seemed easy to trigger, so blacklist this. Opt in for known good archs. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Jason Low <jason.low2@hp.com> Cc: Waiman Long <waiman.long@hp.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: John David Anglin <dave.anglin@bell.net> Cc: James Hogan <james.hogan@imgtec.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
804536e8e0 |
PM / sleep: Fix request_firmware() error at resume
commit 4320f6b1d9db4ca912c5eb6ecb328b2e090e1586 upstream. The commit [ |
||
|
|
c933192733 |
alarmtimer: Fix bug where relative alarm timers were treated as absolute
commit 16927776ae757d0d132bdbfabbfe2c498342bd59 upstream. Sharvil noticed with the posix timer_settime interface, using the CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users tried to specify a relative time timer, it would incorrectly be treated as absolute regardless of the state of the flags argument. This patch corrects this, properly checking the absolute/relative flag, as well as adds further error checking that no invalid flag bits are set. Reported-by: Sharvil Nanavati <sharvil@google.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
16de9ea386 |
ring-buffer: Fix polling on trace_pipe
commit 97b8ee845393701edc06e27ccec2876ff9596019 upstream.
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even there is immediate data available. Otherwise, the following epoll and
read sequence will eventually hang forever:
1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever
~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
ring_buffer_poll_wait() returns immediately without adding poll_table,
which has poll_table->_qproc pointing to ep_poll_callback(), to its
wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
it has to call ep_poll_callback() because it is not in its wait queue.
Hence, block forever.
Other poll implementation seems to call poll_wait() unconditionally as the very
first thing to do. For example, tcp_poll() in tcp.c.
Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
Fixes:
|
||
|
|
e250100bed |
tracing: Add ftrace_trace_stack into __trace_puts/__trace_bputs
commit 8abfb8727f4a724d31f9ccfd8013fbd16d539445 upstream. Currently trace option stacktrace is not applicable for trace_printk with constant string argument, the reason is in __trace_puts/__trace_bputs ftrace_trace_stack is missing. In contrast, when using trace_printk with non constant string argument(will call into __trace_printk/__trace_bprintk), then trace option stacktrace is workable, this inconstant result will confuses users a lot. Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com Signed-off-by: zhangwei(Jovi) <jovi.zhangwei@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
9b87c4e58f |
tracing: Fix graph tracer with stack tracer on other archs
commit 5f8bf2d263a20b986225ae1ed7d6759dc4b93af9 upstream. Running my ftrace tests on PowerPC, it failed the test that checks if function_graph tracer is affected by the stack tracer. It was. Looking into this, I found that the update_function_graph_func() must be called even if the trampoline function is not changed. This is because archs like PowerPC do not support ftrace_ops being passed by assembly and instead uses a helper function (what the trampoline function points to). Since this function is not changed even when multiple ftrace_ops are added to the code, the test that falls out before calling update_function_graph_func() will miss that the update must still be done. Call update_function_graph_function() for all calls to update_ftrace_function() Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
2371e977c8 |
rtmutex: Plug slow unlock race
commit 27e35715df54cbc4f2d044f681802ae30479e7fb upstream.
When the rtmutex fast path is enabled the slow unlock function can
create the following situation:
spin_lock(foo->m->wait_lock);
foo->m->owner = NULL;
rt_mutex_lock(foo->m); <-- fast path
free = atomic_dec_and_test(foo->refcnt);
rt_mutex_unlock(foo->m); <-- fast path
if (free)
kfree(foo);
spin_unlock(foo->m->wait_lock); <--- Use after free.
Plug the race by changing the slow unlock to the following scheme:
while (!rt_mutex_has_waiters(m)) {
/* Clear the waiters bit in m->owner */
clear_rt_mutex_waiters(m);
owner = rt_mutex_owner(m);
spin_unlock(m->wait_lock);
if (cmpxchg(m->owner, owner, 0) == owner)
return;
spin_lock(m->wait_lock);
}
So in case of a new waiter incoming while the owner tries the slow
path unlock we have two situations:
unlock(wait_lock);
lock(wait_lock);
cmpxchg(p, owner, 0) == owner
mark_rt_mutex_waiters(lock);
acquire(lock);
Or:
unlock(wait_lock);
lock(wait_lock);
mark_rt_mutex_waiters(lock);
cmpxchg(p, owner, 0) != owner
enqueue_waiter();
unlock(wait_lock);
lock(wait_lock);
wakeup_next waiter();
unlock(wait_lock);
lock(wait_lock);
acquire(lock);
If the fast path is disabled, then the simple
m->owner = NULL;
unlock(m->wait_lock);
is sufficient as all access to m->owner is serialized via
m->wait_lock;
Also document and clarify the wakeup_next_waiter function as suggested
by Oleg Nesterov.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
1201613a70 |
rtmutex: Handle deadlock detection smarter
commit 3d5c9340d1949733eb37616abd15db36aef9a57c upstream. Even in the case when deadlock detection is not requested by the caller, we can detect deadlocks. Right now the code stops the lock chain walk and keeps the waiter enqueued, even on itself. Silly not to yell when such a scenario is detected and to keep the waiter enqueued. Return -EDEADLK unconditionally and handle it at the call sites. The futex calls return -EDEADLK. The non futex ones dequeue the waiter, throw a warning and put the task into a schedule loop. Tagged for stable as it makes the code more robust. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Brad Mouring <bmouring@ni.com> Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |