2936 Commits

Author SHA1 Message Date
Stricted bdecc6d184 Merge tag 'v3.10.108' into update
This is the 3.10.108 stable release
2018-03-21 23:07:40 +01:00
Stricted 073b9047a0 Merge tag 'v3.10.107' into update
This is the 3.10.107 stable release
2018-03-21 23:07:35 +01:00
Stricted 47e5ca72da Merge tag 'v3.10.106' into update
This is the 3.10.106 stable release
2018-03-21 23:06:23 +01:00
Stricted ad957d335c Merge tag 'v3.10.105' into update
This is the 3.10.105 stable release
2018-03-21 23:00:38 +01:00
Stricted b9e7bc93d6 Merge tag 'v3.10.103' into update
This is the 3.10.103 stable release
2018-03-21 22:58:21 +01:00
Stricted a8732f92e3 Merge tag 'v3.10.102' into update
This is the 3.10.102 stable release
2018-03-21 22:54:09 +01:00
Stricted d09f889ac9 Merge tag 'v3.10.99' into update
This is the 3.10.99 stable release
2018-03-21 22:51:42 +01:00
Stricted dd388bd4cd Merge tag 'v3.10.97' into update
This is the 3.10.97 stable release
2018-03-21 22:51:04 +01:00
Stricted 3884fb9807 Merge tag 'v3.10.93' into update
This is the 3.10.93 stable release
2018-03-21 22:49:39 +01:00
Stricted 4511a4e21f Merge tag 'v3.10.92' into update
This is the 3.10.92 stable release
2018-03-21 22:49:35 +01:00
Stricted d50b84c473 Merge tag 'v3.10.91' into update
This is the 3.10.91 stable release
2018-03-21 22:48:36 +01:00
Stricted 8441062777 Merge tag 'v3.10.90' into update
This is the 3.10.90 stable release
2018-03-21 22:47:31 +01:00
Stricted 97a3fbdc30 Merge tag 'v3.10.88' into update
This is the 3.10.88 stable release
2018-03-21 22:47:25 +01:00
Stricted 3460ea59c6 Merge tag 'v3.10.87' into update
This is the 3.10.87 stable release
2018-03-21 22:47:22 +01:00
Stricted 45f8c76c71 Merge tag 'v3.10.86' into update
This is the 3.10.86 stable release
2018-03-21 22:47:17 +01:00
Stricted 38b8911896 Merge tag 'v3.10.85' into update
This is the 3.10.85 stable release
2018-03-21 22:46:39 +01:00
Stricted 5eab702925 Merge tag 'v3.10.80' into update
This is the 3.10.80 stable release
2018-03-21 22:45:22 +01:00
Stricted 4a7de1f3d4 Merge tag 'v3.10.74' into update
This is the 3.10.74 stable release
2018-03-21 22:41:07 +01:00
Stricted aba762bde4 Merge tag 'v3.10.72' into update
This is the 3.10.72 stable release
2018-03-21 22:40:54 +01:00
Stricted 5d8d08710c Merge tag 'v3.10.71' into update
This is the 3.10.71 stable release
2018-03-21 22:40:50 +01:00
Stricted 875966bda8 Merge tag 'v3.10.68' into update
This is the 3.10.68 stable release
2018-03-21 22:38:24 +01:00
Stricted b2d402e5a4 Merge tag 'v3.10.67' into update
This is the 3.10.67 stable release
2018-03-21 22:36:30 +01:00
Stricted d242a5721c Merge tag 'v3.10.64' into update
This is the 3.10.64 stable release
2018-03-21 22:33:51 +01:00
Stricted 7887027a47 Merge tag 'v3.10.61' into update
This is the 3.10.61 stable release
2018-03-21 22:31:40 +01:00
Stricted 6f56b75961 Merge tag 'v3.10.60' into update
This is the 3.10.60 stable release
2018-03-21 22:31:34 +01:00
Stricted f5aa73ff5c Merge tag 'v3.10.57' into update
This is the 3.10.57 stable release
2018-03-21 22:28:46 +01:00
Stricted f29ec40f35 Merge tag 'v3.10.56' into update
This is the 3.10.56 stable release
2018-03-21 22:24:54 +01:00
Stricted b435043299 Merge tag 'v3.10.55' into update
This is the 3.10.55 stable release
2018-03-21 22:13:57 +01:00
Stricted 6fa3eb70c0 import PULS_20160108 2018-03-13 20:29:02 +01:00
Jason Yan 7719043b41 md: fix super_offset endianness in super_1_rdev_size_change
commit 3fb632e40d7667d8bedfabc28850ac06d5493f54 upstream.

The sb->super_offset should be big-endian, but the rdev->sb_start is in
host byte order, so fix this by adding cpu_to_le64.

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-11-02 07:16:19 +01:00
Shaohua Li 634fd95363 md/raid10: submit bio directly to replacement disk
commit 6d399783e9d4e9bd44931501948059d24ad96ff8 upstream.

Commit 57c67df(md/raid10: submit IO from originating thread instead of
md thread) submits bio directly for normal disks but not for replacement
disks. There is no point we shouldn't do this for replacement disks.

Cc: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-11-02 07:16:19 +01:00
NeilBrown 9ac5390b02 md/bitmap: disable bitmap_resize for file-backed bitmaps.
commit e8a27f836f165c26f867ece7f31eb5c811692319 upstream.

bitmap_resize() does not work for file-backed bitmaps.
The buffer_heads are allocated and initialized when
the bitmap is read from the file, but resize doesn't
read from the file, it loads from the internal bitmap.
When it comes time to write the new bitmap, the bh is
non-existent and we crash.

The common case when growing an array involves making the array larger,
and that normally means making the bitmap larger.  Doing
that inside the kernel is possible, but would need more code.
It is probably easier to require people who use file-backed
bitmaps to remove them and re-add after a reshape.

So this patch disables the resizing of arrays which have
file-backed bitmaps.  This is better than crashing.

Reported-by: Zhilong Liu <zlliu@suse.com>
Fixes: d60b479d17 ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
Cc: stable@vger.kernel.org (v3.5+).
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-11-02 07:16:18 +01:00
colyli@suse.de d45256ff3e md linear: fix a race between linear_add() and linear_congested()
commit 03a9e24ef2aaa5f1f9837356aed79c860521407a upstream.

Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
disk to a md linear device causes kernel crash at linear_congested(). From
the crash image analysis, I find in linear_congested(), mddev->raid_disks
contains value N, but conf->disks[] only has N-1 pointers available. Then
a NULL pointer deference crashes the kernel.

There is a race between linear_add() and linear_congested(), RCU stuffs
used in these two functions cannot avoid the race. Since Linuv v4.0
RCU code is replaced by introducing mddev_suspend().  After checking the
upstream code, it seems linear_congested() is not called in
generic_make_request() code patch, so mddev_suspend() cannot provent it
from being called. The possible race still exists.

Here I explain how the race still exists in current code.  For a machine
has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
md linear device; at the same time on other CPU, linear_congested() is
called to detect whether this md linear device is congested before issuing
an I/O request onto it.

Now I use a possible code execution time sequence to demo how the possible
race happens,

seq    linear_add()                linear_congested()
 0                                 conf=mddev->private
 1   oldconf=mddev->private
 2   mddev->raid_disks++
 3                              for (i=0; i<mddev->raid_disks;i++)
 4                                bdev_get_queue(conf->disks[i].rdev->bdev)
 5   mddev->private=newconf

In linear_add() mddev->raid_disks is increased in time seq 2, and on
another CPU in linear_congested() the for-loop iterates conf->disks[i] by
the increased mddev->raid_disks in time seq 3,4. But conf with one more
element (which is a pointer to struct dev_info type) to conf->disks[] is
not updated yet, accessing its structure member in time seq 4 will cause a
NULL pointer deference fault.

To fix this race, there are 2 parts of modification in the patch,
 1) Add 'int raid_disks' in struct linear_conf, as a copy of
    mddev->raid_disks. It is initialized in linear_conf(), always being
    consistent with pointers number of 'struct dev_info disks[]'. When
    iterating conf->disks[] in linear_congested(), use conf->raid_disks to
    replace mddev->raid_disks in the for-loop, then NULL pointer deference
    will not happen again.
 2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
    free oldconf memory. Because oldconf may be referenced as mddev->private
    in linear_congested(), kfree_rcu() makes sure that its memory will not
    be released until no one uses it any more.
Also some code comments are added in this patch, to make this modification
to be easier understandable.

This patch can be applied for kernels since v4.0 after commit:
3be260cc18f8 ("md/linear: remove rcu protections in favour of
suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
people who maintain kernels before Linux v4.0, they need to do some back
back port to this patch.

Changelog:
 - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
       replace rcu_call() in linear_add().
 - v2: add RCU stuffs by suggestion from Shaohua and Neil.
 - v1: initial effort.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-06-20 08:02:59 +02:00
Wei Fang 7a56cc0918 md:raid1: fix a dead loop when read from a WriteMostly disk
commit 816b0acf3deb6d6be5d0519b286fdd4bafade905 upstream.

If first_bad == this_sector when we get the WriteMostly disk
in read_balance(), valid disk will be returned with zero
max_sectors. It'll lead to a dead loop in make_request(), and
OOM will happen because of endless allocation of struct bio.

Since we can't get data from this disk in this case, so
continue for another disk.

Signed-off-by: Wei Fang <fangwei1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-06-20 08:02:58 +02:00
Konstantin Khlebnikov 523d7696aa md/raid5: limit request size according to implementation limits
commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5 upstream.

Current implementation employ 16bit counter of active stripes in lower
bits of bio->bi_phys_segments. If request is big enough to overflow
this counter bio will be completed and freed too early.

Fortunately this not happens in default configuration because several
other limits prevent that: stripe_cache_size * nr_disks effectively
limits count of active stripes. And small max_sectors_kb at lower
disks prevent that during normal read/write operations.

Overflow easily happens in discard if it's enabled by module parameter
"devices_handle_discard_safely" and stripe_cache_size is set big enough.

This patch limits requests size with 256Mb - 8Kb to prevent overflows.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Shaohua Li <shli@kernel.org>
Cc: Neil Brown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-06-20 08:02:58 +02:00
Benjamin Marzinski 8255960354 dm space map metadata: fix 'struct sm_metadata' leak on failed create
commit 314c25c56c1ee5026cf99c570bdfe01847927acb upstream.

In dm_sm_metadata_create() we temporarily change the dm_space_map
operations from 'ops' (whose .destroy function deallocates the
sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't).

If dm_sm_metadata_create() fails in sm_ll_new_metadata() or
sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls
dm_sm_destroy() with the intention of freeing the sm_metadata, but it
doesn't (because the dm_space_map operations is still set to
'bootstrap_ops').

Fix this by setting the dm_space_map operations back to 'ops' if
dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'.

[js] no nr_blocks test in 3.12 yet

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-06-20 08:02:38 +02:00
Ondrej Kozina 3044e1958d dm crypt: mark key as invalid until properly loaded
commit 265e9098bac02bc5e36cda21fdbad34cb5b2f48d upstream.

In crypt_set_key(), if a failure occurs while replacing the old key
(e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag
set.  Otherwise, the crypto layer would have an invalid key that still
has DM_CRYPT_KEY_VALID flag set.

Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-06-20 08:02:38 +02:00
Mikulas Patocka cc3d0c266f dm: flush queued bios when process blocks to avoid deadlock
commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 upstream.

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks
by redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However other deadlocks (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

** the related dm-snapshot deadlock is detailed here:
https://www.redhat.com/archives/dm-devel/2016-July/msg00065.html

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule() call.  Consequently,
when the process blocks on a mutex, the bios queued on
current->bio_list are dispatched to independent workqueus and they can
complete without waiting for the mutex to be available.

The structure blk_plug contains an entry cb_list and this list can contain
arbitrary callback functions that are called when the process blocks.
To implement this fix DM (ab)uses the onstack plug's cb_list interface
to get its flush_current_bio_list() called at schedule() time.

This fixes the snapshot deadlock - if the map method blocks,
flush_current_bio_list() will be called and it redirects bios waiting
on current->bio_list to appropriate workqueues.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-06-08 00:47:02 +02:00
Bart Van Assche fa88cd790f dm: mark request_queue dead before destroying the DM device
commit 3b785fbcf81c3533772c52b717f77293099498d3 upstream.

This avoids that new requests are queued while __dm_destroy() is in
progress.

[js] use md->queue instead of non-present helper

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-02-10 11:04:08 +01:00
Mike Snitzer 46049e5ca6 dm flakey: fix reads to be issued if drop_writes configured
commit 299f6230bc6d0ccd5f95bb0fb865d80a9c7d5ccc upstream.

v4.8-rc3 commit 99f3c90d0d ("dm flakey: error READ bios during the
down_interval") overlooked the 'drop_writes' feature, which is meant to
allow reads to be issued rather than errored, during the down_interval.

Fixes: 99f3c90d0d ("dm flakey: error READ bios during the down_interval")
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2017-02-10 11:04:06 +01:00
Mike Snitzer 8de0e6a718 dm flakey: error READ bios during the down_interval
commit 99f3c90d0d85708e7401a81ce3314e50bf7f2819 upstream.

When the corrupt_bio_byte feature was introduced it caused READ bios to
no longer be errored with -EIO during the down_interval.  This had to do
with the complexity of needing to submit READs if the corrupt_bio_byte
feature was used.

Fix it so READ bios are properly errored with -EIO; doing so early in
flakey_map() as long as there isn't a match for the corrupt_bio_byte
feature.

Fixes: a3998799fb ("dm flakey: add corrupt_bio_byte feature")
Reported-by: Akira Hayakawa <ruby.wktk@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
2016-08-27 11:40:39 +02:00
Jes Sorensen f6d033a309 md/raid5: Compare apples to apples (or sectors to sectors)
commit e7597e69dec59b65c5525db1626b9d34afdfa678 upstream.

'max_discard_sectors' is in sectors, while 'stripe' is in bytes.

This fixes the problem where DISCARD would get disabled on some larger
RAID5 configurations (6 or more drives in my testing), while it worked
as expected with smaller configurations.

Fixes: 620125f2bf ("MD: raid5 trim support")
Cc: stable@vger.kernel.org v3.7+
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2016-06-07 10:42:47 +02:00
Eric Wheeler f5b0d85307 bcache: fix cache_set_flush() NULL pointer dereference on OOM
commit f8b11260a445169989d01df75d35af0f56178f95 upstream.

When bch_cache_set_alloc() fails to kzalloc the cache_set, the
asyncronous closure handling tries to dereference a cache_set that
hadn't yet been allocated inside of cache_set_flush() which is called
by __cache_set_unregister() during cleanup.  This appears to happen only
during an OOM condition on bcache_register.

Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2016-06-07 10:42:46 +02:00
Mikulas Patocka eeb7b0e016 dm snapshot: fix hung bios when copy error occurs
commit 385277bfb57faac44e92497104ba542cdd82d5fe upstream.

When there is an error copying a chunk dm-snapshot can incorrectly hold
associated bios indefinitely, resulting in hung IO.

The function copy_callback sets pe->error if there was error copying the
chunk, and then calls complete_exception.  complete_exception calls
pending_complete on error, otherwise it calls commit_exception with
commit_callback (and commit_callback calls complete_exception).

The persistent exception store (dm-snap-persistent.c) assumes that calls
to prepare_exception and commit_exception are paired.
persistent_prepare_exception increases ps->pending_count and
persistent_commit_exception decreases it.

If there is a copy error, persistent_prepare_exception is called but
persistent_commit_exception is not.  This results in the variable
ps->pending_count never returning to zero and that causes some pending
exceptions (and their associated bios) to be held forever.

Fix this by unconditionally calling commit_exception regardless of
whether the copy was successful.  A new "valid" parameter is added to
commit_exception -- when the copy fails this parameter is set to zero so
that the chunk that failed to copy (and all following chunks) is not
recorded in the snapshot store.  Also, remove commit_callback now that
it is merely a wrapper around pending_complete.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:06:23 -08:00
Joe Thornber 859ac05006 dm btree: fix bufio buffer leaks in dm_btree_del() error path
commit ed8b45a3679eb49069b094c0711b30833f27c734 upstream.

If dm_btree_del()'s call to push_frame() fails, e.g. due to
btree_node_validator finding invalid metadata, the dm_btree_del() error
path must unlock all frames (which have active dm-bufio buffers) that
were pushed onto the del_stack.

Otherwise, dm_bufio_client_destroy() will BUG_ON() because dm-bufio
buffers have leaked, e.g.:
  device-mapper: bufio: leaked buffer 3, hold count 1, list 0

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:06:22 -08:00
Nikolay Borisov 86d80ecd4f dm thin: fix race condition when destroying thin pool workqueue
commit 18d03e8c25f173f4107a40d0b8c24defb6ed69f3 upstream.

When a thin pool is being destroyed delayed work items are
cancelled using cancel_delayed_work(), which doesn't guarantee that on
return the delayed item isn't running.  This can cause the work item to
requeue itself on an already destroyed workqueue.  Fix this by using
cancel_delayed_work_sync() which guarantees that on return the work item
is not running anymore.

Fixes: 905e51b39a ("dm thin: commit outstanding data every second")
Fixes: 85ad643b7e7e5 ("dm thin: add timeout to stop out-of-data-space mode holding IO forever")
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:06:19 -08:00
Joe Thornber 9bb86db161 dm thin metadata: fix bug when taking a metadata snapshot
commit 49e99fc717f624aa75ca755d6e7bc029efd3f0e9 upstream.

When you take a metadata snapshot the btree roots for the mapping and
details tree need to have their reference counts incremented so they
persist for the lifetime of the metadata snap.

The roots being incremented were those currently written in the
superblock, which could possibly be out of date if concurrent IO is
triggering new mappings, breaking of sharing, etc.

Fix this by performing a commit with the metadata lock held while taking
a metadata snapshot.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:06:19 -08:00
Zheng Liu 9fac660099 bcache: unregister reboot notifier if bcache fails to unregister device
commit 2ecf0cdb2b437402110ab57546e02abfa68a716b upstream.

In bcache_init() function it forgot to unregister reboot notifier if
bcache fails to unregister a block device.  This commit fixes this.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Tested-by: Joshua Schmid <jschmid@suse.com>
Tested-by: Eric Wheeler <bcache@linux.ewheeler.net>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:06:19 -08:00
Mauricio Faria de Oliveira 0fd9b8b499 Revert "dm mpath: fix stalls when handling invalid ioctls"
commit 47796938c46b943d157ac8a6f9ed4e3b98b83cf4 upstream.

This reverts commit a1989b330093578ea5470bea0a00f940c444c466.

That commit introduced a regression at least for the case of the SG_IO ioctl()
running without CAP_SYS_RAWIO capability (e.g., unprivileged users) when there
are no active paths: the ioctl() fails with the ENOTTY errno immediately rather
than blocking due to queue_if_no_path until a path becomes active, for example.

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "<disk type='block' device='lun'>" [2])
from multipath devices; which leads to SCSI/filesystem errors in such a guest.

More general scenarios can hit that regression too. The following demonstration
employs a SG_IO ioctl() with a standard SCSI INQUIRY command for this objective
(some output & user changes omitted for brevity and comments added for clarity).

Reverting that commit restores normal operation (queueing) in failing scenarios;
tested on linux-next (next-20151022).

1) Test-case is based on sg_simple0 [3] (just SG_IO; remove SG_GET_VERSION_NUM)

    $ cat sg_simple0.c
    ... see [3] ...
    $ sed '/SG_GET_VERSION_NUM/,/}/d' sg_simple0.c > sgio_inquiry.c
    $ gcc sgio_inquiry.c -o sgio_inquiry

2) The ioctl() works fine with active paths present.

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=active
    | |- 8:0:11:0  sdz  65:144  active undef running
    | `- 9:0:9:0   sdbf 67:144  active undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  active undef running
      `- 9:0:12:0  sdbo 68:32   active undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    Some of the INQUIRY command's response:
        IBM       2145              0000
    INQUIRY duration=0 millisecs, resid=0

3) The ioctl() fails with ENOTTY errno with _no_ active paths present,
   for unprivileged users (rather than blocking due to queue_if_no_path).

    # for path in $(multipath -l 85ag56 | grep -o 'sd[a-z]\+'); \
          do multipathd -k"fail path $path"; done

    # multipath -l 85ag56
    85ag56 (...) dm-19 IBM     ,2145
    size=60G features='1 queue_if_no_path' hwhandler='0' wp=rw
    |-+- policy='service-time 0' prio=0 status=enabled
    | |- 8:0:11:0  sdz  65:144  failed undef running
    | `- 9:0:9:0   sdbf 67:144  failed undef running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 8:0:12:0  sdae 65:224  failed undef running
      `- 9:0:12:0  sdbo 68:32   failed undef running

    $ ./sgio_inquiry /dev/mapper/85ag56
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

4) dmesg shows that scsi_verify_blk_ioctl() failed for SG_IO (0x2285);
   it returns -ENOIOCTLCMD, later replaced with -ENOTTY in vfs_ioctl().

    $ dmesg
    <...>
    [] device-mapper: multipath: Failing path 65:144.
    [] device-mapper: multipath: Failing path 67:144.
    [] device-mapper: multipath: Failing path 65:224.
    [] device-mapper: multipath: Failing path 68:32.
    [] sgio_inquiry: sending ioctl 2285 to a partition!

5) The ioctl() only works if the SYS_CAP_RAWIO capability is present
   (then queueing happens -- in this example, queue_if_no_path is set);
   this is due to a conditional check in scsi_verify_blk_ioctl().

    # capsh --drop=cap_sys_rawio -- -c './sgio_inquiry /dev/mapper/85ag56'
    sg_simple0: Inquiry SG_IO ioctl error: Inappropriate ioctl for device

    # ./sgio_inquiry /dev/mapper/85ag56 &
    [1] 72830

    # cat /proc/72830/stack
    [<c00000171c0df700>] 0xc00000171c0df700
    [<c000000000015934>] __switch_to+0x204/0x350
    [<c000000000152d4c>] msleep+0x5c/0x80
    [<c00000000077dfb0>] dm_blk_ioctl+0x70/0x170
    [<c000000000487c40>] blkdev_ioctl+0x2b0/0x9b0
    [<c0000000003128e4>] block_ioctl+0x64/0xd0
    [<c0000000002dd3b0>] do_vfs_ioctl+0x490/0x780
    [<c0000000002dd774>] SyS_ioctl+0xd4/0xf0
    [<c000000000009358>] system_call+0x38/0xd0

6) This is the function call chain exercised in this analysis:

SYSCALL_DEFINE3(ioctl, <...>) @ fs/ioctl.c
    -> do_vfs_ioctl()
        -> vfs_ioctl()
            ...
            error = filp->f_op->unlocked_ioctl(filp, cmd, arg);
            ...
                -> dm_blk_ioctl() @ drivers/md/dm.c
                    -> multipath_ioctl() @ drivers/md/dm-mpath.c
                        ...
                        (bdev = NULL, due to no active paths)
                        ...
                        if (!bdev || <...>) {
                            int err = scsi_verify_blk_ioctl(NULL, cmd);
                            if (err)
                                r = err;
                        }
                        ...
                            -> scsi_verify_blk_ioctl() @ block/scsi_ioctl.c
                                ...
                                if (bd && bd == bd->bd_contains) // not taken (bd = NULL)
                                    return 0;
                                ...
                                if (capable(CAP_SYS_RAWIO)) // not taken (unprivileged user)
                                    return 0;
                                ...
                                printk_ratelimited(KERN_WARNING
                                           "%s: sending ioctl %x to a partition!\n" <...>);

                                return -ENOIOCTLCMD;
                            <-
                        ...
                        return r ? : <...>
                    <-
            ...
            if (error == -ENOIOCTLCMD)
                error = -ENOTTY;
             out:
                return error;
            ...

Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')
[3] http://tldp.org/HOWTO/SCSI-Generic-HOWTO/pexample.html (Revision 1.2, 2002-05-03)

Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-19 14:22:38 -08:00
Mike Snitzer d6671b052b dm btree: fix leak of bufio-backed block in btree_split_sibling error path
commit 30ce6e1cc5a0f781d60227e9096c86e188d2c2bd upstream.

The block allocated at the start of btree_split_sibling() is never
released if later insert_at() fails.

Fix this by releasing the previously allocated bufio block using
unlock_block().

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-19 14:22:37 -08:00