git.proxmox.com Git - mirror_ubuntu-zesty-kernel.git/log
mirror_ubuntu-zesty-kernel.git
8 years agoceph: handle interrupted ceph_writepage()
Yan, Zheng [Fri, 13 May 2016 09:29:51 +0000 (17:29 +0800)]
ceph: handle interrupted ceph_writepage()

writepage() can be interrupted when it's called by the direct memory
reclaimer (i.e. when the direct memory reclaimer is killed). To avoid
losing data, we redirty the page.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: make ceph_update_writeable_page() uninterruptible
Yan, Zheng [Fri, 13 May 2016 03:30:24 +0000 (11:30 +0800)]
ceph: make ceph_update_writeable_page() uninterruptible

ceph_update_writeable_page() is used by ceph_write_begin(). It breaks
the atomicity of the write operation if it's interruptible.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agolibceph: make ceph_osdc_wait_request() uninterruptible
Yan, Zheng [Fri, 13 May 2016 03:04:33 +0000 (11:04 +0800)]
libceph: make ceph_osdc_wait_request() uninterruptible

ceph_osdc_wait_request() is used when cephfs issues sync IO. In most
cases, the sync IO should be uninterruptible. The fix is to use a killable
wait function in ceph_osdc_wait_request().

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: handle -EAGAIN returned by ceph_update_writeable_page()
Yan, Zheng [Tue, 10 May 2016 11:09:06 +0000 (19:09 +0800)]
ceph: handle -EAGAIN returned by ceph_update_writeable_page()

When ceph_update_writeable_page() returns -EAGAIN, the caller should
lock the page and call ceph_update_writeable_page() again.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
Yan, Zheng [Tue, 10 May 2016 10:59:13 +0000 (18:59 +0800)]
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: block non-fatal signals for fault/page_mkwrite
Yan, Zheng [Tue, 10 May 2016 10:40:28 +0000 (18:40 +0800)]
ceph: block non-fatal signals for fault/page_mkwrite

Fault and page_mkwrite are supposed to be uninterruptible. But they
call ceph functions that are interruptible. So they should block
signals before calling functions that are interruptible.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: make logical calculation functions return bool
Zhang Zhuoyu [Fri, 25 Mar 2016 09:18:39 +0000 (05:18 -0400)]
ceph: make logical calculation functions return bool

This patch makes several logical calculation functions return bool to
improve readability, since these particular functions only use 0/1
as their return value.

No functional change.

Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
8 years agoceph: tolerate bad i_size for symlink inode
Yan, Zheng [Thu, 5 May 2016 08:40:17 +0000 (16:40 +0800)]
ceph: tolerate bad i_size for symlink inode

An MDS bug can cause a symlink's size to be truncated to zero.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: improve fragtree change detection
Yan, Zheng [Wed, 4 May 2016 03:40:30 +0000 (11:40 +0800)]
ceph: improve fragtree change detection

Check whether the number of splits in i_fragtree is equal to the number
of splits in the mds reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: keep leaf frag when updating fragtree
Yan, Zheng [Wed, 4 May 2016 03:05:10 +0000 (11:05 +0800)]
ceph: keep leaf frag when updating fragtree

Nodes in i_fragtree are sorted according to ceph_frag_compare().
This means a frag node in i_fragtree always follows its direct parent
node. To check if a leaf node is valid, we just need to check if
it's a child of the previous split node.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: fix dir_auth check in ceph_fill_dirfrag()
Yan, Zheng [Tue, 3 May 2016 14:33:20 +0000 (22:33 +0800)]
ceph: fix dir_auth check in ceph_fill_dirfrag()

-1 is CDIR_AUTH_PARENT; it means the dir's auth mds is the same as
the inode's auth mds.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: don't assume frag tree splits in mds reply are sorted
Yan, Zheng [Tue, 3 May 2016 12:55:50 +0000 (20:55 +0800)]
ceph: don't assume frag tree splits in mds reply are sorted

The algorithm that updates i_fragtree relies on the frag tree splits
in the mds reply being in the same order as i_fragtree. This is not
true because the current MDS encodes frag tree splits in ascending order
of (unsigned)frag_t, but nodes in i_fragtree are sorted according to
ceph_frag_compare().

The fix is to sort the frag tree splits first, then update i_fragtree.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
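
The ordering mismatch described above can be illustrated with a small userspace sketch. It assumes a frag packs its bit count in the upper 8 bits and its value in the lower 24 bits, and that the tree order compares the value first and the bit count second; the helper names (frag_bits, frag_value, frag_compare, sort_frag_splits) are hypothetical and are not the kernel's functions.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed encoding: upper 8 bits = number of bits, lower 24 bits = value. */
static uint32_t frag_bits(uint32_t f)  { return f >> 24; }
static uint32_t frag_value(uint32_t f) { return f & 0xffffffu; }

/* Assumed tree order: compare the value first, then the bit count. */
static int frag_compare(uint32_t a, uint32_t b)
{
    if (frag_value(a) != frag_value(b))
        return frag_value(a) < frag_value(b) ? -1 : 1;
    if (frag_bits(a) != frag_bits(b))
        return frag_bits(a) < frag_bits(b) ? -1 : 1;
    return 0;
}

/* qsort() adapter for frag_compare(). */
static int frag_compare_qsort(const void *pa, const void *pb)
{
    return frag_compare(*(const uint32_t *)pa, *(const uint32_t *)pb);
}

/* Sort the splits taken from a reply so they can be walked in tree order. */
static void sort_frag_splits(uint32_t *splits, size_t n)
{
    qsort(splits, n, sizeof(*splits), frag_compare_qsort);
}
```

Under these assumptions, plain unsigned order places 0x01800000 before 0x02000000, while the tree order places it after, which is why sorting before the update matters.
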
8 years agoceph: fix inode reference leak
Yan, Zheng [Fri, 29 Apr 2016 15:40:23 +0000 (23:40 +0800)]
ceph: fix inode reference leak

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: using hash value to compose dentry offset
Yan, Zheng [Fri, 29 Apr 2016 03:27:30 +0000 (11:27 +0800)]
ceph: using hash value to compose dentry offset

If the MDS sorts dentries in a dirfrag in hash order, we use the hash
value to compose the dentry offset. The dentry offset is:

  (0xff << 52) | ((24 bits hash) << 28) |
  (the nth entry has hash collision)

This offset is stable across directory fragmentation. This also means
there is no need to reset the readdir offset if the directory gets
fragmented in the middle of readdir.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
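
As a rough illustration of the offset layout in the formula above, here is a minimal userspace sketch. The macro and function names are hypothetical; only the bit positions (0xff at bit 52, a 24-bit hash at bit 28, the collision index in the low bits) come from the commit message.

```c
#include <stdint.h>

#define SKETCH_HASH_MARK   ((uint64_t)0xff << 52)      /* marker bits from the formula */
#define SKETCH_HASH_SHIFT  28
#define SKETCH_HASH_MASK   0xffffffu                   /* 24-bit hash */
#define SKETCH_NTH_MASK    ((1u << SKETCH_HASH_SHIFT) - 1)

/* Compose an offset from a 24-bit hash and the index among colliding entries. */
static uint64_t make_hash_offset(uint32_t hash, uint32_t nth)
{
    return SKETCH_HASH_MARK |
           ((uint64_t)(hash & SKETCH_HASH_MASK) << SKETCH_HASH_SHIFT) |
           (nth & SKETCH_NTH_MASK);
}

/* Recover the hash part of an offset. */
static uint32_t offset_hash(uint64_t off)
{
    return (uint32_t)(off >> SKETCH_HASH_SHIFT) & SKETCH_HASH_MASK;
}

/* Recover the collision index part of an offset. */
static uint32_t offset_nth(uint64_t off)
{
    return (uint32_t)off & SKETCH_NTH_MASK;
}
```

Because the hash sits above the collision index, offsets compare in hash order regardless of which frag an entry currently lives in.
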
8 years agoceph: don't forbid marking directory complete after forward seek
Yan, Zheng [Thu, 28 Apr 2016 14:56:44 +0000 (22:56 +0800)]
ceph: don't forbid marking directory complete after forward seek

A forward seek within the same frag does not update fi->last_name, so it
will not affect the contents of later readdir replies. So there is no need
to forbid marking the directory complete.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: record 'offset' for each entry of readdir result
Yan, Zheng [Thu, 28 Apr 2016 07:17:40 +0000 (15:17 +0800)]
ceph: record 'offset' for each entry of readdir result

This is preparation for using hash value as dentry 'offset'

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: define 'end/complete' in readdir reply as bit flags
Yan, Zheng [Wed, 27 Apr 2016 09:48:30 +0000 (17:48 +0800)]
ceph: define 'end/complete' in readdir reply as bit flags

Set a flag in the readdir request to indicate that the client interprets
'end/complete' as bit flags, so that the mds can send additional flags in
the readdir reply.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: define struct for dir entry in readdir reply
Yan, Zheng [Thu, 28 Apr 2016 01:37:39 +0000 (09:37 +0800)]
ceph: define struct for dir entry in readdir reply

This avoids defining multiple arrays for entries in readdir reply

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: simplify 'offset in frag'
Yan, Zheng [Wed, 27 Apr 2016 09:32:34 +0000 (17:32 +0800)]
ceph: simplify 'offset in frag'

Don't distinguish the leftmost frag from other frags. Always use 2 as
the first entry's offset.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: remove unnecessary checks in __dcache_readdir
Yan, Zheng [Fri, 29 Apr 2016 07:58:32 +0000 (15:58 +0800)]
ceph: remove unnecessary checks in __dcache_readdir

we never add snapdir and the hidden .ceph dir into readdir cache

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: search cache postion for dcache readdir
Yan, Zheng [Thu, 28 Apr 2016 09:43:35 +0000 (17:43 +0800)]
ceph: search cache postion for dcache readdir

Use binary search to find the cache index that corresponds to the readdir
position.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
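
A binary search of this kind could look roughly like the sketch below. It assumes the cached entries are represented by an ascending array of offsets, which is a simplification; the function name find_cache_index is hypothetical and the real readdir cache is not a flat array.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first cached entry whose offset is >= pos,
 * or n if every entry is below pos. offsets[] must be ascending.
 */
static size_t find_cache_index(const uint64_t *offsets, size_t n, uint64_t pos)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (offsets[mid] < pos)
            lo = mid + 1;   /* answer lies strictly to the right of mid */
        else
            hi = mid;       /* mid may be the answer; keep it in range */
    }
    return lo;
}
```
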
8 years agoceph: use CEPH_MDS_OP_RMXATTR request to remove xattr
Yan, Zheng [Thu, 21 Apr 2016 04:11:54 +0000 (12:11 +0800)]
ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr

Setxattr with a NULL value and the XATTR_REPLACE flag should be equivalent
to removexattr. But the current MDS does not support deleting vxattrs through
an MDS_OP_SETXATTR request. The workaround is to send an MDS_OP_RMXATTR
request if setxattr actually removes an xattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
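
The decision described above can be sketched as a small selection function. The enum and function names are hypothetical stand-ins for the real request opcodes; SKETCH_XATTR_REPLACE uses the same value as XATTR_REPLACE in linux/xattr.h, and the sketch ignores every other code path in the real setxattr implementation.

```c
#include <stddef.h>

#define SKETCH_XATTR_REPLACE 0x2   /* same value as XATTR_REPLACE */

/* Hypothetical stand-ins for the two request types named in the message. */
enum sketch_op { SKETCH_OP_SETXATTR, SKETCH_OP_RMXATTR };

/* A NULL value together with the replace flag means "remove this xattr". */
static enum sketch_op pick_xattr_op(const void *value, int flags)
{
    if (value == NULL && (flags & SKETCH_XATTR_REPLACE))
        return SKETCH_OP_RMXATTR;
    return SKETCH_OP_SETXATTR;
}
```
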
8 years agoceph: report mount root in session metadata
Yan, Zheng [Thu, 21 Apr 2016 03:09:55 +0000 (11:09 +0800)]
ceph: report mount root in session metadata

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: don't show symlink target in debugfs/mdsc
Yan, Zheng [Mon, 18 Apr 2016 08:51:37 +0000 (16:51 +0800)]
ceph: don't show symlink target in debugfs/mdsc

The symlink target is useless for debugging and can be very long. It's
annoying to show it in debugfs/mdsc.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: don't call truncate_pagecache in ceph_writepages_start
Yan, Zheng [Fri, 15 Apr 2016 05:56:12 +0000 (13:56 +0800)]
ceph: don't call truncate_pagecache in ceph_writepages_start

truncate_pagecache() may decrease the inode's reference count. This can
cause a deadlock if the inode's last reference is dropped and iput_final()
wants to evict the inode. (evict() calls inode_wait_for_writeback(), which
waits for ceph_writepages_start() to return.)

The fix is to use a work thread to truncate dirty pages. Also add a 'forced
umount' check to ceph_update_writeable_page(), which prevents new
pages from getting dirty.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: renew caps for read/write if mds session got killed.
Yan, Zheng [Fri, 8 Apr 2016 07:27:16 +0000 (15:27 +0800)]
ceph: renew caps for read/write if mds session got killed.

When an mds session gets killed, read/write operations may hang.
The client waits for Frw caps, but the mds does not know what caps the
client wants. To recover from this, the client sends an open request to
the mds. The request will tell the mds what caps the client wants.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: CEPH_FEATURE_MDSENC support
Yan, Zheng [Thu, 31 Mar 2016 07:53:01 +0000 (15:53 +0800)]
ceph: CEPH_FEATURE_MDSENC support

Signed-off-by: Yan, Zheng <zyan@redhat.com>
8 years agoceph: multiple filesystem support
Yan, Zheng [Wed, 30 Mar 2016 09:18:34 +0000 (17:18 +0800)]
ceph: multiple filesystem support

To access non-default filesystem, we just need to subscribe to
mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for mds
namespace id.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: support for subscribing to "mdsmap.<id>" maps
Ilya Dryomov [Wed, 25 May 2016 22:05:01 +0000 (00:05 +0200)]
libceph: support for subscribing to "mdsmap.<id>" maps

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: replace ceph_monc_request_next_osdmap()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:28 +0000 (16:07 +0200)]
libceph: replace ceph_monc_request_next_osdmap()

... with a wrapper around maybe_request_map() - no need for two
osdmap-specific functions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: take osdc->lock in osdmap_show() and dump flags in hex
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: take osdc->lock in osdmap_show() and dump flags in hex

There are now about a dozen CEPH_OSDMAP_* flags.  This is a debugging
interface, so just dump in hex instead of spelling each flag out.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: pool deletion detection
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: pool deletion detection

This adds the "map check" infrastructure for sending osdmap version
checks on CALC_TARGET_POOL_DNE and completing in-flight requests with
-ENOENT if the target pool doesn't exist or has just been deleted.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: async MON client generic requests
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: async MON client generic requests

For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion.  Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version().  ceph_monc_do_statfs() is
switched over and remains sync.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: support for checking on status of watch
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: support for checking on status of watch

Implement ceph_osdc_watch_check() to be able to check on status of
watch.  Note that the time it takes for a watch/notify event to get
delivered through the notify_wq is taken into account.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: support for sending notifies
Ilya Dryomov [Thu, 28 Apr 2016 14:07:27 +0000 (16:07 +0200)]
libceph: support for sending notifies

Implement ceph_osdc_notify() for sending notifies.

Due to the fact that the current messenger can't do read-in into
pagelists (it can only do write-out from them), I had to go with a page
vector for a NOTIFY_COMPLETE payload, for now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph, rbd: ceph_osd_linger_request, watch/notify v2
Ilya Dryomov [Wed, 25 May 2016 23:15:02 +0000 (01:15 +0200)]
libceph, rbd: ceph_osd_linger_request, watch/notify v2

This adds support and switches rbd to a new, more reliable version of
watch/notify protocol.  As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed.  watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().

A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request.  ceph_osd_event has been merged into
ceph_osd_linger_request.

All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.

ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.

Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agorbd: rbd_dev_header_unwatch_sync() variant
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
rbd: rbd_dev_header_unwatch_sync() variant

Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
callbacks.  This is for the new rados_watcherrcb_t, which would be
called from a notify callback.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: wait_request_timeout()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: wait_request_timeout()

The unwatch timeout is currently implemented in rbd.  With
watch/unwatch code moving into libceph, we are going to need
a ceph_osdc_wait_request() variant with a timeout.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: request_init() and request_release_checks()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: request_init() and request_release_checks()

These are going to be used by request_reinit() code.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: a major OSD client update
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: a major OSD client update

This is a major sync up, up to ~Jewel.  The highlights are:

- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support

The switchover is incomplete: lingering requests can be set up and
torn down but aren't ever reestablished.  This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: protect osdc->osd_lru list with a spinlock
Ilya Dryomov [Thu, 28 Apr 2016 14:07:26 +0000 (16:07 +0200)]
libceph: protect osdc->osd_lru list with a spinlock

OSD client is getting moved from the big per-client lock to a set of
per-session locks.  The big rwlock would only be held for read most of
the time, so a global osdc->osd_lru needs additional protection.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: allocate ceph_osd with GFP_NOFAIL
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: allocate ceph_osd with GFP_NOFAIL

create_osd() is called way too deep in the stack to be able to error
out in a sane way; a failing create_osd() just messes everything up.
The current req_notarget list solution is broken - the list is never
traversed as it's not entirely clear when to do it, I guess.

If we were to start traversing it at regular intervals and retrying
each request, we wouldn't be far off from what __GFP_NOFAIL is doing,
so allocate OSD sessions with __GFP_NOFAIL, at least until we come up
with a better fix.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: osd_init() and osd_cleanup()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: osd_init() and osd_cleanup()

These are going to be used by homeless OSD sessions code.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: handle_one_map()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: handle_one_map()

Separate osdmap handling from decoding and iterating over a bag of maps
in a fresh MOSDMap message.  This sets up the scene for the updated OSD
client.

Of particular importance here is the addition of pi->was_full, which
can be used to answer "did this pool go full -> not-full in this map?".
This is the key bit for supporting pool quotas.

We won't be able to downgrade map_sem for much longer, so drop
downgrade_write().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: allocate dummy osdmap in ceph_osdc_init()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:25 +0000 (16:07 +0200)]
libceph: allocate dummy osdmap in ceph_osdc_init()

This leads to a simpler osdmap handling code, particularly when dealing
with pi->was_full, which is introduced in a later commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: schedule tick from ceph_osdc_init()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: schedule tick from ceph_osdc_init()

Both homeless OSD sessions and watch/notify v2, introduced in later
commits, require periodic ticks which don't depend on ->num_requests.
Schedule the initial tick from ceph_osdc_init() and reschedule from
handle_timeout() unconditionally.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: move schedule_delayed_work() in ceph_osdc_init()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: move schedule_delayed_work() in ceph_osdc_init()

ceph_osdc_stop() isn't called if ceph_osdc_init() fails, so we end up
with handle_osds_timeout() running on invalid memory if any one of the
allocations fails.  Call schedule_delayed_work() after everything is
set up, just before returning.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: redo callbacks and factor out MOSDOpReply decoding
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: redo callbacks and factor out MOSDOpReply decoding

If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack.  This is
very confusing.  Redo this so that only one of them is called:

    ->r_unsafe_callback(true), on ack
    ->r_unsafe_callback(false), on commit

or

    ->r_callback, on ack|commit

Decode everything in decode_MOSDOpReply() to reduce clutter.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: drop msg argument from ceph_osdc_callback_t
Ilya Dryomov [Thu, 28 Apr 2016 14:07:24 +0000 (16:07 +0200)]
libceph: drop msg argument from ceph_osdc_callback_t

finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success.  This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: switch to calc_target(), part 2
Ilya Dryomov [Wed, 25 May 2016 22:29:52 +0000 (00:29 +0200)]
libceph: switch to calc_target(), part 2

The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target.  Encoding now happens within ceph_osdc_start_request().

Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded.  If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.

Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op().  While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all calls to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: switch to calc_target(), part 1
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: switch to calc_target(), part 1

Replace __calc_request_pg() and most of __map_request() with
calc_target() and start using req->r_t.

ceph_osdc_build_request() however still encodes base_oid, because it's
called before calc_target() is and target_oid is empty at that point in
time; a printf in osdc_show() also shows base_oid.  This is fixed in
"libceph: switch to calc_target(), part 2".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: introduce ceph_osd_request_target, calc_target()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: introduce ceph_osd_request_target, calc_target()

Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: pi->min_size, pi->last_force_request_resend
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: pi->min_size, pi->last_force_request_resend

Add and decode pi->min_size and pi->last_force_request_resend.  These
are going to be used by calc_target().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: make pgid_cmp() global
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: make pgid_cmp() global

calc_target() code is going to need to know how to compare PGs.  Take
lhs and rhs pgid by const * while at it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: rename ceph_calc_pg_primary()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:23 +0000 (16:07 +0200)]
libceph: rename ceph_calc_pg_primary()

Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: ceph_osds, ceph_pg_to_up_acting_osds()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: ceph_osds, ceph_pg_to_up_acting_osds()

Knowing just the acting set isn't enough; we need to be able to record the
up set as well to detect interval changes.  This means returning (up[],
up_len, up_primary, acting[], acting_len, acting_primary) and passing
it around.  Introduce and switch to ceph_osds to help with that.

Rename ceph_calc_pg_acting() to ceph_pg_to_up_acting_osds() and return
both up and acting sets from it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
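
To make the idea of bundling (osds[], len, primary) into one structure concrete, here is a hedged sketch. The struct, the size bound, and both helpers are hypothetical, and the comparison covers only the "did the up or acting set change" part; a real interval-change check would need to consider more than this.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_PG_MAX_SIZE 16   /* hypothetical upper bound on set size */

/* One OSD set: the member ids, how many are valid, and the primary. */
struct sketch_osds {
    int osds[SKETCH_PG_MAX_SIZE];
    int size;
    int primary;
};

/* Two sets are equal if size, primary and the valid members all match. */
static bool osds_equal(const struct sketch_osds *a, const struct sketch_osds *b)
{
    return a->size == b->size &&
           a->primary == b->primary &&
           memcmp(a->osds, b->osds,
                  (size_t)a->size * sizeof(a->osds[0])) == 0;
}

/* Report whether either the up set or the acting set differs. */
static bool sets_changed(const struct sketch_osds *old_up,
                         const struct sketch_osds *new_up,
                         const struct sketch_osds *old_acting,
                         const struct sketch_osds *new_acting)
{
    return !osds_equal(old_up, new_up) || !osds_equal(old_acting, new_acting);
}
```

Passing two small structs around is easier to get right than threading six separate arguments through every caller.
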
8 years agolibceph: rename ceph_oloc_oid_to_pg()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: rename ceph_oloc_oid_to_pg()

Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg().  Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: fix ceph_eversion encoding
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: fix ceph_eversion encoding

eversion_t is version+epoch in userspace and is encoded in that order.
ceph_eversion is defined as epoch+version in rados.h, yet we memcpy it
in __send_request().  Reorder ceph_eversion fields.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
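
Why field order matters when a struct is copied with memcpy can be shown with a short layout sketch. The struct name is hypothetical, the field widths (64-bit version, 32-bit epoch) are an assumption, and the packed attribute is a GCC/Clang extension; byte order within each field is left aside here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout after the fix: version first, then epoch, no padding. */
struct sketch_eversion {
    uint64_t version;
    uint32_t epoch;
} __attribute__((packed));

/*
 * Copy the struct verbatim into an output buffer. This only produces
 * the expected encoding if the in-memory field order already matches
 * the encoded order, which is the point of reordering the fields.
 */
static size_t encode_eversion(uint8_t *buf, const struct sketch_eversion *ev)
{
    memcpy(buf, ev, sizeof(*ev));
    return sizeof(*ev);
}
```
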
8 years agolibceph: DEFINE_RB_FUNCS macro
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: DEFINE_RB_FUNCS macro

Given

    struct foo {
        u64 id;
        struct rb_node bar_node;
    };

generate insert_bar(), erase_bar() and lookup_bar() functions with

    DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)

The key is assumed to be an integer (u64, int, etc), compared with
< and >.  nodefld has to be initialized with RB_CLEAR_NODE().

Start using it for MDS, MON and OSD requests and OSD sessions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
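
The macro-generation technique can be sketched in userspace as follows. This is not the kernel macro: it uses a plain unbalanced binary search tree instead of an rbtree, generates only insert and lookup (no erase), silently places duplicate keys to the right, and relies on the GCC/Clang __typeof__ extension. The names tree_node, tree_root, node_entry and DEFINE_TREE_FUNCS are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct tree_node { struct tree_node *left, *right; };
struct tree_root { struct tree_node *node; };

/* Map an embedded node back to its containing structure. */
#define node_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Generate insert_<name>() and lookup_<name>() for an integer key field. */
#define DEFINE_TREE_FUNCS(name, type, keyfld, nodefld)                  \
static void insert_##name(struct tree_root *root, type *t)             \
{                                                                       \
    struct tree_node **n = &root->node;                                 \
                                                                        \
    while (*n) {                                                        \
        type *cur = node_entry(*n, type, nodefld);                      \
                                                                        \
        if (t->keyfld < cur->keyfld)                                    \
            n = &(*n)->left;                                            \
        else                                                            \
            n = &(*n)->right;                                           \
    }                                                                   \
    t->nodefld.left = t->nodefld.right = NULL;                          \
    *n = &t->nodefld;                                                   \
}                                                                       \
static type *lookup_##name(struct tree_root *root,                     \
                           __typeof__(((type *)0)->keyfld) key)         \
{                                                                       \
    struct tree_node *n = root->node;                                   \
                                                                        \
    while (n) {                                                         \
        type *cur = node_entry(n, type, nodefld);                       \
                                                                        \
        if (key < cur->keyfld)                                          \
            n = n->left;                                                \
        else if (key > cur->keyfld)                                     \
            n = n->right;                                               \
        else                                                            \
            return cur;                                                 \
    }                                                                   \
    return NULL;                                                        \
}

/* Same shape as the example in the commit message. */
struct foo {
    uint64_t id;
    struct tree_node bar_node;
};

DEFINE_TREE_FUNCS(bar, struct foo, id, bar_node)
```

The single macro invocation yields insert_bar() and lookup_bar() keyed on foo.id, which removes the near-identical hand-written tree walkers for each request type.
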
8 years agolibceph: open-code remove_{all,old}_osds()
Ilya Dryomov [Thu, 28 Apr 2016 14:07:22 +0000 (16:07 +0200)]
libceph: open-code remove_{all,old}_osds()

They are called only once, from ceph_osdc_stop() and
handle_osds_timeout() respectively.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: nuke unused fields and functions
Ilya Dryomov [Thu, 28 Apr 2016 14:07:21 +0000 (16:07 +0200)]
libceph: nuke unused fields and functions

Either unused or useless:

    osdmap->mkfs_epoch
    osd->o_marked_for_keepalive
    monc->num_generic_requests
    osdc->map_waiters
    osdc->last_requested_map
    osdc->timeout_tid

    osd_req_op_cls_response_data()

    osdmap_apply_incremental() @msgr arg

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agorbd: use header_oid instead of header_name
Ilya Dryomov [Fri, 29 Apr 2016 18:01:25 +0000 (20:01 +0200)]
rbd: use header_oid instead of header_name

Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare
const char *.  This reduces noise in rbd_dev_header_name().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: variable-sized ceph_object_id
Ilya Dryomov [Fri, 29 Apr 2016 17:54:20 +0000 (19:54 +0200)]
libceph: variable-sized ceph_object_id

Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters.  This is enough for all use cases,
except one - long rbd image names:

- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"

We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh.  (A format 2 header name is
a small system-generated string.)

Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string.  Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
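
One way to let an object id point at either a small inline buffer or an externally allocated string is sketched below. The struct layout, the inline capacity of 52 bytes, and the helper names are assumptions made for illustration, not the libceph definitions; oid_set_name() expects a freshly initialized (or destroyed) oid.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_OID_INLINE_LEN 52   /* hypothetical inline capacity */

struct sketch_oid {
    char *name;                               /* inline_name or heap memory */
    char inline_name[SKETCH_OID_INLINE_LEN];
    int name_len;
};

/* Start out empty, pointing at the inline buffer. */
static void oid_init(struct sketch_oid *oid)
{
    oid->name = oid->inline_name;
    oid->inline_name[0] = '\0';
    oid->name_len = 0;
}

/* Store a name; fall back to the heap only when it does not fit inline. */
static int oid_set_name(struct sketch_oid *oid, const char *src)
{
    size_t len = strlen(src);
    char *dst = oid->inline_name;

    if (len >= sizeof(oid->inline_name)) {
        dst = malloc(len + 1);
        if (!dst)
            return -1;
    }
    memcpy(dst, src, len + 1);
    oid->name = dst;
    oid->name_len = (int)len;
    return 0;
}

/* Release any external string and return to the empty state. */
static void oid_destroy(struct sketch_oid *oid)
{
    if (oid->name != oid->inline_name)
        free(oid->name);
    oid_init(oid);
}
```

Short names cost no allocation, while long names such as those derived from long image names are no longer capped by a fixed buffer.
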
8 years agolibceph: change how osd_op_reply message size is calculated
Ilya Dryomov [Wed, 27 Apr 2016 16:32:56 +0000 (18:32 +0200)]
libceph: change how osd_op_reply message size is calculated

For a message pool message, preallocate a page, just like we do for
osd_op.  For a normal message, take ceph_object_id into account and
don't bother subtracting CEPH_OSD_SLAB_OPS ceph_osd_ops.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: move message allocation out of ceph_osdc_alloc_request()
Ilya Dryomov [Wed, 27 Apr 2016 12:15:51 +0000 (14:15 +0200)]
libceph: move message allocation out of ceph_osdc_alloc_request()

The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed.  Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:

    req = ceph_osdc_alloc_request(...);
    <fill in req->r_base_oid>
    <fill in req->r_base_oloc>
    ceph_osdc_alloc_messages(req);

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: grab snapc in ceph_osdc_alloc_request()
Ilya Dryomov [Tue, 26 Apr 2016 13:39:47 +0000 (15:39 +0200)]
libceph: grab snapc in ceph_osdc_alloc_request()

ceph_osdc_build_request() is going away.  Grab snapc and initialize
->r_snapid in ceph_osdc_alloc_request().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agolibceph: make ceph_osdc_put_request() accept NULL
Ilya Dryomov [Tue, 26 Apr 2016 13:05:29 +0000 (15:05 +0200)]
libceph: make ceph_osdc_put_request() accept NULL

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agorbd: get/put img_request in rbd_img_request_submit()
Ilya Dryomov [Mon, 16 May 2016 11:18:57 +0000 (13:18 +0200)]
rbd: get/put img_request in rbd_img_request_submit()

By the time we get to checking for_each_obj_request_safe(img_request)
terminating condition, all obj_requests may be complete and img_request
ref, that rbd_img_request_submit() takes away from its caller, may be
put.  Moving the next_obj_request cursor is then a use-after-free on
img_request.

It's totally benign, as the value that's read is never used, but
I think it's still worth fixing.

Cc: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
8 years agoLinux 4.6
Linus Torvalds [Sun, 15 May 2016 22:43:13 +0000 (15:43 -0700)]
Linux 4.6

8 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 15 May 2016 15:07:35 +0000 (08:07 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fix from Thomas Gleixner:
 "Just the missing compat entry for the new pread/writev2"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Use compat version for preadv2 and pwritev2

8 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Sat, 14 May 2016 21:15:06 +0000 (14:15 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Fix mvneta/bm dependencies, from Arnd Bergmann.

 2) RX completion hw bug workaround in bnxt_en, from Michael Chan.

 3) Kernel pointer leak in nf_conntrack, from Linus.

 4) Hoplimit route attribute limits not enforced properly, from Paolo
    Abeni.

 5) qlcnic driver NULL deref fix from Dan Carpenter.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  arm64: bpf: jit JMP_JSET_{X,K}
  net/route: enforce hoplimit max value
  nf_conntrack: avoid kernel pointer value leak in slab name
  drivers: net: xgene: fix register offset
  drivers: net: xgene: fix statistics counters race condition
  drivers: net: xgene: fix ununiform latency across queues
  drivers: net: xgene: fix sharing of irqs
  drivers: net: xgene: fix IPv4 forward crash
  xen-netback: fix extra_info handling in xenvif_tx_err()
  net: mvneta: bm: fix dependencies again
  bnxt_en: Add workaround to detect bad opaque in rx completion (part 2)
  bnxt_en: Add workaround to detect bad opaque in rx completion (part 1)
  qlcnic: potential NULL dereference in qlcnic_83xx_get_minidump_template()

8 years agoarm64: bpf: jit JMP_JSET_{X,K}
Zi Shen Lim [Fri, 13 May 2016 06:37:58 +0000 (23:37 -0700)]
arm64: bpf: jit JMP_JSET_{X,K}

Original implementation commit e54bcde3d69d ("arm64: eBPF JIT compiler")
had the relevant code paths, but due to an oversight always failed jiting.

As a result, we had been falling back to BPF interpreter whenever a BPF
program has JMP_JSET_{X,K} instructions.

With this fix, we confirm that the corresponding tests in lib/test_bpf
continue to pass, and are also jited.

...
[    2.784553] test_bpf: #30 JSET jited:1 188 192 197 PASS
[    2.791373] test_bpf: #31 tcpdump port 22 jited:1 325 677 625 PASS
[    2.808800] test_bpf: #32 tcpdump complex jited:1 323 731 991 PASS
...
[    3.190759] test_bpf: #237 JMP_JSET_K: if (0x3 & 0x2) return 1 jited:1 110 PASS
[    3.192524] test_bpf: #238 JMP_JSET_K: if (0x3 & 0xffffffff) return 1 jited:1 98 PASS
[    3.211014] test_bpf: #249 JMP_JSET_X: if (0x3 & 0x2) return 1 jited:1 120 PASS
[    3.212973] test_bpf: #250 JMP_JSET_X: if (0x3 & 0xffffffff) return 1 jited:1 89 PASS
...
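
For readers unfamiliar with the instruction, the following is a minimal userspace sketch of what a JSET branch tests; it is not the arm64 JIT code. The helper names `jset_x` and `jset_k` are hypothetical. The sketch assumes the usual eBPF convention that the 32-bit immediate of the K form is sign-extended to 64 bits before the AND.

```c
#include <stdint.h>

/* JMP_JSET_X: the branch is taken when (dst & src) is non-zero. */
static int jset_x(uint64_t dst, uint64_t src)
{
    return (dst & src) != 0;
}

/* JMP_JSET_K: same test against a 32-bit immediate, which is
 * sign-extended to 64 bits first (assumption stated above). */
static int jset_k(uint64_t dst, int32_t imm)
{
    return (dst & (uint64_t)(int64_t)imm) != 0;
}
```

With these helpers, `jset_x(0x3, 0x2)` mirrors the "if (0x3 & 0x2) return 1" test case named in the log.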

Fixes: e54bcde3d69d ("arm64: eBPF JIT compiler")
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonet/route: enforce hoplimit max value
Paolo Abeni [Fri, 13 May 2016 16:33:41 +0000 (18:33 +0200)]
net/route: enforce hoplimit max value

Currently, when creating or updating a route, no check is performed
on the hoplimit value in either the ipv4 or the ipv6 code.

The caller can, e.g., set hoplimit to 256, and when such a route is
used, packets will be sent with hoplimit/ttl equal to 0.

This commit adds checks for the RTAX_HOPLIMIT value in both the ipv4
and ipv6 route code, substituting any value greater than 255 with 255.

This is consistent with what is currently done for ADVMSS and MTU
in the ipv4 code.
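
The effect can be illustrated with a small userspace sketch. This is not the kernel change itself; the helper names are hypothetical. It only shows why a value of 256 ends up as 0 once it lands in an 8-bit field, and how capping at 255 avoids that.

```c
#include <stdint.h>

/* Hypothetical helper: cap a requested hop limit at 255 before it is
 * stored in an 8-bit TTL / hop-limit field. */
static uint32_t clamp_hoplimit(uint32_t requested)
{
    return requested > 255 ? 255 : requested;
}

/* Without the cap, the value is simply truncated to 8 bits. */
static uint8_t store_unclamped(uint32_t requested)
{
    return (uint8_t)requested;
}

/* With the cap, oversized values saturate instead of wrapping. */
static uint8_t store_clamped(uint32_t requested)
{
    return (uint8_t)clamp_hoplimit(requested);
}
```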

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agonf_conntrack: avoid kernel pointer value leak in slab name
Linus Torvalds [Sat, 14 May 2016 18:11:44 +0000 (11:11 -0700)]
nf_conntrack: avoid kernel pointer value leak in slab name

The slab name ends up being visible in the directory structure under
/sys, and even if you don't have access rights to the file you can see
the filenames.

Just use a 64-bit counter instead of the pointer to the 'net' structure
to generate a unique name.

This code will go away in 4.7 when the conntrack code moves to a single
kmemcache, but this is the backportable simple solution to avoiding
leaking kernel pointers to user space.
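
As a rough userspace illustration of the idea (not the actual patch), a name can be made unique from a monotonically increasing 64-bit counter instead of an address. The function name, the name prefix and the plain non-atomic counter are assumptions made for this sketch; kernel code would need an atomic counter.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counter; single-threaded sketch only. */
static uint64_t next_cache_id;

/* Write a unique name that carries no address information.
 * Returns what snprintf() returns: the length the full name needs. */
static int make_cache_name(char *buf, size_t len)
{
    return snprintf(buf, len, "nf_conntrack_%" PRIu64, next_cache_id++);
}
```

Two successive calls yield different names, and nothing in either name reveals where any structure lives in memory.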

Fixes: 5b3501faa874 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sat, 14 May 2016 18:59:43 +0000 (11:59 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
 "Overlayfs fixes from Miklos, assorted fixes from me.

  Stable fodder of varying severity, all sat in -next for a while"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ovl: ignore permissions on underlying lookup
  vfs: add lookup_hash() helper
  vfs: rename: check backing inode being equal
  vfs: add vfs_select_inode() helper
  get_rock_ridge_filename(): handle malformed NM entries
  ecryptfs: fix handling of directory opening
  atomic_open(): fix the handling of create_error
  fix the copy vs. map logics in blk_rq_map_user_iov()
  do_splice_to(): cap the size before passing to ->splice_read()

8 years agoMerge branch 'xgene-fixes'
David S. Miller [Sat, 14 May 2016 01:12:07 +0000 (21:12 -0400)]
Merge branch 'xgene-fixes'

Iyappan Subramanian says:

====================
drivers: net: xgene: Bug fixes

This patch set addresses the following bug fixes that were found during testing.

  1. IPv4 forward test crash
    - drivers: net: xgene: fix IPv4 forward crash

  2. Sharing of irqs
    - drivers: net: xgene: fix sharing of irqs

  3. Ununiform latency across queues
    - drivers: net: xgene: fix ununiform latency across queues

  4. Fix statistics counters race condition
    - drivers: net: xgene: fix statistics counters race condition

  5. Correcting register offset and field lengths
    - drivers: net: xgene: fix register offset

v2: Address review comments from v1
- Defer TSO fix, and reposting all other patches from v1

v1:
- Initial version
====================

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agodrivers: net: xgene: fix register offset
Iyappan Subramanian [Fri, 13 May 2016 23:53:01 +0000 (16:53 -0700)]
drivers: net: xgene: fix register offset

This patch fixes SG_RX_DV_GATE_REG_0_ADDR register offset
and ring state field lengths.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Tested-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agodrivers: net: xgene: fix statistics counters race condition
Iyappan Subramanian [Fri, 13 May 2016 23:53:00 +0000 (16:53 -0700)]
drivers: net: xgene: fix statistics counters race condition

This patch fixes the race condition on updating the statistics
counters by moving the counters to the ring structure.
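
A minimal sketch of the general pattern follows, under the assumption (not stated in the commit) that each ring is updated from a single context, so per-ring counters have only one writer and totals are computed by summing on read. All type and function names here are hypothetical and do not come from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-ring counters. */
struct ring_stats {
    uint64_t packets;
    uint64_t bytes;
};

struct ring {
    struct ring_stats stats;
};

/* Called only from the context that owns this ring (assumption). */
static void ring_account(struct ring *r, uint64_t len)
{
    r->stats.packets++;
    r->stats.bytes += len;
}

/* Readers build device-wide totals by summing every ring. */
static struct ring_stats sum_stats(const struct ring *rings, size_t n)
{
    struct ring_stats total = { 0, 0 };
    size_t i;

    for (i = 0; i < n; i++) {
        total.packets += rings[i].stats.packets;
        total.bytes += rings[i].stats.bytes;
    }
    return total;
}
```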

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Tested-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agodrivers: net: xgene: fix ununiform latency across queues
Iyappan Subramanian [Fri, 13 May 2016 23:52:59 +0000 (16:52 -0700)]
drivers: net: xgene: fix ununiform latency across queues

This patch addresses non-uniform latency across queues by adding
more queues, up to the number of CPU cores.

Also, the number of interrupts is increased and the channel numbers
are reordered.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Tested-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agodrivers: net: xgene: fix sharing of irqs
Iyappan Subramanian [Fri, 13 May 2016 23:52:58 +0000 (16:52 -0700)]
drivers: net: xgene: fix sharing of irqs

Since the hardware doesn't allow sharing of interrupts,
this patch removes the IRQF_SHARED flag.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Tested-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agodrivers: net: xgene: fix IPv4 forward crash
Iyappan Subramanian [Fri, 13 May 2016 23:52:57 +0000 (16:52 -0700)]
drivers: net: xgene: fix IPv4 forward crash

This patch fixes the crash observed during IPv4 forward test by
setting the drop field in the dbptr.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Tested-by: Toan Le <toanle@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj...
Linus Torvalds [Fri, 13 May 2016 23:26:46 +0000 (16:26 -0700)]
Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "During v4.6-rc1 cgroup namespace support was merged.  There is an
  issue where it's impossible to tell whether a given cgroup mount point
  is bind mounted or namespaced.  Serge has been working on the issue
  but it took longer than expected to resolve, so the late pull request.

  Given that it's a completely new feature and the patches don't touch
  anything else, the risk seems acceptable.  However, if this is too
  late, an alternative is plugging new cgroup ns creation for v4.6 and
  retrying for v4.7"

* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix compile warning
  kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
  cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
  kernfs_path_from_node_locked: don't overwrite nlen

8 years agoMerge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Linus Torvalds [Fri, 13 May 2016 23:16:51 +0000 (16:16 -0700)]
Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue fix from Tejun Heo:
 "CPU hotplug callbacks can invoke DOWN_FAILED w/o preceding
  DOWN_PREPARE which can trigger a WARN_ON() in workqueue.

  The bug has been there for a very long time.  It only triggers if CPU
  down fails at a specific point and I don't think it has adverse
  effects other than the warning messages.  The fix is very low impact"

* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: fix rebind bound workers warning

8 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 13 May 2016 19:21:17 +0000 (12:21 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "This is a revert to fix an interactivity problem.

  The proper fixes for the problems that the reverted commit exposed are
  now in sched/core (consisting of 3 patches), but were too risky for
  v4.6 and will arrive in the v4.7 merge window"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  Revert "sched/fair: Fix fairness issue on migration"

8 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 13 May 2016 18:54:02 +0000 (11:54 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Ingo Molnar:
 "An uncharacteristically large number of bugs popped up in the last
  week:

   - various tooling fixes, two crashes and build problems
   - two Intel PT fixes
   - a KNL uncore driver fix
   - an Intel PMU driver fix"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf stat: Fallback to user only counters when perf_event_paranoid > 1
  perf evsel: Handle EACCESS + perf_event_paranoid=2 in fallback()
  perf evsel: Improve EPERM error handling in open_strerror()
  tools lib traceevent: Do not reassign parg after collapse_tree()
  perf probe: Check if dwarf_getlocations() is available
  perf dwarf: Guard !x86_64 definitions under #ifdef else clause
  perf tools: Use readdir() instead of deprecated readdir_r()
  perf thread_map: Use readdir() instead of deprecated readdir_r()
  perf script: Use readdir() instead of deprecated readdir_r()
  perf tools: Use readdir() instead of deprecated readdir_r()
  perf/core: Disable the event on a truncated AUX record
  perf/x86/intel/pt: Generate PMI in the STOP region as well
  perf/x86: Fix undefined shift on 32-bit kernels
  perf/x86/msr: Fix SMI overflow
  perf/x86/intel/uncore: Fix CHA registers configuration procedure for Knights Landing platform
  perf diff: Fix duplicated output column

8 years agoMerge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm...
Linus Torvalds [Fri, 13 May 2016 16:52:00 +0000 (09:52 -0700)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc

Pull ARM SoC fixes from Arnd Bergmann:
 "Three more bug fixes for ARM SoCs this week:

   - The Atmel sama5d2 was registering the wrong NFC device type

   - On Atmel sam9x5, the power management controller had an incorrect
     register area size

   - On ARM64, the Allwinner machine was not selecting the generic irqchip
     code, causing build errors in some configurations"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
  ARM: dts: at91: sam9x5: Fix the memory range assigned to the PMC
  arm64/sunxi: 4.6-rc1: Add dependency on generic irq chip
  ARM: dts: at91: sama5d2: use "atmel,sama5d3-nfc" compatible for nfc

8 years agoMerge tag 'regulator-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 13 May 2016 16:46:00 +0000 (09:46 -0700)]
Merge tag 'regulator-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
 "A small collection of driver specific fixes for the regulator
  subsystem:

   - Fix handling of probe deferral for GPIO regulators

   - Fix a typo in the module alias for DA9063

   - Fix the definition of BUCK9 in the S2MPS11 driver.  This change
     looks larger than it is because an irregularity in the hardware
     means that the macro used to define bucks 6-10 needs duplicating
     and tweaking to have a separate macro for 9

   - Fix a series of errors in the definitions of the LDOs of the AXP20x
     regulators, some of which had always been present and some of which
     were introduced in the merge window"

* tag 'regulator-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: da9063: Correct module alias prefix to fix module autoloading
  regulator: axp20x: Fix axp22x ldo_io registration error on cold boot
  regulator: axp20x: Fix axp22x ldo_io voltage ranges
  regulator: axp20x: Fix LDO4 linear voltage range
  regulator: s2mps11: Fix invalid selector mask and voltages for buck9
  regulator: gpio: check return value of of_get_named_gpio

8 years agoMerge tag 'regmap-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 13 May 2016 16:40:32 +0000 (09:40 -0700)]
Merge tag 'regmap-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap

Pull regmap fixes from Mark Brown:
 "This is rather too late so it'd be completely understandable if you
  don't want to pull it at this point, I had thought I'd sent this
  earlier but it seems I didn't.  Everything has been in -next for some
  time now.

  The main set of fixes here are mopping up some more issues with MMIO,
  fixing handling of endianness configuration in DT (which just wasn't
  working at all) and cases where the register and value endianness are
  different.

  There is also a fix for bulk register reads on SPMI"

* tag 'regmap-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
  regmap: spmi: Fix regmap_spmi_ext_read in multi-byte case
  regmap: mmio: Explicitly say little endian is the defualt in the bus config
  regmap: mmio: Parse endianness definitions from DT
  regmap: Fix implicit inclusion of device.h
  regmap: mmio: Fix value endianness selection
  regmap: fix documentation to match code

8 years agoMerge tag 'media/v4.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Fri, 13 May 2016 16:34:59 +0000 (09:34 -0700)]
Merge tag 'media/v4.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fix from Mauro Carvalho Chehab:
 "A revert fixing a breakage that caused an OOPS on all VB2-based DVB
  drivers.

  We already have a proper fix, but it sounds safer to keep it being
  tested for a while and not hurry, to avoid the risk of another
  regression, especially since this is meant to be Cc'ed to stable.  So,
  for now, let's just revert the broken patch"

* tag 'media/v4.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  Revert "[media] videobuf2-v4l2: Verify planes array in buffer dequeueing"

8 years agoMerge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Linus Torvalds [Fri, 13 May 2016 16:27:05 +0000 (09:27 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
 "A bunch of radeon displayport mode setting fixes, and some misc i915
  fixes.

  There is one revert, the MST audio code in i915 was causing some
  oopses, so we've decided just to drop it until next kernel when we can
  fix it properly"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/amdgpu: fix DP mode validation
  drm/radeon: fix DP mode validation
  drm/i915: Bail out of pipe config compute loop on LPT
  drm/radeon: fix PLL sharing on DCE6.1 (v2)
  drm/radeon: fix DP link training issue with second 4K monitor
  Revert "drm/i915: start adding dp mst audio"
  drm/i915/bdw: Add missing delay during L3 SQC credit programming
  drm/i915/lvds: separate border enable readout from panel fitter
  drm/i915: Update CDCLK_FREQ register on BDW after changing cdclk frequency

8 years agoMerge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Fri, 13 May 2016 16:21:31 +0000 (09:21 -0700)]
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
 "This fixes a bug in the RSA self-test that may cause crashes on some
  architectures such as SPARC"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: testmgr - Use kmalloc memory for RSA input

8 years agoMerge remote-tracking branches 'regulator/fix/axp20x', 'regulator/fix/da9063', 'regul...
Mark Brown [Fri, 13 May 2016 10:11:08 +0000 (11:11 +0100)]
Merge remote-tracking branches 'regulator/fix/axp20x', 'regulator/fix/da9063', 'regulator/fix/gpio' and 'regulator/fix/s2mps11' into regulator-linus

8 years agoMerge remote-tracking branches 'regmap/fix/be', 'regmap/fix/doc' and 'regmap/fix...
Mark Brown [Fri, 13 May 2016 09:36:10 +0000 (10:36 +0100)]
Merge remote-tracking branches 'regmap/fix/be', 'regmap/fix/doc' and 'regmap/fix/spmi' into regmap-linus

8 years agoMerge remote-tracking branch 'regmap/fix/mmio' into regmap-linus
Mark Brown [Fri, 13 May 2016 09:36:09 +0000 (10:36 +0100)]
Merge remote-tracking branch 'regmap/fix/mmio' into regmap-linus

8 years agoMerge branch 'drm-fixes-4.6' of git://people.freedesktop.org/~agd5f/linux into drm...
Dave Airlie [Fri, 13 May 2016 06:03:39 +0000 (16:03 +1000)]
Merge branch 'drm-fixes-4.6' of git://people.freedesktop.org/~agd5f/linux into drm-fixes

DP mode validation regression fix.
* 'drm-fixes-4.6' of git://people.freedesktop.org/~agd5f/linux:
  drm/amdgpu: fix DP mode validation
  drm/radeon: fix DP mode validation

8 years agoxen-netback: fix extra_info handling in xenvif_tx_err()
Paul Durrant [Thu, 12 May 2016 13:43:03 +0000 (14:43 +0100)]
xen-netback: fix extra_info handling in xenvif_tx_err()

Patch 562abd39 "xen-netback: support multiple extra info fragments
passed from frontend" contained a mistake which can result in an
incorrect number of responses being generated when handling errors
encountered when processing packets containing extra info fragments.
This patch fixes the problem.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reported-by: Jan Beulich <JBeulich@suse.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
8 years agoMerge tag 'perf-urgent-for-mingo-20160512' of git://git.kernel.org/pub/scm/linux...
Ingo Molnar [Fri, 13 May 2016 05:35:12 +0000 (07:35 +0200)]
Merge tag 'perf-urgent-for-mingo-20160512' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/urgent fixes from Arnaldo Carvalho de Melo:

- Fallback to usermode-only counters when perf_event_paranoid > 1, which
  is the case now (Arnaldo Carvalho de Melo)

- Do not reassign parg after collapse_tree() in libtraceevent, which
  may cause tool crashes (Steven Rostedt)

- Fix the build on Fedora Rawhide, where readdir_r() is deprecated and
  also wrt -Werror=unused-const-variable= + x86_32_regoffset_table on
  !x86_64 (Arnaldo Carvalho de Melo)

- Fix the build on Ubuntu 12.04.5, where dwarf_getlocations() isn't
  available, i.e. libdw-dev < 0.157 (Arnaldo Carvalho de Melo)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
8 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Fri, 13 May 2016 01:44:24 +0000 (18:44 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge fixes from Andrew Morton:
 "4 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: thp: calculate the mapcount correctly for THP pages during WP faults
  ksm: fix conflict between mmput and scan_get_next_rmap_item
  ocfs2: fix posix_acl_create deadlock
  ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang

8 years agomm: thp: calculate the mapcount correctly for THP pages during WP faults
Andrea Arcangeli [Thu, 12 May 2016 22:42:25 +0000 (15:42 -0700)]
mm: thp: calculate the mapcount correctly for THP pages during WP faults

This will provide full accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the fully accurate return value for page_mapcount()
when dealing with Transparent Hugepages. However, we only use
page_trans_huge_mapcount() during COW faults, where it is strictly needed,
due to its higher runtime cost.

This also provides, at practically zero cost, the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.

Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.

Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.
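
The hazard described here is the usual double evaluation of a macro argument. The following is a generic sketch with hypothetical names, not the kernel's reuse_swap_page(): one macro names its argument twice, the other passes it through exactly once.

```c
/* Hypothetical predicates standing in for per-page checks. */
static int is_positive(int v) { return v > 0; }
static int is_small(int v)    { return v < 100; }

/* Names its argument twice: with CHECK_TWICE(p++), the second check
 * runs on the next element and p is advanced two times. */
#define CHECK_TWICE(p) (is_positive(*(p)) && is_small(*(p)))

/* Helper that receives the already-evaluated argument. */
static int check_impl(const int *p)
{
    return is_positive(*p) && is_small(*p);
}

/* Still a macro, but the argument appears only once, so a side
 * effect such as p++ happens exactly once. */
#define CHECK_ONCE(p) check_impl(p)
```

Given an array `{1, 500, 7}` and a pointer to its start, `CHECK_TWICE(p++)` tests 1 and then 500 and leaves the pointer two elements further on, while `CHECK_ONCE(p++)` tests only the first element and advances the pointer by one.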

Dean Luick noticed an uninitialized variable that could result in a
rmap inefficiency for the non-THP case in a previous version.

Mike Marciniszyn said:

: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit 61f5d698cc97 ("mm: re-enable THP")
:
: The test program registers two rather large MRs (512M) and RDMA
: writes data to a passive peer using the first and RDMA reads it back
: into the second MR and compares that data.  The sizes are chosen randomly
: between 0 and 1024 bytes.
:
: The test will get through a few (<= 4 iterations) and then gets a
: compare error.
:
: Tracing indicates the kernel logical addresses associated with the individual
: pages at registration ARE correct, and the data in the "RDMA read response only"
: packets ARE correct.
:
: The "corruption" occurs when the packet crosses two pages that are not physically
: contiguous.  The second page reads back as zero in the program.
:
: It looks like the user VA at the point of the compare error no longer points to
: the same physical address as was registered.
:
: This patch totally resolves the issue!

Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Josh Collier <josh.d.collier@intel.com>
Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
Cc: <stable@vger.kernel.org> [4.5]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8 years agoksm: fix conflict between mmput and scan_get_next_rmap_item
Zhou Chengming [Thu, 12 May 2016 22:42:21 +0000 (15:42 -0700)]
ksm: fix conflict between mmput and scan_get_next_rmap_item

There is a concurrency issue in KSM in the function scan_get_next_rmap_item.

task A (ksmd):                          |task B (the mm's task):
                                        |
mm = slot->mm;                          |
down_read(&mm->mmap_sem);               |
                                        |
...                                     |
                                        |
spin_lock(&ksm_mmlist_lock);            |
                                        |
ksm_scan.mm_slot go to the next slot;   |
                                        |
spin_unlock(&ksm_mmlist_lock);          |
                                        |mmput() ->
                                        |    ksm_exit():
                                        |
                                        |spin_lock(&ksm_mmlist_lock);
                                        |if (mm_slot && ksm_scan.mm_slot != mm_slot) {
                                        |    if (!mm_slot->rmap_list) {
                                        |        easy_to_free = 1;
                                        |        ...
                                        |
                                        |if (easy_to_free) {
                                        |    mmdrop(mm);
                                        |    ...
                                        |
                                        |So this mm_struct may be freed in the mmput().
                                        |
up_read(&mm->mmap_sem);                 |

As we can see above, the ksmd thread may access an mm_struct that has
already been freed to the kmem_cache.  If a fork then gets this mm_struct
from the kmem_cache and the ksmd thread calls up_read(&mm->mmap_sem), it
will cause mmap_sem.count to become -1.

As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has
the same SMP race condition, so fix it too.  My previous fix in function
scan_get_next_rmap_item would introduce a different SMP race condition, so
just invert the up_read/spin_unlock order as Andrea Arcangeli said.

Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Li Bin <huawei.libin@huawei.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>