Matthew Wilcox [Wed, 7 Dec 2016 22:44:33 +0000 (14:44 -0800)]
redo: radix tree test suite: fix compilation
[ This resurrects commit 53855d10f456, which was reverted in 2b41226b39b6. It depended on commit d544abd5ff7d ("lib/radix-tree:
Convert to hotplug state machine") so now it is correct to apply ]
Patch "lib/radix-tree: Convert to hotplug state machine" breaks the test
suite as it adds a call to cpuhp_setup_state_nocalls() which is not
currently emulated in the test suite. Add it, and delete the emulation
of the old CPU hotplug mechanism.
Link: http://lkml.kernel.org/r/1480369871-5271-36-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/printk/printk.c:1893: warning: ‘cont’ defined but not used
Note that there are actually two different struct cont definitions and
objects: the first one is used if CONFIG_PRINTK=y, the second one became
unused by removing console_cont_flush().
Fixes: 5c2992ee7fd8 ("printk: remove console flushing special cases for partial buffered lines") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Petr Mladek <pmladek@suse.com>
[ I do the occasional "allnoconfig" builds, but apparently not often
enough - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
driver-api: create an edac.rst file with EDAC documentation
Currently, there's no device driver documentation for the EDAC
subsystem at the driver-api book. Fill in the blanks for the
structures and functions that misses documentation, uniform
the word on the existing ones, and add a new edac.rst file at
driver-api, in order to document the EDAC subsystem.
The edac.txt assumes that the reader has already deep knowledge
on RAS features. However, this may not be the case. So, add an
introduction chapter explaining the main concepts that are used by
the EDAC subsystem and by other RAS drivers within the Kernel.
edac.txt: update information about newer Intel CPUs
There's a chapter at edac.rst written by the time Nehalem
support was added. Such information is used not only by the
Nehalem driver (i7core_edac), but by all newer Intel CPU
architectures that are supported by i7core_edac, sb_edac
and sbx_edac drivers.
edac.txt: remove info that the Nehalem EDAC is experimental
This driver has been there for almost 3 years, without any
conceptual changes. So, it is not experimental anymore, and
won't likely have any changes at the API or on log outputs.
Converts the EDAC driver subsystem documentation to ReST:
- Put paragraph titles in lower case;
- Add code blocks where needed;
- Convert tables to ReST markup;
- Mark filesystem and module names as verbatim;
- Adjust document to be properly displayed in html.
edac.txt: add a section explaining the dimmX and rankX directories
Documentation for those are missing at the EDAC description.
I guess we end by moving such descriptions in the past to the
ABI document (or only added it there), but it means that the
EDAC documentation is incomplete. So, add it there.
* patchwork: (496 commits)
[media] v4l: tvp5150: Add missing break in set control handler
[media] v4l: tvp5150: Don't inline the tvp5150_selmux() function
[media] v4l: tvp5150: Compile tvp5150_link_setup out if !CONFIG_MEDIA_CONTROLLER
[media] em28xx: don't store usb_device at struct em28xx
[media] em28xx: use usb_interface for dev_foo() calls
[media] em28xx: don't change the device's name
[media] mn88472: fix chip id check on probe
[media] mn88473: fix chip id check on probe
[media] lirc: fix error paths in lirc_cdev_add()
[media] s5p-mfc: Add support for MFC v8 available in Exynos 5433 SoCs
[media] s5p-mfc: Rework clock handling
[media] s5p-mfc: Don't keep clock prepared all the time
[media] s5p-mfc: Kill all IS_ERR_OR_NULL in clocks management code
[media] s5p-mfc: Remove dead conditional code
[media] s5p-mfc: Ensure that clock is disabled before turning power off
[media] s5p-mfc: Remove special clock rate management
[media] s5p-mfc: Use printk_ratelimited for reporting ioctl errors
[media] s5p-mfc: Set DMA_ATTR_ALLOC_SINGLE_PAGES
[media] vivid: Set color_enc on HSV formats
[media] v4l2-tpg: Init hv_enc field with a valid value
...
Linus Torvalds [Thu, 15 Dec 2016 05:35:31 +0000 (21:35 -0800)]
Merge tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"There is quite a varied bunch of stuff in this update, and some of it
you will have already merged through the ext4 tree which imported the
dax-4.10-iomap-pmd topic branch from the XFS tree.
There is also a new direct IO implementation that uses the iomap
infrastructure. It's much simpler, faster, and has lower IO latency
than the existing direct IO infrastructure.
Summary:
- DAX PMD faults via iomap infrastructure
- Direct-io support in iomap infrastructure
- removal of now-redundant XFS inode iolock, replaced with VFS
i_rwsem
- synchronisation with fixes and changes in userspace libxfs code
- extent tree lookup helpers
- lots of little corruption detection improvements to verifiers
- optimised CRC calculations
- faster buffer cache lookups
- deprecation of barrier/nobarrier mount options - we always use
REQ_FUA/REQ_FLUSH where appropriate for data integrity now
- cleanups to speculative preallocation
- miscellaneous minor bug fixes and cleanups"
* tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
xfs: nuke unused tracepoint definitions
xfs: use GPF_NOFS when allocating btree cursors
xfs: use xfs_vn_setattr_size to check on new size
xfs: deprecate barrier/nobarrier mount option
xfs: Always flush caches when integrity is required
xfs: ignore leaf attr ichdr.count in verifier during log replay
xfs: use rhashtable to track buffer cache
xfs: optimise CRC updates
xfs: make xfs btree stats less huge
xfs: don't cap maximum dedupe request length
xfs: don't allow di_size with high bit set
xfs: error out if trying to add attrs and anextents > 0
xfs: don't crash if reading a directory results in an unexpected hole
xfs: complain if we don't get nextents bmap records
xfs: check for bogus values in btree block headers
xfs: forbid AG btrees with level == 0
xfs: several xattr functions can be void
xfs: handle cow fork in xfs_bmap_trace_exlist
xfs: pass state not whichfork to trace_xfs_extlist
xfs: Move AGI buffer type setting to xfs_read_agi
...
Linus Torvalds [Tue, 25 Oct 2016 18:27:31 +0000 (11:27 -0700)]
printk: remove console flushing special cases for partial buffered lines
It actively hurts proper merging, and makes for a lot of special cases.
There was a good(ish) reason for doing it originally, but it's getting
too painful to maintain. And most of the original reasons for it are
long gone.
So instead of having special code to flush partial lines to the console
(as opposed to the record buffers), do _all_ the console writing from
the record buffer, and be done with it.
If an oops happens (or some other synchronous event), we will flush the
partial lines due to the oops printing activity, so this does not affect
that. It does mean that if you have a completely hung machine, a
partial preceding line may not have been printed out.
That was some of the original reason for this complexity, in fact, back
when we used to test for the historical i386 "halt" instruction problem
by doing
and that model no longer works (it the 'hlt' instruction kills the
machine, the partial line won't have been flushed, so you won't even see
it).
Of course, that was also back in the days when people actually had
textual console output rather than a graphical splash-screen at bootup.
How times change..
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Steven Rostedt <rostedt@goodmis.org> Tested-by: Petr Mladek <pmladek@suse.com> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 25 Oct 2016 18:27:31 +0000 (11:27 -0700)]
printk: remove games with previous record flags
The record logging code looks at the previous record flags in various
ways, and they are all wrong.
You can't use the previous record flags to determine anything about the
next record, because they may simply not be related. In particular, the
reason the previous record was a continuation record may well be exactly
_because_ the new record was printed by a different process, which is
why the previous record was flushed.
So all those games are simply wrong, and make the code hard to
understand (because the code fundamentally cdoes not make sense).
vhost_umem_interval_tree is only used locally within vhost.c, mark it
static. As some functions generated go unused, this triggers warnings
unless we also mark it inline.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
virtio_gpu_queue_ctrl_buffer_locked is called with ctrlq.qlock taken, it
releases and acquires this lock. This causes a sparse warning. Add
appropriate annotations for sparse context checking.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
When virtio_gpu_free_vbufs exits due to list empty, it does not
drop the free_vbufs lock that it took.
list empty is not expected to happen anyway, but it can't hurt to fix
this and drop the lock.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
struct ports_device includes a config field including the whole
virtio_console_config, but only max_nr_ports in there is ever updated or
used. The rest is unused and in fact does not even mirror the
device config. Drop everything except max_nr_ports,
saving some memory.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com>
Linus Torvalds [Thu, 15 Dec 2016 04:54:33 +0000 (20:54 -0800)]
Merge tag 'for-linus-4.10' of git://git.code.sf.net/p/openipmi/linux-ipmi
Pull IPMI updates from Corey Minyard:
"Various small fixes for IPMI. Cleanups in the documentation and
convertion printk() to pr_xxx() and removal of an unused module
parameter. Some small bug fixes and enhancements.
This also adds a post softdep from the IPMI core module to the IPMI
device interface. Many people have complained that the device
interface isn't automatically avaiable when IPMI is loaded. I don't
want to make the device interface mandatory, though, plenty of people
use IPMI internally (like with ACPI) and don't need a device interface
or the added possible security entry. A softdep should make it work
'out of the box' but allow people to not have it if they don't want
it"
* tag 'for-linus-4.10' of git://git.code.sf.net/p/openipmi/linux-ipmi:
ipmi: create hardware-independent softdep for ipmi_devintf
ipmi: Fix sequence number handling
ipmi: Pick up slave address from SMBIOS on an ACPI device
ipmi_si: Clean up printks
Move platform device creation earlier in the initialization
ipmi: Update documentation
ipmi_ssif: Remove an unused module parameter
ipmi: Periodically check for events, not messages
Logfs was introduced to the kernel in 2009, and hasn't seen any non
drive-by changes since 2012, while having lots of unsolved issues
including the complete lack of error handling, with more and more
issues popping up without any fixes.
The logfs.org domain has been bouncing from a mail, and the maintainer
on the non-logfs.org domain hasn't repsonded to past queries either.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Thu, 15 Dec 2016 04:42:45 +0000 (20:42 -0800)]
Merge tag 'dmaengine-4.10-rc1' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine updates from Vinod Koul:
"Fairly routine update this time around with all changes specific to
drivers:
- New driver for STMicroelectronics FDMA
- Memory-to-memory transfers on dw dmac
- Support for slave maps on pl08x devices
- Bunch of driver fixes to use dma_pool_zalloc
- Bunch of compile and warning fixes spread across drivers"
[ The ST FDMA driver already came in earlier through the remoteproc tree ]
* tag 'dmaengine-4.10-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (68 commits)
dmaengine: sirf-dma: remove unused ‘sdesc’
dmaengine: pl330: remove unused ‘regs’
dmaengine: s3c24xx: remove unused ‘cdata’
dmaengine: stm32-dma: remove unused ‘src_addr’
dmaengine: stm32-dma: remove unused ‘dst_addr’
dmaengine: stm32-dma: remove unused ‘sfcr’
dmaengine: pch_dma: remove unused ‘cookie’
dmaengine: mic_x100_dma: remove unused ‘data’
dmaengine: img-mdc: remove unused ‘prev_phys’
dmaengine: usb-dmac: remove unused ‘uchan’
dmaengine: ioat: remove unused ‘res’
dmaengine: ioat: remove unused ‘ioat_dma’
dmaengine: ioat: remove unused ‘is_raid_device’
dmaengine: pl330: do not generate unaligned access
dmaengine: k3dma: move to dma_pool_zalloc
dmaengine: at_hdmac: move to dma_pool_zalloc
dmaengine: at_xdmac: don't restore unsaved status
dmaengine: ioat: set error code on failures
dmaengine: ioat: set error code on failures
dmaengine: DW DMAC: add multi-block property to device tree
...
Gonglei [Tue, 22 Nov 2016 05:51:50 +0000 (13:51 +0800)]
virtio_ring: fix complaint by sparse
# make C=2 CF="-D__CHECK_ENDIAN__" ./drivers/virtio/
drivers/virtio/virtio_ring.c:423:19: warning: incorrect type in assignment (different base types)
drivers/virtio/virtio_ring.c:423:19: expected unsigned int [unsigned] [assigned] i
drivers/virtio/virtio_ring.c:423:19: got restricted __virtio16 [usertype] next
drivers/virtio/virtio_ring.c:423:19: warning: incorrect type in assignment (different base types)
drivers/virtio/virtio_ring.c:423:19: expected unsigned int [unsigned] [assigned] i
drivers/virtio/virtio_ring.c:423:19: got restricted __virtio16 [usertype] next
drivers/virtio/virtio_ring.c:423:19: warning: incorrect type in assignment (different base types)
drivers/virtio/virtio_ring.c:423:19: expected unsigned int [unsigned] [assigned] i
drivers/virtio/virtio_ring.c:423:19: got restricted __virtio16 [usertype] next
drivers/virtio/virtio_ring.c:604:39: warning: incorrect type in initializer (different base types)
drivers/virtio/virtio_ring.c:604:39: expected unsigned short [unsigned] [usertype] nextflag
drivers/virtio/virtio_ring.c:604:39: got restricted __virtio16
drivers/virtio/virtio_ring.c:612:33: warning: restricted __virtio16 degrades to integer
Signed-off-by: Gonglei <arei.gonglei@huawei.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Linus Torvalds [Thu, 15 Dec 2016 04:12:43 +0000 (20:12 -0800)]
Merge tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
"Summary of modules changes for the 4.10 merge window:
- The rodata= cmdline parameter has been extended to additionally
apply to module mappings
- Fix a hard to hit race between module loader error/clean up
handling and ftrace registration
- Some code cleanups, notably panic.c and modules code use a unified
taint_flags table now. This is much cleaner than duplicating the
taint flag code in modules.c"
* tag 'modules-for-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: fix DEBUG_SET_MODULE_RONX typo
module: extend 'rodata=off' boot cmdline parameter to module mappings
module: Fix a comment above strong_try_module_get()
module: When modifying a module's text ignore modules which are going away too
module: Ensure a module's state is set accordingly during module coming cleanup code
module: remove trailing whitespace
taint/module: Clean up global and module taint flags handling
modpost: free allocated memory
Linus Torvalds [Thu, 15 Dec 2016 01:25:18 +0000 (17:25 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
- a few misc things
- kexec updates
- DMA-mapping updates to better support networking DMA operations
- IPC updates
- various MM changes to improve DAX fault handling
- lots of radix-tree changes, mainly to the test suite. All leading up
to reimplementing the IDA/IDR code to be a wrapper layer over the
radix-tree. However the final trigger-pulling patch is held off for
4.11.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (114 commits)
radix tree test suite: delete unused rcupdate.c
radix tree test suite: add new tag check
radix-tree: ensure counts are initialised
radix tree test suite: cache recently freed objects
radix tree test suite: add some more functionality
idr: reduce the number of bits per level from 8 to 6
rxrpc: abstract away knowledge of IDR internals
tpm: use idr_find(), not idr_find_slowpath()
idr: add ida_is_empty
radix tree test suite: check multiorder iteration
radix-tree: fix replacement for multiorder entries
radix-tree: add radix_tree_split_preload()
radix-tree: add radix_tree_split
radix-tree: add radix_tree_join
radix-tree: delete radix_tree_range_tag_if_tagged()
radix-tree: delete radix_tree_locate_item()
radix-tree: improve multiorder iterators
btrfs: fix race in btrfs_free_dummy_fs_info()
radix-tree: improve dump output
radix-tree: make radix_tree_find_next_bit more useful
...
Linus Torvalds [Thu, 15 Dec 2016 01:21:53 +0000 (17:21 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
"A few fixes that I collected as post-merge.
I was going to wait a bit with sending this out, but the O_DIRECT fix
should really go in sooner rather than later"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: Fix failed allocation path when mapping queues
blk-mq: Avoid memory reclaim when remapping queues
block_dev: don't update file access position for sync direct IO
nvme/pci: Log PCI_STATUS when the controller dies
block_dev: don't test bdev->bd_contains when it is not stable
Linus Torvalds [Thu, 15 Dec 2016 01:09:00 +0000 (17:09 -0800)]
Merge branch 'for-4.10/fs-unmap' of git://git.kernel.dk/linux-block
Pull fs meta data unmap optimization from Jens Axboe:
"A series from Jan Kara, providing a more efficient way for unmapping
meta data from in the buffer cache than doing it block-by-block.
Provide a general helper that existing callers can use"
* 'for-4.10/fs-unmap' of git://git.kernel.dk/linux-block:
fs: Remove unmap_underlying_metadata
fs: Add helper to clean bdev aliases under a bh and use it
ext2: Use clean_bdev_aliases() instead of iteration
ext4: Use clean_bdev_aliases() instead of iteration
direct-io: Use clean_bdev_aliases() instead of handmade iteration
fs: Provide function to unmap metadata for a range of blocks
Linus Torvalds [Thu, 15 Dec 2016 00:30:12 +0000 (16:30 -0800)]
docs: add back 'Documentation/Changes' file (as symlink)
Jaegeuk Kim reports that the debian kernel package build gets confused
by the lack of Documentation/Changes file. We also refer to that path
name in ver_linux and various how-to files and Kconfig files.
The file got renamed away in commit 186128f75392 ("docs-rst: add
documents to development-process"), and as Jaegeuk Kim points out, the
commit message for that change says "use symlinks instead of renames",
but then the commit itself actually does renames after all.
Maybe we should do the other files too, but for now this just adds the
minimal symlink back to the historical name, so that people looking for
Documentation/Changes will actually find what they are looking for, and
the debian scripts continue to work.
Matthew Wilcox [Wed, 14 Dec 2016 23:09:36 +0000 (15:09 -0800)]
radix tree test suite: delete unused rcupdate.c
This file was used to implement call_rcu() before liburcu implemented
that function. It hasn't even been compiled since before the test suite
was added to the kernel. Remove it to reduce confusion.
Link: http://lkml.kernel.org/r/1481667692-14500-5-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:34 +0000 (15:09 -0800)]
radix tree test suite: add new tag check
We have a check that setting a tag on a single entry at root succeeds,
but we were missing a check that clearing a tag on that same entry also
succeeds.
Link: http://lkml.kernel.org/r/1481667692-14500-4-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:31 +0000 (15:09 -0800)]
radix-tree: ensure counts are initialised
radix_tree_join() was freeing nodes with a non-zero ->exceptional count,
and radix_tree_split() wasn't zeroing ->exceptional when it allocated
the new node. Fix this by making all callers of radix_tree_node_alloc()
pass in the new counts (and some other always-initialised fields), which
will prevent the problem recurring if in future we decide to do
something similar.
Link: http://lkml.kernel.org/r/1481667692-14500-3-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:28 +0000 (15:09 -0800)]
radix tree test suite: cache recently freed objects
The kmem_cache_alloc implementation simply allocates new memory from
malloc() and calls the ctor, which zeroes out the entire object. This
means it cannot spot bugs where the object isn't properly reinitialised
before being freed.
Add a small (11 objects) cache before freeing objects back to malloc.
This is enough to let us write a test to catch it, although the memory
allocator is now aware of the structure of the radix tree node, since it
chains free objects through ->private_data (like the percpu cache does).
Link: http://lkml.kernel.org/r/1481667692-14500-2-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:22 +0000 (15:09 -0800)]
idr: reduce the number of bits per level from 8 to 6
In preparation for merging the IDR and radix tree, reduce the fanout at
each level from 256 to 64. If this causes a performance problem then a
bisect will point to this commit, and we'll have a better idea about
what we might do to fix it.
Link: http://lkml.kernel.org/r/1480369871-5271-66-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:19 +0000 (15:09 -0800)]
rxrpc: abstract away knowledge of IDR internals
Add idr_get_cursor() / idr_set_cursor() APIs, and remove the reference
to IDR_SIZE.
Link: http://lkml.kernel.org/r/1480369871-5271-65-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:16 +0000 (15:09 -0800)]
tpm: use idr_find(), not idr_find_slowpath()
idr_find_slowpath() is not intended to be part of the public API, it's
an implementation detail. There's no reason to skip straight to the
slowpath here.
Link: http://lkml.kernel.org/r/1480369871-5271-64-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Peter Huewe <peterhuewe@gmx.de> Cc: Marcel Selhorst <tpmdd@selhorst.net> Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:10 +0000 (15:09 -0800)]
radix tree test suite: check multiorder iteration
The random iteration test only inserts order-0 entries currently.
Update it to insert entries of order between 7 and 0. Also make the
maximum index configurable, make some variables static, make the test
duration variable, remove some useless spinning, and add a fifth thread
which calls tag_tagged_items().
Link: http://lkml.kernel.org/r/1480369871-5271-62-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:07 +0000 (15:09 -0800)]
radix-tree: fix replacement for multiorder entries
When replacing an entry with NULL, we need to delete any sibling
entries. Also account deleting exceptional entries properly. Also fix
a bug with radix_tree_iter_replace() where we would fail to remove
entirely freed nodes. Also fix accounting bug when switching between
normal and exceptional entries with replace_slot. Also add testcases
for all these bugs.
Link: http://lkml.kernel.org/r/1480369871-5271-61-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:04 +0000 (15:09 -0800)]
radix-tree: add radix_tree_split_preload()
Calculate how many nodes we need to allocate to split an old_order entry
into multiple entries, each of size new_order. The test suite checks
that we allocated exactly the right number of nodes; neither too many
(checked by rtp->nr == 0), nor too few (checked by comparing
nr_allocated before and after the call to radix_tree_split()).
Link: http://lkml.kernel.org/r/1480369871-5271-60-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:09:01 +0000 (15:09 -0800)]
radix-tree: add radix_tree_split
This new function splits a larger multiorder entry into smaller entries
(potentially multi-order entries). These entries are initialised to
RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't
confused. The caller should then call radix_tree_for_each_slot() and
radix_tree_replace_slot() in order to turn these retry entries into the
intended new entries. Tags are replicated from the original multiorder
entry into each new entry.
Link: http://lkml.kernel.org/r/1480369871-5271-59-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:58 +0000 (15:08 -0800)]
radix-tree: add radix_tree_join
This new function allows for the replacement of many smaller entries in
the radix tree with one larger multiorder entry. From the point of view
of an RCU walker, they may see a mixture of the smaller entries and the
large entry during the same walk, but they will never see NULL for an
index which was populated before the join.
Link: http://lkml.kernel.org/r/1480369871-5271-58-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an exceptionally complicated function with just one caller
(tag_pages_for_writeback). We devote a large portion of the runtime of
the test suite to testing this one function which has one caller. By
introducing the new function radix_tree_iter_tag_set(), we can eliminate
all of the complexity while keeping the performance. The caller can now
use a fairly standard radix_tree_for_each() loop, and it doesn't need to
worry about tricksy things like 'start' wrapping.
The test suite continues to spend a large amount of time investigating
this function, but now it's testing the underlying primitives such as
radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator
which are also used by other parts of the kernel.
Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:52 +0000 (15:08 -0800)]
radix-tree: delete radix_tree_locate_item()
This rather complicated function can be better implemented as an
iterator. It has only one caller, so move the functionality to the only
place that needs it. Update the test suite to follow the same pattern.
Link: http://lkml.kernel.org/r/1480369871-5271-56-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:49 +0000 (15:08 -0800)]
radix-tree: improve multiorder iterators
This fixes several interlinked problems with the iterators in the
presence of multiorder entries.
1. radix_tree_iter_next() would only advance by one slot, which would
result in the iterators returning the same entry more than once if
there were sibling entries.
2. radix_tree_next_slot() could return an internal pointer instead of
a user pointer if a tagged multiorder entry was immediately followed by
an entry of lower order.
3. radix_tree_next_slot() expanded to a lot more code than it used to
when multiorder support was compiled in. And I wasn't comfortable with
entry_to_node() being in a header file.
Fixing radix_tree_iter_next() for the presence of sibling entries
necessarily involves examining the contents of the radix tree, so we now
need to pass 'slot' to radix_tree_iter_next(), and we need to change the
calling convention so it is called *before* dropping the lock which
protects the tree. Also rename it to radix_tree_iter_resume(), as some
people thought it was necessary to call radix_tree_iter_next() each time
around the loop.
radix_tree_next_slot() becomes closer to how it looked before multiorder
support was introduced. It only checks to see if the next entry in the
chunk is a sibling entry or a pointer to a node; this should be rare
enough that handling this case out of line is not a performance impact
(and such impact is amortised by the fact that the entry we just
processed was a multiorder entry). Also, radix_tree_next_slot() used to
force a new chunk lookup for untagged entries, which is more expensive
than the out of line sibling entry skipping.
Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:46 +0000 (15:08 -0800)]
btrfs: fix race in btrfs_free_dummy_fs_info()
We drop the lock which protects the radix tree, so we must call
radix_tree_iter_next() in order to avoid a modification to the tree
invalidating the iterator state.
Link: http://lkml.kernel.org/r/1480369871-5271-54-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:43 +0000 (15:08 -0800)]
radix-tree: improve dump output
Print the indices of the entries as unsigned (instead of signed)
integers and print the parent node of each entry to help navigate around
larger trees where the layout is not quite so obvious. Print the
indices covered by a node. Rearrange the order of fields printed so the
indices and parents line up for each type of entry.
Link: http://lkml.kernel.org/r/1480369871-5271-53-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:40 +0000 (15:08 -0800)]
radix-tree: make radix_tree_find_next_bit more useful
Since this function is specialised to the radix tree, pass in the node
and tag to calculate the address of the bitmap in
radix_tree_find_next_bit() instead of the caller. Likewise, there is no
need to pass in the size of the bitmap.
Link: http://lkml.kernel.org/r/1480369871-5271-52-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:34 +0000 (15:08 -0800)]
radix-tree: move rcu_head into a union with private_list
I want to be able to reference node->parent after freeing node.
Currently node->parent is in a union with rcu_head, so it is overwritten
when the node is put on the RCU list. We know that private_list is not
referenced after the node is freed, so it is safe for these two members
to share space.
Link: http://lkml.kernel.org/r/1480369871-5271-50-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:31 +0000 (15:08 -0800)]
radix-tree: fix typo
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:26 +0000 (15:08 -0800)]
tools: add more bitmap functions
I need the following functions for the radix tree:
bitmap_fill
bitmap_empty
bitmap_full
Copy the implementations from include/linux/bitmap.h
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:23 +0000 (15:08 -0800)]
radix tree test suite: record order in each item
This probably doubles the size of each item allocated by the test suite
but it lets us check a few more things, and may be needed for upcoming
API changes that require the caller pass in the order of the entry.
Link: http://lkml.kernel.org/r/1480369871-5271-46-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:20 +0000 (15:08 -0800)]
radix tree test suite: handle exceptional entries
item_kill_tree() assumes that everything in the tree is a pointer to a
struct item, which is annoying when testing the behaviour of exceptional
entries. Fix it to delete exceptional entries on the assumption they
don't need to be freed.
Link: http://lkml.kernel.org/r/1480369871-5271-45-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:17 +0000 (15:08 -0800)]
radix tree test suite: use rcu_barrier
Calling rcu_barrier() allows all of the rcu-freed memory to be actually
returned to the pool, and allows nr_allocated to return to 0. As well
as allowing diffs between runs to be more useful, it also lets us
pinpoint leaks more effectively.
Link: http://lkml.kernel.org/r/1480369871-5271-44-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:11 +0000 (15:08 -0800)]
radix tree test suite: iteration test misuses RCU
Each thread needs to register itself with RCU, otherwise the reading
thread's read lock has no effect and the freeing thread will free the
memory in the tree without waiting for the read lock to be dropped.
Link: http://lkml.kernel.org/r/1480369871-5271-42-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:08 +0000 (15:08 -0800)]
radix tree test suite: make runs more reproducible
Instead of reseeding the random number generator every time around the
loop in big_gang_check(), seed it at the beginning of execution. Use
rand_r() and an independent base seed for each thread in
iteration_test() so they don't stomp all over each others state. Since
this particular test depends on the kernel scheduler, the iteration test
can't be reproduced based purely on the random seed, but at least it
won't pollute the other tests.
Print the seed, and allow the seed to be specified so that a run which
hits a problem can be reproduced.
Link: http://lkml.kernel.org/r/1480369871-5271-41-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:05 +0000 (15:08 -0800)]
radix tree test suite: free preallocated nodes
It can be a source of mild concern when the test suite shows that we're
leaking nodes. While poring over the source code looking for leaks can
lead to some fascinating bugs being discovered, sometimes the leak is
simply that these nodes were preallocated and are sitting on the per-CPU
list. Free them by calling the CPU dead callback.
Link: http://lkml.kernel.org/r/1480369871-5271-40-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:08:02 +0000 (15:08 -0800)]
radix tree test suite: track preempt_count
Rather than simply NOP out preempt_enable() and preempt_disable(), keep
track of preempt_count and display it regularly in case either the test
suite or the code under test is forgetting to balance the enables &
disables. Only found a test-case that was forgetting to re-enable
preemption, but it's a possibility worth checking.
Link: http://lkml.kernel.org/r/1480369871-5271-39-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@infradead.org> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Wed, 14 Dec 2016 23:07:59 +0000 (15:07 -0800)]
radix tree test suite: allow GFP_ATOMIC allocations to fail
In order to test the preload code, it is necessary to fail GFP_ATOMIC
allocations, which requires defining GFP_KERNEL and GFP_ATOMIC properly.
Remove the obsolete __GFP_WAIT and copy the definitions of the __GFP
flags which are used from the kernel include files. We also need the
real definition of gfpflags_allow_blocking() to persuade the radix tree
to actually use its preallocated nodes.
Link: http://lkml.kernel.org/r/1480369871-5271-38-git-send-email-mawilcox@linuxonhyperv.com Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:53 +0000 (15:07 -0800)]
dax: clear dirty entry tags on cache flush
Currently we never clear dirty tags in DAX mappings and thus address
ranges to flush accumulate. Now that we have locking of radix tree
entries, we have all the locking necessary to reliably clear the radix
tree dirty tag when flushing caches for corresponding address range.
Similarly to page_mkclean() we also have to write-protect pages to get a
page fault when the page is next written to so that we can mark the
entry dirty again.
Link: http://lkml.kernel.org/r/1479460644-25076-21-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:50 +0000 (15:07 -0800)]
dax: protect PTE modification on WP fault by radix tree entry lock
Currently PTE gets updated in wp_pfn_shared() after dax_pfn_mkwrite()
has released corresponding radix tree entry lock. When we want to
writeprotect PTE on cache flush, we need PTE modification to happen
under radix tree entry lock to ensure consistent updates of PTE and
radix tree (standard faults use page lock to ensure this consistency).
So move update of PTE bit into dax_pfn_mkwrite().
Link: http://lkml.kernel.org/r/1479460644-25076-20-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:47 +0000 (15:07 -0800)]
dax: make cache flushing protected by entry lock
Currently, flushing of caches for DAX mappings was ignoring entry lock.
So far this was ok (modulo a bug that a difference in entry lock could
cause cache flushing to be mistakenly skipped) but in the following
patches we will write-protect PTEs on cache flushing and clear dirty
tags. For that we will need more exclusion. So do cache flushing under
an entry lock. This allows us to remove one lock-unlock pair of
mapping->tree_lock as a bonus.
Link: http://lkml.kernel.org/r/1479460644-25076-19-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:45 +0000 (15:07 -0800)]
mm: export follow_pte()
DAX will need to implement its own version of page_check_address(). To
avoid duplicating page table walking code, export follow_pte() which
does what we need.
Link: http://lkml.kernel.org/r/1479460644-25076-18-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:42 +0000 (15:07 -0800)]
mm: change return values of finish_mkwrite_fault()
Currently finish_mkwrite_fault() returns 0 when PTE got changed before
we acquired PTE lock and VM_FAULT_WRITE when we succeeded in modifying
the PTE. This is somewhat confusing since 0 generally means success, it
is also inconsistent with finish_fault() which returns 0 on success.
Change finish_mkwrite_fault() to return 0 on success and VM_FAULT_NOPAGE
when PTE changed. Practically, there should be no behavioral difference
since we bail out from the fault the same way regardless whether we
return 0, VM_FAULT_NOPAGE, or VM_FAULT_WRITE. Also note that
VM_FAULT_WRITE has no effect for shared mappings since the only two
places that check it - KSM and GUP - care about private mappings only.
Generally the meaning of VM_FAULT_WRITE for shared mappings is not well
defined and we should probably clean that up.
Link: http://lkml.kernel.org/r/1479460644-25076-17-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:39 +0000 (15:07 -0800)]
mm: provide helper for finishing mkwrite faults
Provide a helper function for finishing write faults due to PTE being
read-only. The helper will be used by DAX to avoid the need of
complicating generic MM code with DAX locking specifics.
Link: http://lkml.kernel.org/r/1479460644-25076-16-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:36 +0000 (15:07 -0800)]
mm: move part of wp_page_reuse() into the single call site
wp_page_reuse() handles write shared faults which is needed only in
wp_page_shared(). Move the handling only into that location to make
wp_page_reuse() simpler and avoid a strange situation when we sometimes
pass in locked page, sometimes unlocked etc.
Link: http://lkml.kernel.org/r/1479460644-25076-15-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:33 +0000 (15:07 -0800)]
mm: use vmf->page during WP faults
So far we set vmf->page during WP faults only when we needed to pass it
to the ->page_mkwrite handler. Set it in all the cases now and use that
instead of passing page pointer explicitly around.
Link: http://lkml.kernel.org/r/1479460644-25076-14-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:30 +0000 (15:07 -0800)]
mm: pass vm_fault structure into do_page_mkwrite()
We will need more information in the ->page_mkwrite() helper for DAX to
be able to fully finish faults there. Pass vm_fault structure to
do_page_mkwrite() and use it there so that information propagates
properly from upper layers.
Link: http://lkml.kernel.org/r/1479460644-25076-13-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:27 +0000 (15:07 -0800)]
mm: factor out common parts of write fault handling
Currently we duplicate handling of shared write faults in
wp_page_reuse() and do_shared_fault(). Factor them out into a common
function.
Link: http://lkml.kernel.org/r/1479460644-25076-12-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:24 +0000 (15:07 -0800)]
mm: move handling of COW faults into DAX code
Move final handling of COW faults from generic code into DAX fault
handler. That way generic code doesn't have to be aware of
peculiarities of DAX locking so remove that knowledge and make locking
functions private to fs/dax.c.
Link: http://lkml.kernel.org/r/1479460644-25076-11-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:21 +0000 (15:07 -0800)]
mm: factor out functionality to finish page faults
Introduce finish_fault() as a helper function for finishing page faults.
It is rather thin wrapper around alloc_set_pte() but since we'd want to
call this from DAX code or filesystems, it is still useful to avoid some
boilerplate code.
Link: http://lkml.kernel.org/r/1479460644-25076-10-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:18 +0000 (15:07 -0800)]
mm: allow full handling of COW faults in ->fault handlers
Patch series "dax: Clear dirty bits after flushing caches", v5.
Patchset to clear dirty bits from radix tree of DAX inodes when caches
for corresponding pfns have been flushed. In principle, these patches
enable handlers to easily update PTEs and do other work necessary to
finish the fault without duplicating the functionality present in the
generic code. I'd like to thank Kirill and Ross for reviews of the
series!
This patch (of 20):
To allow full handling of COW faults add memcg field to struct vm_fault
and a return value of ->fault() handler meaning that COW fault is fully
handled and memcg charge must not be canceled. This will allow us to
remove knowledge about special DAX locking from the generic fault code.
Link: http://lkml.kernel.org/r/1479460644-25076-9-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:16 +0000 (15:07 -0800)]
mm: add orig_pte field into vm_fault
Add orig_pte field to vm_fault structure to allow ->page_mkwrite
handlers to fully handle the fault.
This also allows us to save some passing of extra arguments around.
Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:13 +0000 (15:07 -0800)]
mm: use passed vm_fault structure for in wp_pfn_shared()
Instead of creating another vm_fault structure, use the one passed to
wp_pfn_shared() for passing arguments into pfn_mkwrite handler.
Link: http://lkml.kernel.org/r/1479460644-25076-7-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:10 +0000 (15:07 -0800)]
mm: trim __do_fault() arguments
Use vm_fault structure to pass cow_page, page, and entry in and out of
the function.
That reduces number of __do_fault() arguments from 4 to 1.
Link: http://lkml.kernel.org/r/1479460644-25076-6-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:07 +0000 (15:07 -0800)]
mm: use passed vm_fault structure in __do_fault()
Instead of creating another vm_fault structure, use the one passed to
__do_fault() for passing arguments into fault handler.
Link: http://lkml.kernel.org/r/1479460644-25076-5-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 14 Dec 2016 23:07:04 +0000 (15:07 -0800)]
mm: use pgoff in struct vm_fault instead of passing it separately
struct vm_fault has already pgoff entry. Use it instead of passing
pgoff as a separate argument and then assigning it later.
Link: http://lkml.kernel.org/r/1479460644-25076-4-git-send-email-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>