Linus Torvalds [Fri, 22 Nov 2013 03:46:00 +0000 (19:46 -0800)]
Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
"In this patchset, we finally get an SELinux update, with Paul Moore
taking over as maintainer of that code.
Also a significant update for the Keys subsystem, as well as
maintenance updates to Smack, IMA, TPM, and Apparmor"
and since I wanted to know more about the updates to key handling,
here's the explanation from David Howells on that:
"Okay. There are a number of separate bits. I'll go over the big bits
and the odd important other bit, most of the smaller bits are just
fixes and cleanups. If you want the small bits accounting for, I can
do that too.
(1) Keyring capacity expansion.
KEYS: Consolidate the concept of an 'index key' for key access
KEYS: Introduce a search context structure
KEYS: Search for auth-key by name rather than target key ID
Add a generic associative array implementation.
KEYS: Expand the capacity of a keyring
Several of the patches are providing an expansion of the capacity of a
keyring. Currently, the maximum size of a keyring payload is one page.
Subtract a small header and then divide up into pointers, that only gives
you ~500 pointers on an x86_64 box. However, since the NFS idmapper uses
a keyring to store ID mapping data, that has proven to be insufficient to
the cause.
Whatever data structure I use to handle the keyring payload, it can only
store pointers to keys, not the keys themselves because several keyrings
may point to a single key. This precludes inserting, say, and rb_node
struct into the key struct for this purpose.
I could make an rbtree of records such that each record has an rb_node
and a key pointer, but that would use four words of space per key stored
in the keyring. It would, however, be able to use much existing code.
I selected instead a non-rebalancing radix-tree type approach as that
could have a better space-used/key-pointer ratio. I could have used the
radix tree implementation that we already have and insert keys into it by
their serial numbers, but that means any sort of search must iterate over
the whole radix tree. Further, its nodes are a bit on the capacious side
for what I want - especially given that key serial numbers are randomly
allocated, thus leaving a lot of empty space in the tree.
So what I have is an associative array that internally is a radix-tree
with 16 pointers per node where the index key is constructed from the key
type pointer and the key description. This means that an exact lookup by
type+description is very fast as this tells us how to navigate directly to
the target key.
I made the data structure general in lib/assoc_array.c as far as it is
concerned, its index key is just a sequence of bits that leads to a
pointer. It's possible that someone else will be able to make use of it
also. FS-Cache might, for example.
(2) Mark keys as 'trusted' and keyrings as 'trusted only'.
KEYS: verify a certificate is signed by a 'trusted' key
KEYS: Make the system 'trusted' keyring viewable by userspace
KEYS: Add a 'trusted' flag and a 'trusted only' flag
KEYS: Separate the kernel signature checking keyring from module signing
These patches allow keys carrying asymmetric public keys to be marked as
being 'trusted' and allow keyrings to be marked as only permitting the
addition or linkage of trusted keys.
Keys loaded from hardware during kernel boot or compiled into the kernel
during build are marked as being trusted automatically. New keys can be
loaded at runtime with add_key(). They are checked against the system
keyring contents and if their signatures can be validated with keys that
are already marked trusted, then they are marked trusted also and can
thus be added into the master keyring.
Patches from Mimi Zohar make this usable with the IMA keyrings also.
(3) Remove the date checks on the key used to validate a module signature.
X.509: Remove certificate date checks
It's not reasonable to reject a signature just because the key that it was
generated with is no longer valid datewise - especially if the kernel
hasn't yet managed to set the system clock when the first module is
loaded - so just remove those checks.
(4) Make it simpler to deal with additional X.509 being loaded into the kernel.
KEYS: Load *.x509 files into kernel keyring
KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicate
The builder of the kernel now just places files with the extension ".x509"
into the kernel source or build trees and they're concatenated by the
kernel build and stuffed into the appropriate section.
(5) Add support for userspace kerberos to use keyrings.
KEYS: Add per-user_namespace registers for persistent per-UID kerberos caches
KEYS: Implement a big key type that can save to tmpfs
Fedora went to, by default, storing kerberos tickets and tokens in tmpfs.
We looked at storing it in keyrings instead as that confers certain
advantages such as tickets being automatically deleted after a certain
amount of time and the ability for the kernel to get at these tokens more
easily.
To make this work, two things were needed:
(a) A way for the tickets to persist beyond the lifetime of all a user's
sessions so that cron-driven processes can still use them.
The problem is that a user's session keyrings are deleted when the
session that spawned them logs out and the user's user keyring is
deleted when the UID is deleted (typically when the last log out
happens), so neither of these places is suitable.
I've added a system keyring into which a 'persistent' keyring is
created for each UID on request. Each time a user requests their
persistent keyring, the expiry time on it is set anew. If the user
doesn't ask for it for, say, three days, the keyring is automatically
expired and garbage collected using the existing gc. All the kerberos
tokens it held are then also gc'd.
(b) A key type that can hold really big tickets (up to 1MB in size).
The problem is that Active Directory can return huge tickets with lots
of auxiliary data attached. We don't, however, want to eat up huge
tracts of unswappable kernel space for this, so if the ticket is
greater than a certain size, we create a swappable shmem file and dump
the contents in there and just live with the fact we then have an
inode and a dentry overhead. If the ticket is smaller than that, we
slap it in a kmalloc()'d buffer"
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (121 commits)
KEYS: Fix keyring content gc scanner
KEYS: Fix error handling in big_key instantiation
KEYS: Fix UID check in keyctl_get_persistent()
KEYS: The RSA public key algorithm needs to select MPILIB
ima: define '_ima' as a builtin 'trusted' keyring
ima: extend the measurement list to include the file signature
kernel/system_certificate.S: use real contents instead of macro GLOBAL()
KEYS: fix error return code in big_key_instantiate()
KEYS: Fix keyring quota misaccounting on key replacement and unlink
KEYS: Fix a race between negating a key and reading the error set
KEYS: Make BIG_KEYS boolean
apparmor: remove the "task" arg from may_change_ptraced_domain()
apparmor: remove parent task info from audit logging
apparmor: remove tsk field from the apparmor_audit_struct
apparmor: fix capability to not use the current task, during reporting
Smack: Ptrace access check mode
ima: provide hash algo info in the xattr
ima: enable support for larger default filedata hash algorithms
ima: define kernel parameter 'ima_template=' to change configured default
ima: add Kconfig default measurement list template
...
Linus Torvalds [Fri, 22 Nov 2013 03:18:14 +0000 (19:18 -0800)]
Merge git://git.infradead.org/users/eparis/audit
Pull audit updates from Eric Paris:
"Nothing amazing. Formatting, small bug fixes, couple of fixes where
we didn't get records due to some old VFS changes, and a change to how
we collect execve info..."
Fixed conflict in fs/exec.c as per Eric and linux-next.
* git://git.infradead.org/users/eparis/audit: (28 commits)
audit: fix type of sessionid in audit_set_loginuid()
audit: call audit_bprm() only once to add AUDIT_EXECVE information
audit: move audit_aux_data_execve contents into audit_context union
audit: remove unused envc member of audit_aux_data_execve
audit: Kill the unused struct audit_aux_data_capset
audit: do not reject all AUDIT_INODE filter types
audit: suppress stock memalloc failure warnings since already managed
audit: log the audit_names record type
audit: add child record before the create to handle case where create fails
audit: use given values in tty_audit enable api
audit: use nlmsg_len() to get message payload length
audit: use memset instead of trying to initialize field by field
audit: fix info leak in AUDIT_GET requests
audit: update AUDIT_INODE filter rule to comparator function
audit: audit feature to set loginuid immutable
audit: audit feature to only allow unsetting the loginuid
audit: allow unsetting the loginuid (with priv)
audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
audit: loginuid functions coding style
selinux: apply selinux checks on new audit message types
...
Linus Torvalds [Wed, 20 Nov 2013 23:13:47 +0000 (15:13 -0800)]
Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc LE updates from Ben Herrenschmidt:
"With my previous pull request I mentioned some remaining Little Endian
patches, notably support for our new ABI, which I was sitting on
making sure it was all finalized.
The toolchain folks confirmed it now, the new ABI is stable and merged
with gcc, so we are all good. Oh and we actually missed the actual
Kconfig switch for LE so here it is, along with a couple more bug
fixes.
I have more fixes but not related to LE so I'll send them as a
separate pull request tomorrow, let's get this one out of the way.
Note that this supports running user space binaries using the new ABI,
but the kernel itself still needs to be built with the old one. We'll
bring fixes for that after -rc1.
Here's Anton log that goes with this series:
This patch series adds support for the new ABI, LPAR support for
H_SET_MODE and finally adds a kconfig option and defconfig.
ABIv2 support was recently committed to binutils and gcc, and should
be merged into glibc soon. There are a number of very nice
improvements including the removal of function descriptors. Rusty's
kernel patches allow binaries of either ABI to work, easing the
transition"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Wrong DWARF CFI in the kernel vdso for little-endian / ELFv2
powerpc: Add pseries_le_defconfig
powerpc: Add CONFIG_CPU_LITTLE_ENDIAN kernel config option.
powerpc: Don't use ELFv2 ABI to build the kernel
powerpc: ELF2 binaries signal handling
powerpc: ELF2 binaries launched directly.
powerpc: Set eflags correctly for ELF ABIv2 core dumps.
powerpc: Add TIF_ELF2ABI flag.
pseries: Add H_SET_MODE to change exception endianness
powerpc/pseries: Fix endian issues in pseries EEH code
Linus Torvalds [Wed, 20 Nov 2013 23:03:45 +0000 (15:03 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha
Pull alpha updates from Matt Turner:
"It contains a few fixes and some work from Richard to make alpha
emulation under QEMU much more usable"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mattst88/alpha:
alpha: Prevent a NULL ptr dereference in csum_partial_copy.
alpha: perf: fix out-of-bounds array access triggered from raw event
alpha: Use qemu+cserve provided high-res clock and alarm.
alpha: Switch to GENERIC_CLOCKEVENTS
alpha: Enable the rpcc clocksource for single processor
alpha: Reorganize rtc handling
alpha: Primitive support for CPU power down.
alpha: Allow HZ to be configured
alpha: Notice if we're being run under QEMU
alpha: Eliminate compiler warning from memset macro
Linus Torvalds [Wed, 20 Nov 2013 23:02:50 +0000 (15:02 -0800)]
Merge branch 'parisc-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
- revert an access_ok() patch which broke 32bit userspace on 64bit
kernels
- avoid a gcc miscompilation in two internal pa_memcpy() functions by
not inlining those
- do not export the definition of SOCK_NONBLOCK via uapi header (fixes
build of audit package)
- depending on the fault type we now correctly report either SIGBUS or
SIGSEGV
- a small fix to not compare a size_t variable for < 0
* 'parisc-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: size_t is unsigned, so comparison size < 0 doesn't make sense.
parisc: improve SIGBUS/SIGSEGV error reporting
parisc: break out SOCK_NONBLOCK define to own asm header file
parisc: do not inline pa_memcpy() internal functions
Revert "parisc: implement full version of access_ok()"
Linus Torvalds [Wed, 20 Nov 2013 23:02:22 +0000 (15:02 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32
Pull AVR32 updates from Hans-Christian Egtvedt.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32:
avr32: uapi: be sure of "_UAPI" prefix for all guard macros
avr32: add kprobe_ctlblk memory struct
avr32: fix out-of-range jump in large kernels
avr32: setup crt for early panic()
Linus Torvalds [Wed, 20 Nov 2013 22:51:37 +0000 (14:51 -0800)]
Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next
Pull squashfs updates from Phillip Lougher:
"These patches optionally improve the multi-threading peformance of
Squashfs by adding parallel decompression, and direct decompression
into the page cache, eliminating an intermediate buffer (removing
memcpy overhead and lock contention)"
* tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
Squashfs: Check stream is not NULL in decompressor_multi.c
Squashfs: Directly decompress into the page cache for file data
Squashfs: Restructure squashfs_readpage()
Squashfs: Generalise paging handling in the decompressors
Squashfs: add multi-threaded decompression using percpu variable
squashfs: Enhance parallel I/O
Squashfs: Refactor decompressor interface and code
Al points out that while the commit *does* actually create a separate
slab for the page->ptl allocation, that slab is never actually used, and
the code continues to use kmalloc/kfree.
Damien Wyart points out that the original patch did have the conversion
to use kmem_cache_alloc/free, so it got lost somewhere on its way to me.
Revert the half-arsed attempt that didn't do anything. If we really do
want the special slab (remember: this is all relevant just for debug
builds, so it's not necessarily all that critical) we might as well redo
the patch fully.
Reported-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kirill A Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 20 Nov 2013 22:25:39 +0000 (14:25 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs bits and pieces from Al Viro:
"Assorted bits that got missed in the first pull request + fixes for a
couple of coredump regressions"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fold try_to_ascend() into the sole remaining caller
dcache.c: get rid of pointless macros
take read_seqbegin_or_lock() and friends to seqlock.h
consolidate simple ->d_delete() instances
gfs2: endianness misannotations
dump_emit(): use __kernel_write(), not vfs_write()
dump_align(): fix the dumb braino
Ulrich Weigand [Wed, 20 Nov 2013 20:38:05 +0000 (07:38 +1100)]
powerpc: Wrong DWARF CFI in the kernel vdso for little-endian / ELFv2
I've finally tracked down why my CR signal-unwind test case still
fails on little-endian. The problem turned to be that the kernel
installs a signal trampoline in the vDSO, and provides a DWARF CFI
record for that trampoline. This CFI describes the save location
for CR:
rsave (70, 38*RSIZE + (RSIZE - CRSIZE))
which is correct for big-endian, but points to the wrong word on
little-endian. This is wrong no matter which ABI.
In addition, for the ELFv2 ABI, we should not only provide a CFI
record for register 70 (cr2), but for all CR fields separately.
Strictly speaking, I guess this would mean providing two separate
vDSO images, one for ELFv1 processes and one for ELFv2 processes (or
maybe playing some tricks with conditional DWARF expressions).
However, having CFI records for the other CR fields in ELFv1 is not
actually wrong, they just will be ignored. So it seems the simplest
fix would be just to always provide CFI for all the fields.
Signed-off-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Alistair Popple [Wed, 20 Nov 2013 11:15:04 +0000 (22:15 +1100)]
powerpc: Don't use ELFv2 ABI to build the kernel
The kernel doesn't build correctly using the ELFv2 ABI. This patch
ensures that the ELFv1 ABI is used when building a kernel with an
ELFv2 enabled compiler.
Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rusty Russell [Wed, 20 Nov 2013 11:15:03 +0000 (22:15 +1100)]
powerpc: ELF2 binaries signal handling
For the ELFv2 ABI, the hander is the entry point, not a function descriptor.
We also need to set up r12, and fortunately the fast_exception_return
exit path restores r12 for us so nothing else is required.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rusty Russell [Wed, 20 Nov 2013 11:15:02 +0000 (22:15 +1100)]
powerpc: ELF2 binaries launched directly.
No function descriptor, but we set r12 up and set TIF_RESTOREALL as it
normally isn't restored on return from syscall.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rusty Russell [Wed, 20 Nov 2013 11:15:01 +0000 (22:15 +1100)]
powerpc: Set eflags correctly for ELF ABIv2 core dumps.
We leave it at zero (though it could be 1) for old tasks.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rusty Russell [Wed, 20 Nov 2013 11:15:00 +0000 (22:15 +1100)]
powerpc: Add TIF_ELF2ABI flag.
Little endian ppc64 is getting an exciting new ABI. This is reflected
by the bottom two bits of e_flags in the ELF header:
0 == legacy binaries (v1 ABI)
1 == binaries using the old ABI (compiled with a new toolchain)
2 == binaries using the new ABI.
We store this in a thread flag, because we need to set it in core
dumps and for signal delivery. Our chief concern is that it doesn't
use function descriptors.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Wed, 20 Nov 2013 11:14:59 +0000 (22:14 +1100)]
pseries: Add H_SET_MODE to change exception endianness
On little endian builds call H_SET_MODE so exceptions have the
correct endianness. We need to reset the endian during kexec
so do that in the MMU hashtable clear callback.
Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Linus Torvalds [Wed, 20 Nov 2013 21:25:04 +0000 (13:25 -0800)]
Merge tag 'pm+acpi-2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI and power management updates from Rafael Wysocki:
- ACPI-based device hotplug fixes for issues introduced recently and a
fix for an older error code path bug in the ACPI PCI host bridge
driver
- Fix for recently broken OMAP cpufreq build from Viresh Kumar
- Fix for a recent hibernation regression related to s2disk
- Fix for a locking-related regression in the ACPI EC driver from
Puneet Kumar
- System suspend error code path fix related to runtime PM and runtime
PM documentation update from Ulf Hansson
- cpufreq's conservative governor fix from Xiaoguang Chen
- New processor IDs for intel_idle and turbostat and removal of an
obsolete Kconfig option from Len Brown
- New device IDs for the ACPI LPSS (Low-Power Subsystem) driver and
ACPI-based PCI hotplug (ACPIPHP) cleanup from Mika Westerberg
- Removal of several ACPI video DMI blacklist entries that are not
necessary any more from Aaron Lu
- Rework of the ACPI companion representation in struct device and code
cleanup related to that change from Rafael J Wysocki, Lan Tianyu and
Jarkko Nikula
- Fixes for assigning names to ACPI-enumerated I2C and SPI devices from
Jarkko Nikula
* tag 'pm+acpi-2-3.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (24 commits)
PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration
ACPI / scan: Set flags.match_driver in acpi_bus_scan_fixed()
ACPI / PCI root: Clear driver_data before failing enumeration
ACPI / hotplug: Fix PCI host bridge hot removal
ACPI / hotplug: Fix acpi_bus_get_device() return value check
cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs()
ACPI / video: clean up DMI table for initial black screen problem
ACPI / EC: Ensure lock is acquired before accessing ec struct members
PM / Hibernate: Do not crash kernel in free_basic_memory_bitmaps()
ACPI / AC: Remove struct acpi_device pointer from struct acpi_ac
spi: Use stable dev_name for ACPI enumerated SPI slaves
i2c: Use stable dev_name for ACPI enumerated I2C slaves
ACPI: Provide acpi_dev_name accessor for struct acpi_device device name
ACPI / bind: Use (put|get)_device() on ACPI device objects too
ACPI: Eliminate the DEVICE_ACPI_HANDLE() macro
ACPI / driver core: Store an ACPI device pointer in struct acpi_dev_node
cpufreq: OMAP: Fix compilation error 'r & ret undeclared'
PM / Runtime: Fix error path for prepare
PM / Runtime: Update documentation around probe|remove|suspend
cpufreq: conservative: set requested_freq to policy max when it is over policy max
...
Linus Torvalds [Wed, 20 Nov 2013 21:20:24 +0000 (13:20 -0800)]
Merge branch 'next' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave-dmaengine changes from Vinod Koul:
"This brings for slave dmaengine:
- Change dma notification flag to DMA_COMPLETE from DMA_SUCCESS as
dmaengine can only transfer and not verify validaty of dma
transfers
- Bunch of fixes across drivers:
- cppi41 driver fixes from Daniel
- 8 channel freescale dma engine support and updated bindings from
Hongbo
- msx-dma fixes and cleanup by Markus
- DMAengine updates from Dan:
- Bartlomiej and Dan finalized a rework of the dma address unmap
implementation.
- In the course of testing 1/ a collection of enhancements to
dmatest fell out. Notably basic performance statistics, and
fixed / enhanced test control through new module parameters
'run', 'wait', 'noverify', and 'verbose'. Thanks to Andriy and
Linus [Walleij] for their review.
- Testing the raid related corner cases of 1/ triggered bugs in
the recently added 16-source operation support in the ioatdma
driver.
- Some minor fixes / cleanups to mv_xor and ioatdma"
* 'next' of git://git.infradead.org/users/vkoul/slave-dma: (99 commits)
dma: mv_xor: Fix mis-usage of mmio 'base' and 'high_base' registers
dma: mv_xor: Remove unneeded NULL address check
ioat: fix ioat3_irq_reinit
ioat: kill msix_single_vector support
raid6test: add new corner case for ioatdma driver
ioatdma: clean up sed pool kmem_cache
ioatdma: fix selection of 16 vs 8 source path
ioatdma: fix sed pool selection
ioatdma: Fix bug in selftest after removal of DMA_MEMSET.
dmatest: verbose mode
dmatest: convert to dmaengine_unmap_data
dmatest: add a 'wait' parameter
dmatest: add basic performance metrics
dmatest: add support for skipping verification and random data setup
dmatest: use pseudo random numbers
dmatest: support xor-only, or pq-only channels in tests
dmatest: restore ability to start test at module load and init
dmatest: cleanup redundant "dmatest: " prefixes
dmatest: replace stored results mechanism, with uniform messages
Revert "dmatest: append verify result to results"
...
Linus Torvalds [Wed, 20 Nov 2013 21:06:20 +0000 (13:06 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
"Normally I'd defer my initial for-linus pull request until after the
merge window, but a race was uncovered in the virtio-blk conversion to
blk-mq that could cause hangs. So here's a small collection of fixes
for you to pull:
- The fix for the virtio-blk IO hang reported by Dave Chinner, from
Shaohua and myself.
- Add the Insert blktrace event for blk-mq. This makes 'btt' happy
when it is doing it's state transition analysis.
- Ensure that blk-mq has disk/partition stats enabled by default,
instead of making it opt-in.
- A fix for __bio_add_page() and large sector counts"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: add blktrace insert event trace
virtio-blk: virtqueue_kick() must be ordered with other virtqueue operations
blk-mq: ensure that we set REQ_IO_STAT so diskstats work
bio: fix argument of __bio_add_page() for max_sectors > 0xffff
Linus Torvalds [Wed, 20 Nov 2013 21:05:25 +0000 (13:05 -0800)]
Merge tag 'md/3.13' of git://neil.brown.name/md
Pull md update from Neil Brown:
"Mostly optimisations and obscure bug fixes.
- raid5 gets less lock contention
- raid1 gets less contention between normal-io and resync-io during
resync"
* tag 'md/3.13' of git://neil.brown.name/md:
md/raid5: Use conf->device_lock protect changing of multi-thread resources.
md/raid5: Before freeing old multi-thread worker, it should flush them.
md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
raid1: Rewrite the implementation of iobarrier.
raid1: Add some macros to make code clearly.
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
raid1: Add a field array_frozen to indicate whether raid in freeze state.
md: Convert use of typedef ctl_table to struct ctl_table
md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
md: fix some places where mddev_lock return value is not checked.
raid5: Retry R5_ReadNoMerge flag when hit a read error.
raid5: relieve lock contention in get_active_stripe()
raid5: relieve lock contention in get_active_stripe()
wait: add wait_event_cmd()
md/raid5.c: add proper locking to error path of raid5_start_reshape.
md: fix calculation of stacking limits on level change.
raid5: Use slow_path to release stripe when mddev->thread is null
Chen Gang [Tue, 12 Nov 2013 08:38:47 +0000 (16:38 +0800)]
avr32: uapi: be sure of "_UAPI" prefix for all guard macros
For all uapi headers, need use "_UAPI" prefix for its guard macro
(which will be stripped by "scripts/headers_installer.sh").
Also remove redundant files (bitsperlong.h, errno.h, fcntl.h, ioctl.h,
ioctls.h, ipcbuf.h, kvm_para.h, mman.h, poll.h, resource.h, siginfo.h,
statfs.h, and unistd.h) which are already in Kbuild.
Also be sure that all "#endif" only have one empty line above, and each
file has guard macro.
Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Hans-Christian Egtvedt <hegtvedt@cisco.com>
This patch fixes following error (for big kernels):
---8<---
arch/avr32/boot/u-boot/head.o: In function `no_tag_table':
(.init.text+0x44): relocation truncated to fit: R_AVR32_22H_PCREL against symbol `panic' defined in .text.unlikely section in kernel/built-in.o
arch/avr32/kernel/built-in.o: In function `bad_return':
(.ex.text+0x236): relocation truncated to fit: R_AVR32_22H_PCREL against symbol `panic' defined in .text.unlikely section in kernel/built-in.o
--->8---
It comes up when the kernel increases and 'panic()' is too far away to fit in
the +/- 2MiB range. Which in turn issues from the 21-bit displacement in
'br{cond4}' mnemonic which is one of the two ways to do jumps (rjmp has just
10-bit displacement and therefore a way smaller range). This fact was stated
before in 8d29b7b9f81d6b83d869ff054e6c189d6da73f1f.
One solution to solve this is to add a local storage for the symbol address
and just load the $pc with that value.
Phillip Lougher [Wed, 13 Nov 2013 02:04:19 +0000 (02:04 +0000)]
Squashfs: Directly decompress into the page cache for file data
This introduces an implementation of squashfs_readpage_block()
that directly decompresses into the page cache.
This uses the previously added page handler abstraction to push
down the necessary kmap_atomic/kunmap_atomic operations on the
page cache buffers into the decompressors. This enables
direct copying into the page cache without using the slow
kmap/kunmap calls.
The code detects when multiple threads are racing in
squashfs_readpage() to decompress the same block, and avoids
this regression by falling back to using an intermediate
buffer.
This patch enhances the performance of Squashfs significantly
when multiple processes are accessing the filesystem simultaneously
because it not only reduces memcopying, but it more importantly
eliminates the lock contention on the intermediate buffer.
Phillip Lougher [Mon, 18 Nov 2013 02:59:12 +0000 (02:59 +0000)]
Squashfs: Generalise paging handling in the decompressors
Further generalise the decompressors by adding a page handler
abstraction. This adds helpers to allow the decompressors
to access and process the output buffers in an implementation
independant manner.
This allows different types of output buffer to be passed
to the decompressors, with the implementation specific
aspects handled at decompression time, but without the
knowledge being held in the decompressor wrapper code.
This will allow the decompressors to handle Squashfs
cache buffers, and page cache pages.
This patch adds the abstraction and an implementation for
the caches.
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reviewed-by: Minchan Kim <minchan@kernel.org>
Minchan Kim [Mon, 28 Oct 2013 05:26:30 +0000 (14:26 +0900)]
squashfs: Enhance parallel I/O
Now squashfs have used for only one stream buffer for decompression
so it hurts parallel read performance so this patch supports
multiple decompressor to enhance performance parallel I/O.
Four 1G file dd read on KVM machine which has 2 CPU and 4G memory.
Phillip Lougher [Wed, 13 Nov 2013 02:56:26 +0000 (02:56 +0000)]
Squashfs: Refactor decompressor interface and code
The decompressor interface and code was written from
the point of view of single-threaded operation. In doing
so it mixed a lot of single-threaded implementation specific
aspects into the decompressor code and elsewhere which makes it
difficult to seamlessly support multiple different decompressor
implementations.
This patch does the following:
1. It removes compressor_options parsing from the decompressor
init() function. This allows the decompressor init() function
to be dynamically called to instantiate multiple decompressors,
without the compressor options needing to be read and parsed each
time.
2. It moves threading and all sleeping operations out of the
decompressors. In doing so, it makes the decompressors
non-blocking wrappers which only deal with interfacing with
the decompressor implementation.
3. It splits decompressor.[ch] into decompressor generic functions
in decompressor.[ch], and moves the single threaded
decompressor implementation into decompressor_single.c.
The result of this patch is Squashfs should now be able to
support multiple decompressors by adding new decompressor_xxx.c
files with specialised implementations of the functions in
decompressor_single.c
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reviewed-by: Minchan Kim <minchan@kernel.org>
Shaohua Li [Wed, 20 Nov 2013 01:57:24 +0000 (18:57 -0700)]
virtio-blk: virtqueue_kick() must be ordered with other virtqueue operations
It isn't safe to call it without holding the vblk->vq_lock.
Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Shaohua Li <shli@fusionio.com>
Fixed another condition of virtqueue_kick() not holding the lock.
It appears that driver runs into a problem here if fibsize is too small
because we allocate user_srbcmd with fibsize size only but later we
access it until user_srbcmd->sg.count to copy it over to srbcmd.
It is not correct to test (fibsize < sizeof(*user_srbcmd)) because this
structure already includes one sg element and this is not needed for
commands without data. So, we would recommend to add the following
(instead of test for fibsize == 0).
Pull networking fixes from David Miller:
"Mostly these are fixes for fallout due to merge window changes, as
well as cures for problems that have been with us for a much longer
period of time"
1) Johannes Berg noticed two major deficiencies in our genetlink
registration. Some genetlink protocols we passing in constant
counts for their ops array rather than something like
ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were
using fixed IDs for their multicast groups.
We have to retain these fixed IDs to keep existing userland tools
working, but reserve them so that other multicast groups used by
other protocols can not possibly conflict.
In dealing with these two problems, we actually now use less state
management for genetlink operations and multicast groups.
2) When configuring interface hardware timestamping, fix several
drivers that simply do not validate that the hwtstamp_config value
is one the driver actually supports. From Ben Hutchings.
3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.
4) In dev_forward_skb(), set the skb->protocol in the right order
relative to skb_scrub_packet(). From Alexei Starovoitov.
5) Bridge erroneously fails to use the proper wrapper functions to make
calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki
Makita.
6) When detaching a bridge port, make sure to flush all VLAN IDs to
prevent them from leaking, also from Toshiaki Makita.
7) Put in a compromise for TCP Small Queues so that deep queued devices
that delay TX reclaim non-trivially don't have such a performance
decrease. One particularly problematic area is 802.11 AMPDU in
wireless. From Eric Dumazet.
8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
here. Fix from Eric Dumzaet, reported by Dave Jones.
9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.
10) When computing mergeable buffer sizes, virtio-net fails to take the
virtio-net header into account. From Michael Dalton.
11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic
bumping, this one has been with us for a while. From Eric Dumazet.
12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
Hugne.
13) 6lowpan bit used for traffic classification was wrong, from Jukka
Rissanen.
14) macvlan has the same issue as normal vlans did wrt. propagating LRO
disabling down to the real device, fix it the same way. From Michal
Kubecek.
15) CPSW driver needs to soft reset all slaves during suspend, from
Daniel Mack.
16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.
17) The xen-netfront RX buffer refill timer isn't properly scheduled on
partial RX allocation success, from Ma JieYue.
18) When ipv6 ping protocol support was added, the AF_INET6 protocol
initialization cleanup path on failure was borked a little. Fix
from Vlad Yasevich.
19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
blocks we can do the wrong thing with the msg_name we write back to
userspace. From Hannes Frederic Sowa. There is another fix in the
works from Hannes which will prevent future problems of this nature.
20) Fix route leak in VTI tunnel transmit, from Fan Du.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
genetlink: make multicast groups const, prevent abuse
genetlink: pass family to functions using groups
genetlink: add and use genl_set_err()
genetlink: remove family pointer from genl_multicast_group
genetlink: remove genl_unregister_mc_group()
hsr: don't call genl_unregister_mc_group()
quota/genetlink: use proper genetlink multicast APIs
drop_monitor/genetlink: use proper genetlink multicast APIs
genetlink: only pass array to genl_register_family_with_ops()
tcp: don't update snd_nxt, when a socket is switched from repair mode
atm: idt77252: fix dev refcnt leak
xfrm: Release dst if this dst is improper for vti tunnel
netlink: fix documentation typo in netlink_set_err()
be2net: Delete secondary unicast MAC addresses during be_close
be2net: Fix unconditional enabling of Rx interface options
net, virtio_net: replace the magic value
ping: prevent NULL pointer dereference on write to msg_name
bnx2x: Prevent "timeout waiting for state X"
bnx2x: prevent CFC attention
bnx2x: Prevent panic during DMAE timeout
...
Linus Torvalds [Tue, 19 Nov 2013 23:49:31 +0000 (15:49 -0800)]
Merge tag 'please-pull-fixia64' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull ia64 fix from Tony Luck:
"Unbreak ia64 build by avoiding circular dependency"
* tag 'please-pull-fixia64' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
kernel/bounds: avoid circular dependencies in generated headers
Helge Deller [Sun, 17 Nov 2013 21:03:11 +0000 (22:03 +0100)]
parisc: do not inline pa_memcpy() internal functions
gcc (4.8.x) creates wrong code when the pa_memcpy() functions are
inlined. Especially in 32bit builds it calculates wrong return values
if we encounter a fault during execution of the memcpy.
It broke userspace and adding more checking is not needed.
Even checking if a syscall would access memory in page zero doesn't
makes sense since it may lead to some syscalls returning -EFAULT
where we would return other error codes on other platforms.
In summary, just drop this change and return to always return 1.
kernel/bounds: avoid circular dependencies in generated headers
<linux/spinlock.h> has heavy dependencies on other header files.
It triggers circular dependencies in generated headers on IA64, at
least:
CC kernel/bounds.s
In file included from /home/space/kas/git/public/linux/arch/ia64/include/asm/thread_info.h:9:0,
from include/linux/thread_info.h:54,
from include/asm-generic/preempt.h:4,
from arch/ia64/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:18,
from include/linux/spinlock.h:50,
from kernel/bounds.c:14:
/home/space/kas/git/public/linux/arch/ia64/include/asm/asm-offsets.h:1:35: fatal error: generated/asm-offsets.h: No such file or directory
compilation terminated.
Let's replace <linux/spinlock.h> with <linux/spinlock_types.h>, it's
enough to find out size of spinlock_t.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-and-Tested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
David S. Miller [Tue, 19 Nov 2013 21:39:42 +0000 (16:39 -0500)]
Merge branch 'genetlink_mcast'
Johannes Berg says:
====================
genetlink: clean up multicast group APIs
The generic netlink multicast group registration doesn't have to
be dynamic, and can thus be simplified just like I did with the
ops. This removes some complexity in registration code.
Additionally, two users of generic netlink already use multicast
groups in a wrong way, add workarounds for those two to keep the
userspace API working, but at the same time make them not clash
with other users of multicast groups as might happen now.
While making it all a bit easier, also prevent such abuse by adding
checks to the APIs so each family can only use the groups it owns.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:39 +0000 (15:19 +0100)]
genetlink: make multicast groups const, prevent abuse
Register generic netlink multicast groups as an array with
the family and give them contiguous group IDs. Then instead
of passing the global group ID to the various functions that
send messages, pass the ID relative to the family - for most
families that's just 0 because the only have one group.
This avoids the list_head and ID in each group, adding a new
field for the mcast group ID offset to the family.
At the same time, this allows us to prevent abusing groups
again like the quota and dropmon code did, since we can now
check that a family only uses a group it owns.
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:38 +0000 (15:19 +0100)]
genetlink: pass family to functions using groups
This doesn't really change anything, but prepares for the
next patch that will change the APIs to pass the group ID
within the family, rather than the global group ID.
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:37 +0000 (15:19 +0100)]
genetlink: add and use genl_set_err()
Add a static inline to generic netlink to wrap netlink_set_err()
to make it easier to use here - use it in openvswitch (the only
generic netlink user of netlink_set_err()).
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:33 +0000 (15:19 +0100)]
quota/genetlink: use proper genetlink multicast APIs
The quota code is abusing the genetlink API and is using
its family ID as the multicast group ID, which is invalid
and may belong to somebody else (and likely will.)
Make the quota code use the correct API, but since this
is already used as-is by userspace, reserve a family ID
for this code and also reserve that group ID to not break
userspace assumptions.
Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:32 +0000 (15:19 +0100)]
drop_monitor/genetlink: use proper genetlink multicast APIs
The drop monitor code is abusing the genetlink API and is
statically using the generic netlink multicast group 1, even
if that group belongs to somebody else (which it invariably
will, since it's not reserved.)
Make the drop monitor code use the proper APIs to reserve a
group ID, but also reserve the group id 1 in generic netlink
code to preserve the userspace API. Since drop monitor can
be a module, don't clear the bit for it on unregistration.
Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 19 Nov 2013 14:19:31 +0000 (15:19 +0100)]
genetlink: only pass array to genl_register_family_with_ops()
As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.
The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Vagin [Tue, 19 Nov 2013 18:10:06 +0000 (22:10 +0400)]
tcp: don't update snd_nxt, when a socket is switched from repair mode
snd_nxt must be updated synchronously with sk_send_head. Otherwise
tp->packets_out may be updated incorrectly, what may bring a kernel panic.
Here is a kernel panic from my host.
[ 103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[ 103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150
...
[ 146.301158] Call Trace:
[ 146.301158] [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0
Before this panic a tcp socket was restored. This socket had sent and
unsent data in the write queue. Sent data was restored in repair mode,
then the socket was switched from reapair mode and unsent data was
restored. After that the socket was switched back into repair mode.
In that moment we had a socket where write queue looks like this:
snd_una snd_nxt write_seq
|_________|________|
|
sk_send_head
After a second switching from repair mode the state of socket was
changed:
This state is inconsistent, because snd_nxt and sk_send_head are not
synchronized.
Bellow you can find a call trace, how packets_out can be incremented
twice for one skb, if snd_nxt and sk_send_head are not synchronized.
In this case packets_out will be always positive, even when
sk_write_queue is empty.
I think update of snd_nxt isn't required, when a socket is switched from
repair mode. Because it's initialized in tcp_connect_init. Then when a
write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent,
so it's always is in consistent state.
I have checked, that the bug is not reproduced with this patch and
all tests about restoring tcp connections work fine.
Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
fan.du [Tue, 19 Nov 2013 08:53:28 +0000 (16:53 +0800)]
xfrm: Release dst if this dst is improper for vti tunnel
After searching rt by the vti tunnel dst/src parameter,
if this rt has neither attached to any transformation
nor the transformation is not tunnel oriented, this rt
should be released back to ip layer.
otherwise causing dst memory leakage.
Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mika Westerberg [Tue, 19 Nov 2013 14:15:35 +0000 (16:15 +0200)]
PCI / hotplug / ACPI: Drop unused acpiphp_debug declaration
Commit bd950799d951 (PCI: acpiphp: Convert to dynamic debug) removed users
of acpiphp_debug variable and the variable itself but the declaration was
left in the header file. Drop this unused declaration.
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Linus Torvalds [Tue, 19 Nov 2013 19:44:15 +0000 (11:44 -0800)]
Merge tag 'arc-v3.13-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull second set of ARC changes from Vineet Gupta:
- Support for Perf from Mischa
- Enabling GPIO/Pinctrl drivers for Abilis TB10x platform
- New defconfig for buildroot
* tag 'arc-v3.13-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: [plat-arcfpga] Add defconfig without initramfs location
ARC: perf: ARC 700 PMU doesn't support sampling events
ARC: Add documentation on DT binding for ARC700 PMU
ARC: Add perf support for ARC700 cores
ARC: [TB10x] Updates for GPIO and pinctrl
Linus Torvalds [Tue, 19 Nov 2013 19:43:21 +0000 (11:43 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull second set of s390 patches from Martin Schwidefsky:
"The handling of the PCI hotplug notifications has been improved, the
zfcp dumper can now detect the HSA size dynamically and the default
install kernel has been changed to the compressed bzImage. And two
bug-fixes for scm and 3720"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: implement hotplug notifications
s390/scm_block: do not hide eadm subchannel dependency
s390/sclp: Consolidate early sclp init calls to sclp_early_detect()
s390/sclp: Move early code from sclp_cmd.c to sclp_early.c
s390/sclp: Determine HSA size dynamically for zfcpdump
s390/sclp: Move declarations for sclp_sdias into separate header file
s390/pci: implement pcibios_remove_bus
s390/pci: improve handling of bus resources
s390/3270: fix missing device_destroy() call
s390/boot: Install bzImage as default kernel image
Linus Torvalds [Tue, 19 Nov 2013 19:42:32 +0000 (11:42 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML changes from Richard Weinberger:
"This pile contains a nice defconfig cleanup, a rewritten stack
unwinder and various cleanups"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Remove unused declarations from <as-layout.h>
um: remove used STDIO_CONSOLE Kconfig param
um/vdso: add .gitignore for a couple of targets
arch/um: make it work with defconfig and x86_64
um: Make kstack_depth_to_print conform to arch/x86
um: Get rid of thread_struct->saved_task
um: Make stack trace reliable against kernel mode faults
um: Rewrite show_stack()
Linus Torvalds [Tue, 19 Nov 2013 18:40:00 +0000 (10:40 -0800)]
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Ingo Molnar:
"This is a multi-arch cleanup series from Thomas Gleixner, which we
kept to near the end of the merge window, to not interfere with
architecture updates.
This series (motivated by the -rt kernel) unifies more aspects of IRQ
handling and generalizes PREEMPT_ACTIVE"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
preempt: Make PREEMPT_ACTIVE generic
sparc: Use preempt_schedule_irq
ia64: Use preempt_schedule_irq
m32r: Use preempt_schedule_irq
hardirq: Make hardirq bits generic
m68k: Simplify low level interrupt handling code
genirq: Prevent spurious detection for unconditionally polled interrupts
Jens Axboe [Tue, 19 Nov 2013 16:25:07 +0000 (09:25 -0700)]
blk-mq: ensure that we set REQ_IO_STAT so diskstats work
If disk stats are enabled on the queue, a request needs to
be marked with REQ_IO_STAT for accounting to be active on
that request. This fixes an issue with virtio-blk not
showing up in /proc/diskstats after the conversion to
blk-mq.
Add QUEUE_FLAG_MQ_DEFAULT, setting stats and same cpu-group
completion on by default.
Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is because raid5d() can be running when the multi-thread
resources are changed via system. We see need to provide locking.
mddev->device_lock is suitable, but we cannot simple call
alloc_thread_groups under this lock as we cannot allocate memory
while holding a spinlock.
So change alloc_thread_groups() to allocate and return the data
structures, then raid5_store_group_thread_cnt() can take the lock
while updating the pointers to the data structures.
This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
stable series.
Fixes: b721420e8719131896b009b11edbbd27 Cc: stable@vger.kernel.org (3.12) Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Shaohua Li <shli@kernel.org>
This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
A worker is queued to handle them.
But before calling raid5_do_work, raid5d handles those
stripes making conf->active_stripe = 0.
So mddev_suspend() can return.
We might then free old worker resources before the queued
raid5_do_work() handled them. When it runs, it crashes.
Aurelien Jarno [Thu, 14 Nov 2013 04:16:19 +0000 (15:16 +1100)]
UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
linux/raid/md_p.h is using conditionals depending on endianess and fails
with an error if neither of __BIG_ENDIAN, __LITTLE_ENDIAN or
__BYTE_ORDER are defined, but it doesn't include any header which can
define these constants. This make this header unusable alone.
This patch adds a #include <asm/byteorder.h> at the beginning of this
header to make it usable alone. This is needed to compile klibc on MIPS.
majianpeng [Fri, 15 Nov 2013 06:55:02 +0000 (14:55 +0800)]
raid1: Rewrite the implementation of iobarrier.
There is an iobarrier in raid1 because of contention between normal IO and
resync IO. It suspends all normal IO when resync/recovery happens.
However if normal IO is out side the resync window, there is no contention.
So this patch changes the barrier mechanism to only block IO that
could contend with the resync that is currently happening.
We partition the whole space into five parts.
|---------|-----------|------------|----------------|-------|
start next_resync start_next_window end_window
1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
and start_next_window. It also indicates the distance between
start_next_window and end_window.
It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
this turned out not to be optimal.
3 - next_resync: the next sector at which we will do sync IO.
4 - start: a position which is at most RESYNC_WINDOW before
next_resync.
5 - start_next_window: a position which is NEXT_NORMALIO_DISTANCE
beyond next_resync. Normal-io after this position doesn't need to
wait for resync-io to complete.
6 - end_window: a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
next_resync. This also doesn't need to wait, but is counted
differently.
7 - current_window_requests: the count of normalIO between
start_next_window and end_window.
8 - next_window_requests: the count of normalIO after end_window.
NormalIO will be partitioned into four types:
NormIO1: the end sector of bio is smaller or equal the start
NormIO2: the start sector of bio larger or equal to end_window
NormIO3: the start sector of bio larger or equal to
start_next_window.
NormIO4: the location between start_next_window and end_window
For NormIO1, we don't need any io barrier.
For NormIO4, we used a similar approach to the original iobarrier
mechanism. The normalIO and resyncIO must be kept separate.
For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
and "next_window_requests". They indicate the count of active
requests in the two window.
For these, we don't wait for resync io to complete.
For resync action, if there are NormIO4s, we must wait for it.
If not, we can proceed.
But if resync action reaches start_next_window and
current_window_requests > 0 (that is there are NormIO3s), we must
wait until the current_window_requests becomes zero.
When current_window_requests becomes zero, start_next_window also
moves forward. Then current_window_requests will replaced by
next_window_requests.
There is a problem which when and how to change from NormIO2 to
NormIO3. Only then can sync action progress.
We add a field in struct r1conf "start_next_window".
A: if start_next_window == MaxSector, it means there are no NormIO2/3.
So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
B: if current_window_requests == 0 && next_window_requests != 0, it
means start_next_window move to end_window
There is another problem which how to differentiate between
old NormIO2(now it is NormIO3) and NormIO2.
For example, there are many bios which are NormIO2 and a bio which is
NormIO3. NormIO3 firstly completed, so the bios of NormIO2 became NormIO3.
We add a field in struct r1bio "start_next_window".
This is used to record the position conf->start_next_window when the call
to wait_barrier() is made in make_request().
In allow_barrier(), we check the conf->start_next_window.
If r1bio->stat_next_window == conf->start_next_window, it means
there is no transition between NormIO2 and NormIO3.
If r1bio->start_next_window != conf->start_next_window, it mean
there was a transition between NormIO2 and NormIO3. There can only
have been one transition. So it only means the bio is old NormIO2.
For one bio, there may be many r1bio's. So we make sure
all the r1bio->start_next_window are the same value.
If we met blocked_dev in make_request(), it must call allow_barrier
and wait_barrier. So the former and the later value of
conf->start_next_window will be change.
If there are many r1bio's with differnet start_next_window,
for the relevant bio, it depend on the last value of r1bio.
It will cause error. To avoid this, we must wait for previous r1bios
to complete.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
majianpeng [Thu, 14 Nov 2013 04:16:18 +0000 (15:16 +1100)]
raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
We used to use raise_barrier to suspend normal IO while we reconfigure
the array. However raise_barrier will soon only suspend some normal
IO, not all. So we need something else.
Change it to use freeze_array.
But freeze_array not only suspends normal io, it also suspends
resync io.
For the place where call raise_barrier for reconfigure, it isn't a
problem.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
majianpeng [Thu, 14 Nov 2013 04:16:18 +0000 (15:16 +1100)]
raid1: Add a field array_frozen to indicate whether raid in freeze state.
Because the following patch will rewrite the content between normal IO
and resync IO. So we used a parameter to indicate whether raid is in freeze
array.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Thu, 14 Nov 2013 04:16:17 +0000 (15:16 +1100)]
md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
When raid5 recovery hits a fresh badblock, this badblock will flagged as unack
badblock until md_update_sb() is called.
But md_stop will take reconfig lock which means raid5d can't call
md_update_sb() in md_check_recovery(), the badblock will always
be unack, so raid5d thread enters an infinite loop and md_stop_write()
can never stop sync_thread. This causes deadlock.
To solve this, when STOP_ARRAY ioctl is issued and sync_thread is
running, we need set md->recovery FROZEN and INTR flags and wait for
sync_thread to stop before we (re)take reconfig lock.
This requires that raid5 reshape_request notices MD_RECOVERY_INTR
(which it probably should have noticed anyway) and stops waiting for a
metadata update in that case.
NeilBrown [Tue, 19 Nov 2013 01:02:01 +0000 (12:02 +1100)]
md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
We currently use kthread_should_stop() in various places in the
sync/reshape code to abort early.
However some places set MD_RECOVERY_INTR but don't immediately call
md_reap_sync_thread() (and we will shortly get another one).
When this happens we are relying on md_check_recovery() to reap the
thread and that only happen when it finishes normally.
So MD_RECOVERY_INTR must lead to a normal finish without the
kthread_should_stop() test.
So replace all relevant tests, and be more careful when the thread is
interrupted not to acknowledge that latest step in a reshape as it may
not be fully committed yet.
Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
so we don't wait have to wait for the speed to drop before we can abort.
NeilBrown [Thu, 14 Nov 2013 06:54:51 +0000 (17:54 +1100)]
md: fix some places where mddev_lock return value is not checked.
Sometimes we need to lock and mddev and cannot cope with
failure due to interrupt.
In these cases we should use mutex_lock, not mutex_lock_interruptible.
Bian Yu [Thu, 14 Nov 2013 04:16:17 +0000 (15:16 +1100)]
raid5: Retry R5_ReadNoMerge flag when hit a read error.
Because of block layer merge, one bio fails will cause other bios
which belongs to the same request fails, so raid5_end_read_request
will record all these bios as badblocks.
If retry request with R5_ReadNoMerge flag to avoid bios merge,
badblocks can only record sector which is bad exactly.
* pm-cpufreq:
cpufreq: governor: Remove fossil comment in the cpufreq_governor_dbs()
cpufreq: OMAP: Fix compilation error 'r & ret undeclared'
cpufreq: conservative: set requested_freq to policy max when it is over policy max
ACPI / scan: Set flags.match_driver in acpi_bus_scan_fixed()
Before commit 6931007cc90b (ACPI / scan: Start matching drivers
after trying scan handlers) the match_driver flag for all devices
was set in acpi_add_single_object(), but now it is set by
acpi_bus_device_attach() which is not called for the "fixed"
devices added by acpi_bus_scan_fixed(). This means that
flags.match_driver is never set for those devices now, so make
acpi_bus_scan_fixed() set it before calling device_attach().
Fixes: 6931007cc90b (ACPI / scan: Start matching drivers after trying scan handlers) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
ACPI / PCI root: Clear driver_data before failing enumeration
If a PCI host bridge cannot be enumerated due to an error in
pci_acpi_scan_root(), its ACPI device object's driver_data field
has to be cleared by acpi_pci_root_add() before freeing the
object pointed to by that field, or some later acpi_pci_find_root()
checks that should fail may succeed and cause quite a bit of
confusion to ensue.
Fix acpi_pci_root_add() to clear device->driver_data before
returning an error code as appropriate.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Toshi Kani <toshi.kani@hp.com>
Since the PCI host bridge scan handler does not set hotplug.enabled,
the check of it in acpi_bus_device_eject() effectively prevents the
root bridge hot removal from working after commit a3b1b1ef78cd
(ACPI / hotplug: Merge device hot-removal routines). However, that
check is not necessary, because the other acpi_bus_device_eject()
users, acpi_hotplug_notify_cb and acpi_eject_store(), do the same
check by themselves before executing that function.
For this reason, remove the scan handler check from
acpi_bus_device_eject() to make PCI hot bridge hot removal work
again.
Fixes: a3b1b1ef78cd (ACPI / hotplug: Merge device hot-removal routines) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Toshi Kani <toshi.kani@hp.com>
Linus Torvalds [Mon, 18 Nov 2013 23:56:13 +0000 (15:56 -0800)]
Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog changes from Wim Van Sebroeck:
- addition of MOXA ART watchdog driver (moxart_wdt)
- addition of CSR SiRFprimaII and SiRFatlasVI watchdog driver
(sirfsoc_wdt)
- addition of ralink watchdog driver (rt2880_wdt)
- various fixes and cleanups (__user annotation, ioctl return codes,
removal of redundant of_match_ptr, removal of unnecessary
amba_set_drvdata(), use allocated buffer for usb_control_msg, ...)
- removal of MODULE_ALIAS_MISCDEV statements
- watchdog related DT bindings
- first set of improvements on the w83627hf_wdt driver
* git://www.linux-watchdog.org/linux-watchdog: (26 commits)
watchdog: w83627hf: Use helper functions to access superio registers
watchdog: w83627hf: Enable watchdog device only if not already enabled
watchdog: w83627hf: Enable watchdog only once
watchdog: w83627hf: Convert to watchdog infrastructure
watchdog: omap_wdt: raw read and write endian fix
watchdog: sirf: don't depend on dummy value of CLOCK_TICK_RATE
watchdog: pcwd_usb: overflow in usb_pcwd_send_command()
watchdog: rt2880_wdt: fix return value check in rt288x_wdt_probe()
watchdog: watchdog_core: Fix a trivial typo
watchdog: dw: Enable OF support for DW watchdog timer
watchdog: Get rid of MODULE_ALIAS_MISCDEV statements
watchdog: ts72xx_wdt: Propagate return value from timeout_to_regval
watchdog: pcwd_usb: Use allocated buffer for usb_control_msg
watchdog: sp805_wdt: Remove unnecessary amba_set_drvdata()
watchdog: sirf: add watchdog driver of CSR SiRFprimaII and SiRFatlasVI
watchdog: Remove redundant of_match_ptr
watchdog: ts72xx_wdt: cleanup return codes in ioctl
documentation/devicetree: Move DT bindings from gpio to watchdog
watchdog: add ralink watchdog driver
watchdog: Add MOXA ART watchdog driver
...
Linus Torvalds [Mon, 18 Nov 2013 23:50:07 +0000 (15:50 -0800)]
Merge branch 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c changes from Wolfram Sang:
- new drivers for exynos5, bcm kona, and st micro
- bigger overhauls for drivers mxs and rcar
- typical driver bugfixes, cleanups, improvements
- got rid of the superfluous 'driver' member in i2c_client struct This
touches a few drivers in other subsystems. All acked.
* 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (38 commits)
i2c: bcm-kona: fix error return code in bcm_kona_i2c_probe()
i2c: i2c-eg20t: do not print error message in syslog if no ACK received
i2c: bcm-kona: Introduce Broadcom I2C Driver
i2c: cbus-gpio: Fix device tree binding
i2c: wmt: add missing clk_disable_unprepare() on error
i2c: designware: add new ACPI IDs
i2c: i801: Add Device IDs for Intel Wildcat Point-LP PCH
i2c: exynos5: Remove incorrect clk_disable_unprepare
i2c: i2c-st: Add ST I2C controller
i2c: exynos5: add High Speed I2C controller driver
i2c: rcar: fixup rcar type naming
i2c: scmi: remove some bogus NULL checks
i2c: sh_mobile & rcar: Enable the driver on all ARM platforms
i2c: sh_mobile: Convert to clk_prepare/unprepare
i2c: mux: gpio: use reg value for i2c_add_mux_adapter
i2c: mux: gpio: use gpio_set_value_cansleep()
i2c: Include linux/of.h header
i2c: mxs: Fix PIO mode on i.MX23
i2c: mxs: Rework the PIO mode operation
i2c: mxs: distinguish i.MX23 and i.MX28 based I2C controller
...
Linus Torvalds [Mon, 18 Nov 2013 23:36:04 +0000 (15:36 -0800)]
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband/rdma updates from Roland Dreier:
- Re-enable flow steering verbs with new improved userspace ABI
- Fixes for slow connection due to GID lookup scalability
- IPoIB fixes
- Many fixes to HW drivers including mlx4, mlx5, ocrdma and qib
- Further improvements to SRP error handling
- Add new transport type for Cisco usNIC
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (66 commits)
IB/core: Re-enable create_flow/destroy_flow uverbs
IB/core: extended command: an improved infrastructure for uverbs commands
IB/core: Remove ib_uverbs_flow_spec structure from userspace
IB/core: Use a common header for uverbs flow_specs
IB/core: Make uverbs flow structure use names like verbs ones
IB/core: Rename 'flow' structs to match other uverbs structs
IB/core: clarify overflow/underflow checks on ib_create/destroy_flow
IB/ucma: Convert use of typedef ctl_table to struct ctl_table
IB/cm: Convert to using idr_alloc_cyclic()
IB/mlx5: Fix page shift in create CQ for userspace
IB/mlx4: Fix device max capabilities check
IB/mlx5: Fix list_del of empty list
IB/mlx5: Remove dead code
IB/core: Encorce MR access rights rules on kernel consumers
IB/mlx4: Fix endless loop in resize CQ
RDMA/cma: Remove unused argument and minor dead code
RDMA/ucma: Discard events for IDs not yet claimed by user space
IB/core: Add Cisco usNIC rdma node and transport types
RDMA/nes: Remove self-assignment from nes_query_qp()
IB/srp: Report receive errors correctly
...
Linus Torvalds [Mon, 18 Nov 2013 23:35:09 +0000 (15:35 -0800)]
Merge tag 'for-v3.13' of git://git.infradead.org/battery-2.6
Pull battery updates from Anton Vorontsov:
"Highlights:
- A new driver for TI BQ24735 Battery Chargers, courtesy of NVidia.
- Device tree bindings for TWL4030 chips.
- Random fixes and cleanups"
* tag 'for-v3.13' of git://git.infradead.org/battery-2.6:
pm2301-charger: Remove unneeded NULL checks
twl4030_charger: Add devicetree support
power_supply: Fix documentation for TEMP_*ALERT* properties
max17042_battery: Support regmap to access device's registers
max17042_battery: Use SIMPLE_DEV_PM_OPS
charger-manager : Replace kzalloc to devm_kzalloc and remove uneccessary code
bq2415x_charger: Fix max battery regulation voltage
tps65090-charger: Use "IS_ENABLED(CONFIG_OF)" for DT code
tps65090-charger: Drop devm_free_irq of devm_ allocated irq
power_supply: Add support for bq24735 charger
pm2301-charger: Staticize pm2xxx_charger_die_therm_mngt
pm2301-charger: Check return value of regulator_enable
ab8500-charger: Remove redundant break
ab8500-charger: Check return value of regulator_enable
isp1704_charger: Fix driver to work with changes introduced in v3.5