Javier González [Fri, 5 Jan 2018 13:16:17 +0000 (14:16 +0100)]
lightnvm: pblk: ensure kthread alloc. before kicking it
When creating the write thread, ensure that the kthread has been created
before initializing the timer responsible from kicking it. Otherwise, if
the kthread creation fails or gets killed from used space, we risk
kicking an empty thread structure.
Also, since the kthread creation can be interrupted form user space,
adapt the error path to not report an error when this happens, since it
is intentional that the instance creation is aborted.
Signed-off-by: Javier González <javier@cnexlabs.com>
Updated source to reflect the new timer_setup API. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:16 +0000 (14:16 +0100)]
lightnvm: pblk: do not log recovery read errors
On scan recovery, reads can fail. This happens because the first page
for each line is read in order to determined if the line has been used
(and thus needs to be recovered), or not. This can lead to "empty page"
read errors.
Since these errors are normal, do not log them, as they are confusing
when reviewing the logs.
Javier González [Fri, 5 Jan 2018 13:16:14 +0000 (14:16 +0100)]
lightnvm: set target over-provision on create ioctl
Allow to set the over-provision percentage on target creation. In case
that the value is not provided, fall back to the default value set by
the target.
In pblk, set the default OP to 11% of the total size of the device
Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:13 +0000 (14:16 +0100)]
lightnvm: pblk: use exact free block counter in RL
Until now, pblk's rate-limiter has used a heuristic to reserve space for
GC I/O given that the over-provision area was fixed.
In preparation for allowing to define the over-provision area on target
creation, define a dedicated free_block counter in the rate-limiter to
track the number of blocks being used for user data.
Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Fri, 5 Jan 2018 13:16:12 +0000 (14:16 +0100)]
lightnvm: pblk: remove pblk_gc_stop
pblk_gc_stop just sets pblk->gc->gc_active to zero, ignoring
the flush parameter. This is plain confusing, so remove the
function and set the gc active flag at the call points instead.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Fri, 5 Jan 2018 13:16:10 +0000 (14:16 +0100)]
lightnvm: pblk: clear flush point on completed writes
Move completion of syncs and clearing of flush points to the
write completion path - this ensures that the data has been
comitted to the media before completing bios containing syncs.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hans Holmberg [Fri, 5 Jan 2018 13:16:08 +0000 (14:16 +0100)]
lightnvm: pblk: refactor emeta consistency check
Currently pblk_recov_get_lba list does two separate things:
it checks the consistency of the emeta and extracts the lba list.
This patch separates the consistency check to make the code easier
to read and to prepare for version checks of the line emeta
persistent data format version.
Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Javier González [Fri, 5 Jan 2018 13:16:06 +0000 (14:16 +0100)]
lightnvm: pblk: compress and reorder helper functions
Through time, we have generated some redundant helper functions.
Refactor them to eliminate redundant and unnecessary code. Also, reorder
them to improve readability
Javier González [Fri, 5 Jan 2018 13:16:05 +0000 (14:16 +0100)]
lightnvm: guarantee target unique name across devs.
Until now, target unique naming is only guaranteed per device. This is
ok from a lightnvm perspective, but not from a sysfs one, since groups
will collide regardless of the underlying device.
Check that names are unique across all lightnvm-capable devices.
Matias Bjørling [Fri, 5 Jan 2018 13:16:02 +0000 (14:16 +0100)]
lightnvm: remove lower page tables
The lower page table is unused. All page tables reported by 1.2
devices are all reporting a sequential 1:1 page mapping. This is
also not used going forward with the 2.0 revision.
Javier González [Fri, 5 Jan 2018 13:16:01 +0000 (14:16 +0100)]
lightnvm: remove unnecessary field from nvm_rq
Remove the wait filed in nvm_rq. It is not used anymore, as targets rely
on the functionality provided by the LightNVM subsystem when sending
sync I/O.
Matias Bjørling [Fri, 5 Jan 2018 13:15:57 +0000 (14:15 +0100)]
null_blk: remove lightnvm support
With rrpc to be removed, the null_blk lightnvm support is no longer
functional. Remove the lightnvm implementation and maybe add it to
another module in the future if someone takes on the challenge.
Liu Bo [Fri, 5 Jan 2018 07:09:06 +0000 (00:09 -0700)]
blk-mq: remove confusing comment of blk_mq_sched_dispatch_requests
Commit de1482974080
("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
changes the function to return bool type, and then commit 1f460b63d4b3
("blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE")
changes it back to void, but the comment remains.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 14 Nov 2017 17:24:58 +0000 (10:24 -0700)]
blk-mq: improve heavily contended tag case
Even with a number of waitqueues, we can get into a situation where we
are heavily contended on the waitqueue lock. I got a report on spc1
where we're spending seconds doing this. Arguably the use case is nasty,
I reproduce it with one device and 1000 threads banging on the device.
But that doesn't mean we shouldn't be handling it better.
What ends up happening is that a thread will fail to get a tag, add
itself to the waitqueue, and subsequently get woken up when a tag is
freed - only to find itself going back to sleep on the waitqueue.
Instead of waking all threads, use an exclusive wait and wake up our
sbitmap batch count instead. This seems to work well for me (massive
improvement for this use case), and it survives basic testing. But I
haven't fully verified it yet.
An additional improvement is running the queue and checking for a new
tag BEFORE needing to add ourselves to the waitqueue.
SELinux runs with secureexec for all non-"noatsecure" domain transitions,
which means lots of processes end up hitting the stack hard-limit change
that was introduced in order to fix a race with prlimit(). That race fix
will need to be redesigned.
Reported-by: Laura Abbott <labbott@redhat.com> Reported-by: Tomáš Trnka <trnka@scm.com> Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 17 Dec 2017 21:57:08 +0000 (13:57 -0800)]
Merge branch 'WIP.x86-pti.base-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull Page Table Isolation (PTI) v4.14 backporting base tree from Ingo Molnar:
"This tree contains the v4.14 PTI backport preparatory tree, which
consists of four merges of upstream trees and 7 cherry-picked commits,
which the upcoming PTI work depends on"
NOTE! The resulting tree is exactly the same as the original base tree
(ie the diff between this commit and its immediate first parent is
empty).
The only reason for this merge is literally to have a common point for
the actual PTI changes so that the commits can be shared in both the
4.15 and 4.14 trees.
* 'WIP.x86-pti.base-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/kasan: Don't use vmemmap_populate() to initialize shadow
locking/barriers: Convert users of lockless_dereference() to READ_ONCE()
locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()
bpf: fix build issues on um due to mising bpf_perf_event.h
perf/x86: Enable free running PEBS for REGS_USER/INTR
x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD
x86/cpufeature: Add User-Mode Instruction Prevention definitions
Linus Torvalds [Sun, 17 Dec 2017 21:54:31 +0000 (13:54 -0800)]
Merge branch 'WIP.x86-pti.base.prep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull Page Table Isolation (PTI) preparatory tree from Ingo Molnar:
"This does a rename to free up linux/pti.h to be used by the upcoming
page table isolation feature"
* 'WIP.x86-pti.base.prep-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
drivers/misc/intel/pti: Rename the header file to free up the namespace
Arnd Bergmann [Fri, 10 Nov 2017 14:57:21 +0000 (15:57 +0100)]
cramfs: fix MTD dependency
With CONFIG_MTD=m and CONFIG_CRAMFS=y, we now get a link failure:
fs/cramfs/inode.o: In function `cramfs_mount': inode.c:(.text+0x220): undefined reference to `mount_mtd'
fs/cramfs/inode.o: In function `cramfs_mtd_fill_super':
inode.c:(.text+0x6d8): undefined reference to `mtd_point'
inode.c:(.text+0xae4): undefined reference to `mtd_unpoint'
This adds a more specific Kconfig dependency to avoid the broken
configuration.
Alternatively we could make CRAMFS itself depend on "MTD || !MTD" with a
similar result.
Fixes: 99c18ce580c6 ("cramfs: direct memory access support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 17 Dec 2017 20:18:35 +0000 (12:18 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"The alloc_super() one is a regression in this merge window, lazytime
thing is older..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: Handle lazytime in do_mount()
alloc_super(): do ->s_umount initialization earlier
Linus Torvalds [Sun, 17 Dec 2017 20:14:33 +0000 (12:14 -0800)]
Merge tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix a regression which caused us to fail to interpret symlinks in very
ancient ext3 file system images.
Also fix two xfstests failures, one of which could cause an OOPS, plus
an additional bug fix caught by fuzz testing"
* tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix crash when a directory's i_size is too small
ext4: add missing error check in __ext4_new_inode()
ext4: fix fdatasync(2) after fallocate(2) operation
ext4: support fast symlinks from ext3 file systems
Andrey Ryabinin [Thu, 16 Nov 2017 01:36:35 +0000 (17:36 -0800)]
x86/mm/kasan: Don't use vmemmap_populate() to initialize shadow
[ Note, this is a Git cherry-pick of the following commit:
d17a1d97dc20: ("x86/mm/kasan: don't use vmemmap_populate() to initialize shadow")
... for easier x86 PTI code testing and back-porting. ]
The KASAN shadow is currently mapped using vmemmap_populate() since that
provides a semi-convenient way to map pages into init_top_pgt. However,
since that no longer zeroes the mapped pages, it is not suitable for
KASAN, which requires zeroed shadow memory.
Add kasan_populate_shadow() interface and use it instead of
vmemmap_populate(). Besides, this allows us to take advantage of
gigantic pages and use them to populate the shadow, which should save us
some memory wasted on page tables and reduce TLB pressure.
Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Potapenko <glider@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Will Deacon [Tue, 24 Oct 2017 10:22:47 +0000 (11:22 +0100)]
locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()
[ Note, this is a Git cherry-pick of the following commit:
76ebbe78f739 ("locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE()")
... for easier x86 PTI code testing and back-porting. ]
In preparation for the removal of lockless_dereference(), which is the
same as READ_ONCE() on all architectures other than Alpha, add an
implicit smp_read_barrier_depends() to READ_ONCE() so that it can be
used to head dependency chains on all architectures.
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1508840570-22169-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Borkmann [Tue, 12 Dec 2017 01:25:31 +0000 (02:25 +0100)]
bpf: fix build issues on um due to mising bpf_perf_event.h
[ Note, this is a Git cherry-pick of the following commit:
a23f06f06dbe ("bpf: fix build issues on um due to mising bpf_perf_event.h")
... for easier x86 PTI code testing and back-porting. ]
Since c895f6f703ad ("bpf: correct broken uapi for
BPF_PROG_TYPE_PERF_EVENT program type") um (uml) won't build
on i386 or x86_64:
[...]
CC init/main.o
In file included from ../include/linux/perf_event.h:18:0,
from ../include/linux/trace_events.h:10,
from ../include/trace/syscall.h:7,
from ../include/linux/syscalls.h:82,
from ../init/main.c:20:
../include/uapi/linux/bpf_perf_event.h:11:32: fatal error:
asm/bpf_perf_event.h: No such file or directory #include
<asm/bpf_perf_event.h>
[...]
Lets add missing bpf_perf_event.h also to um arch. This seems
to be the only one still missing.
Fixes: c895f6f703ad ("bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type") Reported-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Richard Weinberger <richard@sigma-star.at> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Richard Weinberger <richard@sigma-star.at> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Richard Weinberger <richard@nod.at> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andi Kleen [Thu, 31 Aug 2017 21:46:30 +0000 (14:46 -0700)]
perf/x86: Enable free running PEBS for REGS_USER/INTR
[ Note, this is a Git cherry-pick of the following commit:
a47ba4d77e12 ("perf/x86: Enable free running PEBS for REGS_USER/INTR")
... for easier x86 PTI code testing and back-porting. ]
Currently free running PEBS is disabled when user or interrupt
registers are requested. Most of the registers are actually
available in the PEBS record and can be supported.
So we just need to check for the supported registers and then
allow it: it is all except for the segment register.
For user registers this only works when the counter is limited
to ring 3 only, so this also needs to be checked.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170831214630.21892-1-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rudolf Marek [Tue, 28 Nov 2017 21:01:06 +0000 (22:01 +0100)]
x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD
[ Note, this is a Git cherry-pick of the following commit:
2b67799bdf25 ("x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD")
... for easier x86 PTI code testing and back-porting. ]
The latest AMD AMD64 Architecture Programmer's Manual
adds a CPUID feature XSaveErPtr (CPUID_Fn80000008_EBX[2]).
If this feature is set, the FXSAVE, XSAVE, FXSAVEOPT, XSAVEC, XSAVES
/ FXRSTOR, XRSTOR, XRSTORS always save/restore error pointers,
thus making the X86_BUG_FXSAVE_LEAK workaround obsolete on such CPUs.
Signed-Off-By: Rudolf Marek <r.marek@assembler.cz> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Tested-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Link: https://lkml.kernel.org/r/bdcebe90-62c5-1f05-083c-eba7f08b2540@assembler.cz Signed-off-by: Ingo Molnar <mingo@kernel.org>
... for easier x86 PTI code testing and back-porting. ]
User-Mode Instruction Prevention is a security feature present in new
Intel processors that, when set, prevents the execution of a subset of
instructions if such instructions are executed in user mode (CPL > 0).
Attempting to execute such instructions causes a general protection
exception.
The subset of instructions comprises:
* SGDT - Store Global Descriptor Table
* SIDT - Store Interrupt Descriptor Table
* SLDT - Store Local Descriptor Table
* SMSW - Store Machine Status Word
* STR - Store Task Register
This feature is also added to the list of disabled-features to allow
a cleaner handling of build-time configuration.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chen Yucong <slaoub@gmail.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Huang Rui <ray.huang@amd.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi V. Shankar <ravi.v.shankar@intel.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: ricardo.neri@intel.com Link: http://lkml.kernel.org/r/1509935277-22138-7-git-send-email-ricardo.neri-calderon@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar [Tue, 5 Dec 2017 13:14:47 +0000 (14:14 +0100)]
drivers/misc/intel/pti: Rename the header file to free up the namespace
We'd like to use the 'PTI' acronym for 'Page Table Isolation' - free up the
namespace by renaming the <linux/pti.h> driver header to <linux/intel-pti.h>.
(Also standardize the header guard name while at it.)
Linus Torvalds [Sat, 16 Dec 2017 21:43:08 +0000 (13:43 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"More fixes from testing done on the rc kernel, including more SELinux
testing. Looking forward, lockdep found regression today in ipoib
which is still being fixed.
Summary:
- Fix for SELinux on the umad SMI path. Some old hardware does not
fill the PKey properly exposing another bug in the newer SELinux
code.
- Check the input port as we can exceed array bounds from this user
supplied value
- Users are unable to use the hash field support as they want due to
incorrect checks on the field restrictions, correct that so the
feature works as intended
- User triggerable oops in the NETLINK_RDMA handler
- cxgb4 driver fix for a bad interaction with CQ flushing in iser
caused by patches in this merge window, and bad CQ flushing during
normal close.
- Unbalanced memalloc_noio in ipoib in an error path"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/ipoib: Restore MM behavior in case of tx_ring allocation failure
iw_cxgb4: only insert drain cqes if wq is flushed
iw_cxgb4: only clear the ARMED bit if a notification is needed
RDMA/netlink: Fix general protection fault
IB/mlx4: Fix RSS hash fields restrictions
IB/core: Don't enforce PKey security on SMI MADs
IB/core: Bound check alternate path port number
Linus Torvalds [Sat, 16 Dec 2017 21:12:53 +0000 (13:12 -0800)]
Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"This has two stable bugfixes, one to fix a BUG_ON() when
nfs_commit_inode() is called with no outstanding commit requests and
another to fix a race in the SUNRPC receive codepath.
Additionally, there are also fixes for an NFS client deadlock and an
xprtrdma performance regression.
Summary:
Stable bugfixes:
- NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
commit in the case that there were no commit requests.
- SUNRPC: Fix a race in the receive code path
Other fixes:
- NFS: Fix a deadlock in nfs client initialization
- xprtrdma: Fix a performance regression for small IOs"
* tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
SUNRPC: Fix a race in the receive code path
nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
xprtrdma: Spread reply processing over more CPUs
nfs: fix a deadlock in nfs client initialization
We'll probably need to revisit this, but basically we should not
complicate the get_user_pages_fast() case, and checking the actual page
table protection key bits will require more care anyway, since the
protection keys depend on the exact state of the VM in question.
Particularly when doing a "remote" page lookup (ie in somebody elses VM,
not your own), you need to be much more careful than this was. Dave
Hansen says:
"So, the underlying bug here is that we now a get_user_pages_remote()
and then go ahead and do the p*_access_permitted() checks against the
current PKRU. This was introduced recently with the addition of the
new p??_access_permitted() calls.
We have checks in the VMA path for the "remote" gups and we avoid
consulting PKRU for them. This got missed in the pkeys selftests
because I did a ptrace read, but not a *write*. I also didn't
explicitly test it against something where a COW needed to be done"
It's also not entirely clear that it makes sense to check the protection
key bits at this level at all. But one possible eventual solution is to
make the get_user_pages_fast() case just abort if it sees protection key
bits set, which makes us fall back to the regular get_user_pages() case,
which then has a vma and can do the check there if we want to.
We'll see.
Somewhat related to this all: what we _do_ want to do some day is to
check the PAGE_USER bit - it should obviously always be set for user
pages, but it would be a good check to have back. Because we have no
generic way to test for it, we lost it as part of moving over from the
architecture-specific x86 GUP implementation to the generic one in
commit e585513b76f7 ("x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation").
Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot.
2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik
Brueckner.
3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes
Berg.
4) Add missing barriers to ptr_ring, from Michael S. Tsirkin.
5) Don't advertise gigabit in sh_eth when not available, from Thomas
Petazzoni.
6) Check network namespace when delivering to netlink taps, from Kevin
Cernekee.
7) Kill a race in raw_sendmsg(), from Mohamed Ghannam.
8) Use correct address in TCP md5 lookups when replying to an incoming
segment, from Christoph Paasch.
9) Add schedule points to BPF map alloc/free, from Eric Dumazet.
10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also
from Eric Dumazet.
11) Fix SKB leak in tipc, from Jon Maloy.
12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz.
13) SKB leak fix in skB_complete_tx_timestamp(), from Willem de Bruijn.
14) Add some new qmi_wwan device IDs, from Daniele Palmas.
15) Fix static key imbalance in ingress qdisc, from Jiri Pirko.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
net: qcom/emac: Reduce timeout for mdio read/write
net: sched: fix static key imbalance in case of ingress/clsact_init error
net: sched: fix clsact init error path
ip_gre: fix wrong return value of erspan_rcv
net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
pkt_sched: Remove TC_RED_OFFLOADED from uapi
net: sched: Move to new offload indication in RED
net: sched: Add TCA_HW_OFFLOAD
net: aquantia: Increment driver version
net: aquantia: Fix typo in ethtool statistics names
net: aquantia: Update hw counters on hw init
net: aquantia: Improve link state and statistics check interval callback
net: aquantia: Fill in multicast counter in ndev stats from hardware
net: aquantia: Fill ndev stat couters from hardware
net: aquantia: Extend stat counters to 64bit values
net: aquantia: Fix hardware DMA stream overload on large MRRS
net: aquantia: Fix actual speed capabilities reporting
sock: free skb in skb_complete_tx_timestamp on error
s390/qeth: update takeover IPs after configuration change
s390/qeth: lock IP table while applying takeover changes
...
Linus Torvalds [Fri, 15 Dec 2017 21:03:25 +0000 (13:03 -0800)]
Merge tag 'usb-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB fixes for 4.15-rc4.
There is the usual handful gadget/dwc2/dwc3 fixes as always, for
reported issues. But the most important things in here is the core fix
from Alan Stern to resolve a nasty security bug (my first attempt is
reverted, Alan's was much cleaner), as well as a number of usbip fixes
from Shuah Khan to resolve those reported security issues.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: core: prevent malicious bNumInterfaces overflow
Revert "USB: core: only clean up what we allocated"
USB: core: only clean up what we allocated
Revert "usb: gadget: allow to enable legacy drivers without USB_ETH"
usb: gadget: webcam: fix V4L2 Kconfig dependency
usb: dwc2: Fix TxFIFOn sizes and total TxFIFO size issues
usb: dwc3: gadget: Fix PCM1 for ISOC EP with ep->mult less than 3
usb: dwc3: of-simple: set dev_pm_ops
usb: dwc3: of-simple: fix missing clk_disable_unprepare
usb: dwc3: gadget: Wait longer for controller to end command processing
usb: xhci: fix TDS for MTK xHCI1.1
xhci: Don't add a virt_dev to the devs array before it's fully allocated
usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer
usbip: prevent vhci_hcd driver from leaking a socket pointer address
usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input
usbip: fix stub_rx: get_pipe() to validate endpoint number
tools/usbip: fixes potential (minor) "buffer overflow" (detected on recent gcc with -Werror)
USB: uas and storage: Add US_FL_BROKEN_FUA for another JMicron JMS567 ID
usb: musb: da8xx: fix babble condition handling
Linus Torvalds [Fri, 15 Dec 2017 20:59:48 +0000 (12:59 -0800)]
Merge tag 'staging-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging fixes from Greg KH:
"Here are some small staging driver fixes for 4.15-rc4.
One patch for the ccree driver to prevent an unitialized value from
being returned to a caller, and the other fixes a logic error in the
pi433 driver"
* tag 'staging-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: pi433: Fixes issue with bit shift in rf69_get_modulation
staging: ccree: Uninitialized return in ssi_ahash_import()
Linus Torvalds [Fri, 15 Dec 2017 20:56:23 +0000 (12:56 -0800)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio regression fixes from Michael Tsirkin:
"Fixes two issues in the latest kernel"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_mmio: fix devm cleanup
ptr_ring: fix up after recent ptr_ring changes
Linus Torvalds [Fri, 15 Dec 2017 20:53:37 +0000 (12:53 -0800)]
Merge tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- fix a particularly nasty DM core bug in a 4.15 refcount_t conversion.
- fix various targets to dm_register_target after module __init
resources created; otherwise racing lvm2 commands could result in a
NULL pointer during initialization of associated DM kernel module.
- fix regression in bio-based DM multipath queue_if_no_path handling.
- fix DM bufio's shrinker to reclaim more than one buffer per scan.
* tag 'for-4.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm bufio: fix shrinker scans when (nr_to_scan < retain_target)
dm mpath: fix bio-based multipath queue_if_no_path handling
dm: fix various targets to dm_register_target after module __init resources created
dm table: fix regression from improper dm_dev_internal.count refcount_t conversion
Linus Torvalds [Fri, 15 Dec 2017 20:51:42 +0000 (12:51 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The most important one is the bfa fix because it's easy to oops the
kernel with this driver (this includes the commit that corrects the
compiler warning in the original), a regression in the new timespec
conversion in aacraid and a regression in the Fibre Channel ELS
handling patch.
The other three are a theoretical problem with termination in the
vendor/host matching code and a use after free in lpfc.
The additional patches are a fix for an I/O hang in the mq code under
certain circumstances and a rare oops in some debugging code"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: core: Fix a scsi_show_rq() NULL pointer dereference
scsi: MAINTAINERS: change FCoE list to linux-scsi
scsi: libsas: fix length error in sas_smp_handler()
scsi: bfa: fix type conversion warning
scsi: core: run queue if SCSI device queue isn't ready and queue is idle
scsi: scsi_devinfo: cleanly zero-pad devinfo strings
scsi: scsi_devinfo: handle non-terminated strings
scsi: bfa: fix access to bfad_im_port_s
scsi: aacraid: address UBSAN warning regression
scsi: libfc: fix ELS request handling
scsi: lpfc: Use after free in lpfc_rq_buf_free()
Linus Torvalds [Fri, 15 Dec 2017 20:49:54 +0000 (12:49 -0800)]
Merge tag 'mmc-v4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"A couple of MMC fixes:
- fix use of uninitialized drv_typ variable
- apply NO_CMD23 quirk to some specific SD cards to make them work"
* tag 'mmc-v4.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: core: apply NO_CMD23 quirk to some specific cards
mmc: core: properly init drv_type
Linus Torvalds [Fri, 15 Dec 2017 20:46:48 +0000 (12:46 -0800)]
Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
- fix incomplete syncing of filesystem
- fix regression in readdir on ovl over 9p
- only follow redirects when needed
- misc fixes and cleanups
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix overlay: warning prefix
ovl: Use PTR_ERR_OR_ZERO()
ovl: Sync upper dirty data when syncing overlayfs
ovl: update ctx->pos on impure dir iteration
ovl: Pass ovl_get_nlink() parameters in right order
ovl: don't follow redirects if redirect_dir=off
Hemanth Puranik [Fri, 15 Dec 2017 14:35:58 +0000 (20:05 +0530)]
net: qcom/emac: Reduce timeout for mdio read/write
Currently mdio read/write takes around ~115us as the timeout
between status check is set to 100us.
By reducing the timeout to 1us mdio read/write takes ~15us to
complete. This improves the link up event response.
Signed-off-by: Hemanth Puranik <hpuranik@codeaurora.org> Acked-by: Timur Tabi <timur@codeaurora.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 15 Dec 2017 20:44:49 +0000 (12:44 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"There are some significant fixes in here for FP state corruption,
hardware access/dirty PTE corruption and an erratum workaround for the
Falkor CPU.
I'm hoping that things finally settle down now, but never say never...
Summary:
- Fix FPSIMD context switch regression introduced in -rc2
- Fix ABI break with SVE CPUID register reporting
- Fix use of uninitialised variable
- Fixes to hardware access/dirty management and sanity checking
- CPU erratum workaround for Falkor CPUs
- Fix reporting of writeable+executable mappings
- Fix signal reporting for RAS errors"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: fpsimd: Fix copying of FP state from signal frame into task struct
arm64/sve: Report SVE to userspace via CPUID only if supported
arm64: fix CONFIG_DEBUG_WX address reporting
arm64: fault: avoid send SIGBUS two times
arm64: hw_breakpoint: Use linux/uaccess.h instead of asm/uaccess.h
arm64: Add software workaround for Falkor erratum 1041
arm64: Define cputype macros for Falkor CPU
arm64: mm: Fix false positives in set_pte_at access/dirty race detection
arm64: mm: Fix pte_mkclean, pte_mkdirty semantics
arm64: Initialise high_memory global variable earlier
Jiri Pirko [Fri, 15 Dec 2017 11:40:13 +0000 (12:40 +0100)]
net: sched: fix static key imbalance in case of ingress/clsact_init error
Move static key increments to the beginning of the init function
so they pair 1:1 with decrements in ingress/clsact_destroy,
which is called in case ingress/clsact_init fails.
Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 15 Dec 2017 11:40:12 +0000 (12:40 +0100)]
net: sched: fix clsact init error path
Since in qdisc_create, the destroy op is called when init fails, we
don't do cleanup in init and leave it up to destroy.
This fixes use-after-free when trying to put already freed block.
Fixes: 6e40cf2d4dee ("net: sched: use extended variants of block_get/put in ingress and clsact qdiscs") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 15 Dec 2017 20:14:33 +0000 (12:14 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- fix the s2ram regression related to confusion around segment
register restoration, plus related cleanups that make the code more
robust
- a guess-unwinder Kconfig dependency fix
- an isoimage build target fix for certain tool chain combinations
- instruction decoder opcode map fixes+updates, and the syncing of
the kernel decoder headers to the objtool headers
- a kmmio tracing fix
- two 5-level paging related fixes
- a topology enumeration fix on certain SMP systems"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Resync objtool's instruction decoder source code copy with the kernel's latest version
x86/decoder: Fix and update the opcodes map
x86/power: Make restore_processor_context() sane
x86/power/32: Move SYSENTER MSR restoration to fix_processor_context()
x86/power/64: Use struct desc_ptr for the IDT in struct saved_context
x86/unwinder/guess: Prevent using CONFIG_UNWINDER_GUESS=y with CONFIG_STACKDEPOT=y
x86/build: Don't verify mtools configuration file for isoimage
x86/mm/kmmio: Fix mmiotrace for page unaligned addresses
x86/boot/compressed/64: Print error if 5-level paging is not supported
x86/boot/compressed/64: Detect and handle 5-level paging at boot-time
x86/smpboot: Do not use smp_num_siblings in __max_logical_packages calculation
Linus Torvalds [Fri, 15 Dec 2017 19:44:59 +0000 (11:44 -0800)]
Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
"Misc fixes:
- Fix a S390 boot hang that was caused by the lock-break logic.
Remove lock-break to begin with, as review suggested it was
unreasonably fragile and our confidence in its continued good
health is lower than our confidence in its removal.
- Remove the lockdep cross-release checking code for now, because of
unresolved false positive warnings. This should make lockdep work
well everywhere again.
- Get rid of the final (and single) ACCESS_ONCE() straggler and
remove the API from v4.15.
- Fix a liblockdep build warning"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/lib/lockdep: Add missing declaration of 'pr_cont()'
checkpatch: Remove ACCESS_ONCE() warning
compiler.h: Remove ACCESS_ONCE()
tools/include: Remove ACCESS_ONCE()
tools/perf: Convert ACCESS_ONCE() to READ_ONCE()
locking/lockdep: Remove the cross-release locking checks
locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y
locking/core: Fix deadlock during boot on systems with GENERIC_LOCKBREAK
Linus Torvalds [Fri, 15 Dec 2017 19:40:24 +0000 (11:40 -0800)]
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a crash fix for an ARM SoC platform, and kernel-doc
warnings fixes"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Do not pull from current CPU if only one CPU to pull
sched/core: Fix kernel-doc warnings after code movement
Linus Torvalds [Fri, 15 Dec 2017 19:32:09 +0000 (11:32 -0800)]
Merge tag 'for-linus-4.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two minor fixes for running as Xen dom0:
- when built as 32 bit kernel on large machines the Xen LAPIC
emulation should report a rather modern LAPIC in order to support
enough APIC-Ids
- The Xen LAPIC emulation is needed for dom0 only, so build it only
for kernels supporting to run as Xen dom0"
* tag 'for-linus-4.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: XEN_ACPI_PROCESSOR is Dom0-only
x86/Xen: don't report ancient LAPIC version
Scott Mayhew [Fri, 8 Dec 2017 21:00:12 +0000 (16:00 -0500)]
nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
If there were no commit requests, then nfs_commit_inode() should not
wait on the commit or mark the inode dirty, otherwise the following
BUG_ON can be triggered:
Chuck Lever [Mon, 4 Dec 2017 19:04:04 +0000 (14:04 -0500)]
xprtrdma: Spread reply processing over more CPUs
Commit d8f532d20ee4 ("xprtrdma: Invoke rpcrdma_reply_handler
directly from RECV completion") introduced a performance regression
for NFS I/O small enough to not need memory registration. In multi-
threaded benchmarks that generate primarily small I/O requests,
IOPS throughput is reduced by nearly a third. This patch restores
the previous level of throughput.
Because workqueues are typically BOUND (in particular ib_comp_wq,
nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
to aggregate on the CPU that is handling Receive completions.
The usual approach to addressing this problem is to create a QP
and CQ for each CPU, and then schedule transactions on the QP
for the CPU where you want the transaction to complete. The
transaction then does not require an extra context switch during
completion to end up on the same CPU where the transaction was
started.
This approach doesn't work for the Linux NFS/RDMA client because
currently the Linux NFS client does not support multiple connections
per client-server pair, and the RDMA core API does not make it
straightforward for ULPs to determine which CPU is responsible for
handling Receive completions for a CQ.
So for the moment, record the CPU number in the rpcrdma_req before
the transport sends each RPC Call. Then during Receive completion,
queue the RPC completion on that same CPU.
Additionally, move all RPC completion processing to the deferred
handler so that even RPCs with simple small replies complete on
the CPU that sent the corresponding RPC Call.
Fixes: d8f532d20ee4 ("xprtrdma: Invoke rpcrdma_reply_handler ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Scott Mayhew [Tue, 5 Dec 2017 18:55:44 +0000 (13:55 -0500)]
nfs: fix a deadlock in nfs client initialization
The following deadlock can occur between a process waiting for a client
to initialize in while walking the client list during nfsv4 server trunking
detection and another process waiting for the nfs_clid_init_mutex so it
can initialize that client:
Process 1 Process 2
--------- ---------
spin_lock(&nn->nfs_client_lock);
list_add_tail(&CLIENTA->cl_share_link,
&nn->nfs_client_list);
spin_unlock(&nn->nfs_client_lock);
spin_lock(&nn->nfs_client_lock);
list_add_tail(&CLIENTB->cl_share_link,
&nn->nfs_client_list);
spin_unlock(&nn->nfs_client_lock);
mutex_lock(&nfs_clid_init_mutex);
nfs41_walk_client_list(clp, result, cred);
nfs_wait_client_init_complete(CLIENTA);
(waiting for nfs_clid_init_mutex)
Make sure nfs_match_client() only evaluates clients that have completed
initialization in order to prevent that deadlock.
This patch also fixes v4.0 trunking behavior by not marking the client
NFS_CS_READY until the clientid has been confirmed.
Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Haishuang Yan [Fri, 15 Dec 2017 02:46:16 +0000 (10:46 +0800)]
ip_gre: fix wrong return value of erspan_rcv
If pskb_may_pull return failed, return PACKET_REJECT instead of -ENOMEM.
Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN") Cc: William Tu <u9012063@gmail.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: sched: Make qdisc offload uapi uniform
Several qdiscs can already be offloaded to hardware, but there's an
inconsistecy in regard to the uapi through which they indicate such
an offload is taking place - indication is passed to the user via
TCA_OPTIONS where each qdisc retains private logic for setting it.
The recent addition of offloading to RED in 602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc") caused
the addition of yet another uapi field for this purpose -
TC_RED_OFFLOADED.
For clarity and prevention of bloat in the uapi we want to eliminate
said added uapi, replacing it with a common mechanism that can be used
to reflect offload status of the various qdiscs.
The first patch introduces TCA_HW_OFFLOAD as the generic message meant
for this purpose. The second changes the current RED implementation into
setting the internal bits necessary for passing it, and the third removes
TC_RED_OFFLOADED as its no longer needed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Thu, 14 Dec 2017 13:54:31 +0000 (15:54 +0200)]
pkt_sched: Remove TC_RED_OFFLOADED from uapi
Following the previous patch, RED is now using the new uniform uapi
for indicating it's offloaded. As a result, TC_RED_OFFLOADED is no
longer utilized by kernel and can be removed [as it's still not
part of any stable release].
Fixes: 602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc") Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Thu, 14 Dec 2017 13:54:30 +0000 (15:54 +0200)]
net: sched: Move to new offload indication in RED
Let RED utilize the new internal flag, TCQ_F_OFFLOADED,
to mark a given qdisc as offloaded instead of using a dedicated
indication.
Also, change internal logic into looking at said flag when possible.
Fixes: 602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc") Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Thu, 14 Dec 2017 13:54:29 +0000 (15:54 +0200)]
net: sched: Add TCA_HW_OFFLOAD
Qdiscs can be offloaded to HW, but current implementation isn't uniform.
Instead, qdiscs either pass information about offload status via their
TCA_OPTIONS or omit it altogether.
Introduce a new attribute - TCA_HW_OFFLOAD that would form a uniform
uAPI for the offloading status of qdiscs.
Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The patchset contains important hardware fix for machines with large MRRS
and couple of improvement in stats and capabilities reporting
patch v3:
- Fixed patch #7 after Andrew's finding. NIC level stats actually
have to be cleaned only on hw struct creation (and this is done
in kzalloc). On each hwinit we only have to reset link state
to make sure hw stats update will not increment nic stats during init.
patch v2:
- split into more detailed commits
Comment from David on wrong defines case will be submitted separately later
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:46 +0000 (12:34 +0300)]
net: aquantia: Update hw counters on hw init
On very first start we should read out current HW counter values
to make diff based calculations later.
This also should be done each time NIC gets down/up or wakes up
after sleep state. We reset link state explicitly to prevent diffs
from being summed this first time.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:43 +0000 (12:34 +0300)]
net: aquantia: Fill ndev stat couters from hardware
Originally they were filled from ring sw counters.
These sometimes incorrectly calculate byte and packet amounts
when using LRO/LSO and jumboframes. Filling ndev counters from
hardware makes them precise.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:42 +0000 (12:34 +0300)]
net: aquantia: Extend stat counters to 64bit values
Device hardware provides only 32bit counters. Using these directly
causes byte counters to overflow soon. A separate nic level structure
with 64 bit counters is now used to collect incrementally all the stats
and report these counters to ethtool stats and ndev stats.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:41 +0000 (12:34 +0300)]
net: aquantia: Fix hardware DMA stream overload on large MRRS
Systems with large MRRS on device (2K, 4K) with high data rates and/or
large MTU, atlantic observes DMA packet buffer overflow. On some systems
that causes PCIe transaction errors, hardware NMIs or datapath freeze.
This patch
1) Limits MRRS from device side to 2K (thats maximum our hardware supports)
2) Limit maximum size of outstanding TX DMA data read requests. This makes
hardware buffers running fine.
Signed-off-by: Pavel Belous <pavel.belous@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 Dec 2017 09:34:40 +0000 (12:34 +0300)]
net: aquantia: Fix actual speed capabilities reporting
Different hardware device Ids correspond to different maximum speed
available. Extra checks were added for devices D108 and D109 to
remove unsupported speeds from these device capabilities list.
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Wed, 13 Dec 2017 19:41:06 +0000 (14:41 -0500)]
sock: free skb in skb_complete_tx_timestamp on error
skb_complete_tx_timestamp must ingest the skb it is passed. Call
kfree_skb if the skb cannot be enqueued.
Fixes: b245be1f4db1 ("net-timestamp: no-payload only sysctl") Fixes: 9ac25fc06375 ("net: fix socket refcounting in skb_complete_tx_timestamp()") Reported-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 15 Dec 2017 16:29:44 +0000 (11:29 -0500)]
Merge branch 's390-fixes'
Julian Wiedmann says:
====================
s390/qeth: fixes 2017-12-13
some more patches for 4.15, that fix multiple issues with IP Takeover
configuration in qeth.
Please queue them up for stable kernels as well (4.9 and newer).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 13 Dec 2017 17:56:32 +0000 (18:56 +0100)]
s390/qeth: update takeover IPs after configuration change
Any modification to the takeover IP-ranges requires that we re-evaluate
which IP addresses are takeover-eligible. Otherwise we might do takeover
for some addresses when we no longer should, or vice-versa.
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 13 Dec 2017 17:56:31 +0000 (18:56 +0100)]
s390/qeth: lock IP table while applying takeover changes
Modifying the flags of an IP addr object needs to be protected against
eg. concurrent removal of the same object from the IP table.
Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback") Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 13 Dec 2017 17:56:30 +0000 (18:56 +0100)]
s390/qeth: don't apply takeover changes to RXIP
When takeover is switched off, current code clears the 'TAKEOVER' flag on
all IPs. But the flag is also used for RXIP addresses, and those should
not be affected by the takeover mode.
Fix the behaviour by consistenly applying takover logic to NORMAL
addresses only.
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 13 Dec 2017 17:56:29 +0000 (18:56 +0100)]
s390/qeth: apply takeover changes when mode is toggled
Just as for an explicit enable/disable, toggling the takeover mode also
requires that the IP addresses get updated. Otherwise all IPs that were
added to the table before the mode-toggle, get registered with the old
settings.
Signed-off-by: Julian Wiedmann <jwi@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Will Deacon [Fri, 15 Dec 2017 16:07:22 +0000 (16:07 +0000)]
arm64: fpsimd: Fix copying of FP state from signal frame into task struct
Commit 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD
state after signals") fixed an issue reported in our FPSIMD signal
restore code but inadvertently introduced another issue which tends to
manifest as random SEGVs in userspace.
The problem is that when we copy the struct fpsimd_state from the kernel
stack (populated from the signal frame) into the struct held in the
current thread_struct, we blindly copy uninitialised stack into the
"cpu" field, which means that context-switching of the FP registers is
no longer reliable.
This patch fixes the problem by copying only the user_fpsimd member of
struct fpsimd_state. We should really rework the function prototypes
to take struct user_fpsimd_state * instead, but let's just get this
fixed for now.
Cc: Dave Martin <Dave.Martin@arm.com> Fixes: 9de52a755cfb6da5 ("arm64: fpsimd: Fix failure to restore FPSIMD state after signals") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
Yuval Mintz [Fri, 15 Dec 2017 07:44:21 +0000 (08:44 +0100)]
mlxsw: spectrum: Disable MAC learning for ovs port
Learning is currently enabled for ports which are OVS slaves -
even though OVS doesn't need this indication.
Since we're not associating a fid with the port, HW would continuously
notify driver of learned [& aged] MACs which would be logged as errors.
Fixes: 2b94e58df58c ("mlxsw: spectrum: Allow ports to work under OVS master") Signed-off-by: Yuval Mintz <yuvalm@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Steven Rostedt [Sat, 2 Dec 2017 18:04:54 +0000 (13:04 -0500)]
sched/rt: Do not pull from current CPU if only one CPU to pull
Daniel Wagner reported a crash on the BeagleBone Black SoC.
This is a single CPU architecture, and does not have a functional
arch_send_call_function_single_ipi() implementation which can crash
the kernel if that is called.
As it only has one CPU, it shouldn't be called, but if the kernel is
compiled for SMP, the push/pull RT scheduling logic now calls it for
irq_work if the one CPU is overloaded, it can use that function to call
itself and crash the kernel.
Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
only has a single CPU. But SCHED_FEAT is a constant if sched debugging
is turned off. Another fix can also be used, and this should also help
with normal SMP machines. That is, do not initiate the pull code if
there's only one RT overloaded CPU, and that CPU happens to be the
current CPU that is scheduling in a lower priority task.
Even on a system with many CPUs, if there's many RT tasks waiting to
run on a single CPU, and that CPU schedules in another RT task of lower
priority, it will initiate the PULL logic in case there's a higher
priority RT task on another CPU that is waiting to run. But if there is
no other CPU with waiting RT tasks, it will initiate the RT pull logic
on itself (as it still has RT tasks waiting to run). This is a wasted
effort.
Not only does this help with SMP code where the current CPU is the only
one with RT overloaded tasks, it should also solve the issue that
Daniel encountered, because it will prevent the PULL logic from
executing, as there's only one CPU on the system, and the check added
here will cause it to exit the RT pull code.
Reported-by: Daniel Wagner <wagi@monom.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-rt-users <linux-rt-users@vger.kernel.org> Cc: stable@vger.kernel.org Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic") Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home Signed-off-by: Ingo Molnar <mingo@kernel.org>