]> git.proxmox.com Git - mirror_ubuntu-eoan-kernel.git/log
mirror_ubuntu-eoan-kernel.git
5 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Linus Torvalds [Wed, 15 May 2019 03:56:31 +0000 (20:56 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull more rdma updates from Jason Gunthorpe:
 "This is being sent to get a fix for the gcc 9.1 build warnings, and
  I've also pulled in some bug fix patches that were posted in the last
  two weeks.

   - Avoid the gcc 9.1 warning about overflowing a union member

   - Fix the wrong callback type for a single response netlink to doit

   - Bug fixes from more usage of the mlx5 devx interface"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  net/mlx5: Set completion EQs as shared resources
  IB/mlx5: Verify DEVX general object type correctly
  RDMA/core: Change system parameters callback from dumpit to doit
  RDMA: Directly cast the sockaddr union to sockaddr

5 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Wed, 15 May 2019 03:08:51 +0000 (20:08 -0700)]
Merge branch 'akpm' (patches from Andrew)

Merge more updates from Andrew Morton:

 - a couple of hotfixes

 - almost all of the rest of MM

 - lib/ updates

 - binfmt_elf updates

 - autofs updates

 - quite a lot of misc fixes and updates
    - reiserfs, fatfs
    - signals
    - exec
    - cpumask
    - rapidio
    - sysctl
    - pids
    - eventfd
    - gcov
    - panic
    - pps

 - gdb script updates

 - ipc updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (126 commits)
  mm: memcontrol: fix NUMA round-robin reclaim at intermediate level
  mm: memcontrol: fix recursive statistics correctness & scalabilty
  mm: memcontrol: move stat/event counting functions out-of-line
  mm: memcontrol: make cgroup stats and events query API explicitly local
  drivers/virt/fsl_hypervisor.c: prevent integer overflow in ioctl
  drivers/virt/fsl_hypervisor.c: dereferencing error pointers in ioctl
  mm, memcg: rename ambiguously named memory.stat counters and functions
  arch: remove <asm/sizes.h> and <asm-generic/sizes.h>
  treewide: replace #include <asm/sizes.h> with #include <linux/sizes.h>
  fs/block_dev.c: Remove duplicate header
  fs/cachefiles/namei.c: remove duplicate header
  include/linux/sched/signal.h: replace `tsk' with `task'
  fs/coda/psdev.c: remove duplicate header
  ipc: do cyclic id allocation for the ipc object.
  ipc: conserve sequence numbers in ipcmni_extend mode
  ipc: allow boot time extension of IPCMNI from 32k to 16M
  ipc/mqueue: optimize msg_get()
  ipc/mqueue: remove redundant wq task assignment
  ipc: prevent lockup on alloc_msg and free_msg
  scripts/gdb: print cached rate in lx-clk-summary
  ...

5 years agomm: memcontrol: fix NUMA round-robin reclaim at intermediate level
Johannes Weiner [Tue, 14 May 2019 22:47:15 +0000 (15:47 -0700)]
mm: memcontrol: fix NUMA round-robin reclaim at intermediate level

When a cgroup is reclaimed on behalf of a configured limit, reclaim
needs to round-robin through all NUMA nodes that hold pages of the memcg
in question.  However, when assembling the mask of candidate NUMA nodes,
the code only consults the *local* cgroup LRU counters, not the
recursive counters for the entire subtree.  Cgroup limits are frequently
configured against intermediate cgroups that do not have memory on their
own LRUs.  In this case, the node mask will always come up empty and
reclaim falls back to scanning only the current node.

If a cgroup subtree has some memory on one node but the processes are
bound to another node afterwards, the limit reclaim will never age or
reclaim that memory anymore.

To fix this, use the recursive LRU counts for a cgroup subtree to
determine which nodes hold memory of that cgroup.

The code has been broken like this forever, so it doesn't seem to be a
problem in practice.  I just noticed it while reviewing the way the LRU
counters are used in general.

Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: memcontrol: fix recursive statistics correctness & scalabilty
Johannes Weiner [Tue, 14 May 2019 22:47:12 +0000 (15:47 -0700)]
mm: memcontrol: fix recursive statistics correctness & scalabilty

Right now, when somebody needs to know the recursive memory statistics
and events of a cgroup subtree, they need to walk the entire subtree and
sum up the counters manually.

There are two issues with this:

1. When a cgroup gets deleted, its stats are lost. The state counters
   should all be 0 at that point, of course, but the events are not.
   When this happens, the event counters, which are supposed to be
   monotonic, can go backwards in the parent cgroups.

2. During regular operation, we always have a certain number of lazily
   freed cgroups sitting around that have been deleted, have no tasks,
   but have a few cache pages remaining. These groups' statistics do not
   change until we eventually hit memory pressure, but somebody
   watching, say, memory.stat on an ancestor has to iterate those every
   time.

This patch addresses both issues by introducing recursive counters at
each level that are propagated from the write side when stats change.

Upward propagation happens when the per-cpu caches spill over into the
local atomic counter.  This is the same thing we do during charge and
uncharge, except that the latter uses atomic RMWs, which are more
expensive; stat changes happen at around the same rate.  In a sparse
file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
nesting levels, perf shows __mod_memcg_page state at ~1%.

Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: memcontrol: move stat/event counting functions out-of-line
Johannes Weiner [Tue, 14 May 2019 22:47:09 +0000 (15:47 -0700)]
mm: memcontrol: move stat/event counting functions out-of-line

These are getting too big to be inlined in every callsite.  They were
stolen from vmstat.c, which already out-of-lines them, and they have
only been growing since.  The callsites aren't that hot, either.

Move __mod_memcg_state()
     __mod_lruvec_state() and
     __count_memcg_events() out of line and add kerneldoc comments.

Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: memcontrol: make cgroup stats and events query API explicitly local
Johannes Weiner [Tue, 14 May 2019 22:47:06 +0000 (15:47 -0700)]
mm: memcontrol: make cgroup stats and events query API explicitly local

Patch series "mm: memcontrol: memory.stat cost & correctness".

The cgroup memory.stat file holds recursive statistics for the entire
subtree.  The current implementation does this tree walk on-demand
whenever the file is read.  This is giving us problems in production.

1. The cost of aggregating the statistics on-demand is high.  A lot of
   system service cgroups are mostly idle and their stats don't change
   between reads, yet we always have to check them.  There are also always
   some lazily-dying cgroups sitting around that are pinned by a handful
   of remaining page cache; the same applies to them.

   In an application that periodically monitors memory.stat in our
   fleet, we have seen the aggregation consume up to 5% CPU time.

2. When cgroups die and disappear from the cgroup tree, so do their
   accumulated vm events.  The result is that the event counters at
   higher-level cgroups can go backwards and confuse some of our
   automation, let alone people looking at the graphs over time.

To address both issues, this patch series changes the stat
implementation to spill counts upwards when the counters change.

The upward spilling is batched using the existing per-cpu cache.  In a
sparse file stress test with 5 level cgroup nesting, the additional cost
of the flushing was negligible (a little under 1% of CPU at 100% CPU
utilization, compared to the 5% of reading memory.stat during regular
operation).

This patch (of 4):

memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
currently returning the state of the local memcg or lruvec, not the
recursive state.

In practice there is a demand for both versions, although the callers
that want the recursive counts currently sum them up by hand.

Per default, cgroups are considered recursive entities and generally we
expect more users of the recursive counters, with the local counts being
special cases.  To reflect that in the name, add a _local suffix to the
current implementations.

The following patch will re-incarnate these functions with recursive
semantics, but with an O(1) implementation.

[hannes@cmpxchg.org: fix bisection hole]
Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agodrivers/virt/fsl_hypervisor.c: prevent integer overflow in ioctl
Dan Carpenter [Tue, 14 May 2019 22:47:03 +0000 (15:47 -0700)]
drivers/virt/fsl_hypervisor.c: prevent integer overflow in ioctl

The "param.count" value is a u64 thatcomes from the user.  The code
later in the function assumes that param.count is at least one and if
it's not then it leads to an Oops when we dereference the ZERO_SIZE_PTR.

Also the addition can have an integer overflow which would lead us to
allocate a smaller "pages" array than required.  I can't immediately
tell what the possible run times implications are, but it's safest to
prevent the overflow.

Link: http://lkml.kernel.org/r/20181218082129.GE32567@kadam
Fixes: 6db7199407ca ("drivers/virt: introduce Freescale hypervisor management driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Timur Tabi <timur@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agodrivers/virt/fsl_hypervisor.c: dereferencing error pointers in ioctl
Dan Carpenter [Tue, 14 May 2019 22:47:00 +0000 (15:47 -0700)]
drivers/virt/fsl_hypervisor.c: dereferencing error pointers in ioctl

strndup_user() returns error pointers on error, and then in the error
handling we pass the error pointers to kfree().  It will cause an Oops.

Link: http://lkml.kernel.org/r/20181218082003.GD32567@kadam
Fixes: 6db7199407ca ("drivers/virt: introduce Freescale hypervisor management driver")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Timur Tabi <timur@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm, memcg: rename ambiguously named memory.stat counters and functions
Chris Down [Tue, 14 May 2019 22:46:57 +0000 (15:46 -0700)]
mm, memcg: rename ambiguously named memory.stat counters and functions

I spent literally an hour trying to work out why an earlier version of
my memory.events aggregation code doesn't work properly, only to find
out I was calling memcg->events instead of memcg->memory_events, which
is fairly confusing.

This naming seems in need of reworking, so make it harder to do the
wrong thing by using vmevents instead of events, which makes it more
clear that these are vm counters rather than memcg-specific counters.

There are also a few other inconsistent names in both the percpu and
aggregated structs, so these are all cleaned up to be more coherent and
easy to understand.

This commit contains code cleanup only: there are no logic changes.

[akpm@linux-foundation.org: fix it for preceding changes]
Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoarch: remove <asm/sizes.h> and <asm-generic/sizes.h>
Masahiro Yamada [Tue, 14 May 2019 22:46:54 +0000 (15:46 -0700)]
arch: remove <asm/sizes.h> and <asm-generic/sizes.h>

Now that all instances of #include <asm/sizes.h> have been replaced with
#include <linux/sizes.h>, we can remove these.

Link: http://lkml.kernel.org/r/1553267665-27228-2-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotreewide: replace #include <asm/sizes.h> with #include <linux/sizes.h>
Masahiro Yamada [Tue, 14 May 2019 22:46:51 +0000 (15:46 -0700)]
treewide: replace #include <asm/sizes.h> with #include <linux/sizes.h>

Since commit dccd2304cc90 ("ARM: 7430/1: sizes.h: move from asm-generic
to <linux/sizes.h>"), <asm/sizes.h> and <asm-generic/sizes.h> are just
wrappers of <linux/sizes.h>.

This commit replaces all <asm/sizes.h> and <asm-generic/sizes.h> to
prepare for the removal.

Link: http://lkml.kernel.org/r/1553267665-27228-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/block_dev.c: Remove duplicate header
Sabyasachi Gupta [Tue, 14 May 2019 22:46:48 +0000 (15:46 -0700)]
fs/block_dev.c: Remove duplicate header

linux/dax.h is included more than once.

Link: http://lkml.kernel.org/r/5c867e95.1c69fb81.4f15a.e5e4@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/cachefiles/namei.c: remove duplicate header
Sabyasachi Gupta [Tue, 14 May 2019 22:46:45 +0000 (15:46 -0700)]
fs/cachefiles/namei.c: remove duplicate header

linux/xattr.h is included more than once.

Link: http://lkml.kernel.org/r/5c86803d.1c69fb81.1a7c6.2b78@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoinclude/linux/sched/signal.h: replace `tsk' with `task'
Andrei Vagin [Tue, 14 May 2019 22:46:42 +0000 (15:46 -0700)]
include/linux/sched/signal.h: replace `tsk' with `task'

This file uses "task" 85 times and "tsk" 25 times.  It is better to be
consistent.

Link: http://lkml.kernel.org/r/20181129180547.15976-1-avagin@gmail.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/coda/psdev.c: remove duplicate header
Sabyasachi Gupta [Tue, 14 May 2019 22:46:39 +0000 (15:46 -0700)]
fs/coda/psdev.c: remove duplicate header

linux/poll.h is included more than once.

Link: http://lkml.kernel.org/r/5c86820f.1c69fb81.149f0.0834@mx.google.com
Signed-off-by: Sabyasachi Gupta <sabyasachi.linux@gmail.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoipc: do cyclic id allocation for the ipc object.
Manfred Spraul [Tue, 14 May 2019 22:46:36 +0000 (15:46 -0700)]
ipc: do cyclic id allocation for the ipc object.

For ipcmni_extend mode, the sequence number space is only 7 bits.  So
the chance of id reuse is relatively high compared with the non-extended
mode.

To alleviate this id reuse problem, this patch enables cyclic allocation
for the index to the radix tree (idx).  The disadvantage is that this
can cause a slight slow-down of the fast path, as the radix tree could
be higher than necessary.

To limit the radix tree height, I have chosen the following limits:
 1) The cycling is done over in_use*1.5.
 2) At least, the cycling is done over
   "normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
   "ipcmni_extended": 4096 elements

Result:
- for normal mode:
No change for <= 42 active ipc elements. With more than 42
active ipc elements, a 2nd level would be added to the radix
tree.
Without cyclic allocation, a 2nd level would be added only with
more than 63 active elements.

- for extended mode:
Cycling creates always at least a 2-level radix tree.
With more than 2730 active objects, a 3rd level would be
added, instead of > 4095 active objects until the 3rd level
is added without cyclic allocation.

For a 2-level radix tree compared to a 1-level radix tree, I have
observed < 1% performance impact.

Notes:
1) Normal "x=semget();y=semget();" is unaffected: Then the idx
  is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
  is used.

2) The -1% happens in a microbenchmark after this situation:
x=semget();
for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
y=semget();
Now perform semget calls on x and y that do not sleep.

3) The worst-case reuse cycle time is unfortunately unaffected:
   If you have 2^24-1 ipc objects allocated, and get/remove the last
   possible element in a loop, then the id is reused after 128
   get/remove pairs.

Performance check:
A microbenchmark that performes no-op semop() randomly on two IDs,
with only these two IDs allocated.
The IDs were set using /proc/sys/kernel/sem_next_id.
The test was run 5 times, averages are shown.

1 & 2: Base (6.22 seconds for 10.000.000 semops)
1 & 40: -0.2%
1 & 3348: - 0.8%
1 & 27348: - 1.6%
1 & 15777204: - 3.2%

Or: ~12.6 cpu cycles per additional radix tree level.
The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
than what I remember (spectre impact?).

V2 of the patch:
- use "min" and "max"
- use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
(2<<12).

[akpm@linux-foundation.org: fix max() warning]
Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoipc: conserve sequence numbers in ipcmni_extend mode
Manfred Spraul [Tue, 14 May 2019 22:46:33 +0000 (15:46 -0700)]
ipc: conserve sequence numbers in ipcmni_extend mode

Rewrite, based on the patch from Waiman Long:

The mixing in of a sequence number into the IPC IDs is probably to avoid
ID reuse in userspace as much as possible.  With ipcmni_extend mode, the
number of usable sequence numbers is greatly reduced leading to higher
chance of ID reuse.

To address this issue, we need to conserve the sequence number space as
much as possible.  Right now, the sequence number is incremented for
every new ID created.  In reality, we only need to increment the
sequence number when new allocated ID is not greater than the last one
allocated.  It is in such case that the new ID may collide with an
existing one.  This is being done irrespective of the ipcmni mode.

In order to avoid any races, the index is first allocated and then the
pointer is replaced.

Changes compared to the initial patch:
 - Handle failures from idr_alloc().
 - Avoid that concurrent operations can see the wrong sequence number.
   (This is achieved by using idr_replace()).
 - IPCMNI_SEQ_SHIFT is not a constant, thus renamed to
   ipcmni_seq_shift().
 - IPCMNI_SEQ_MAX is not a constant, thus renamed to ipcmni_seq_max().

Link: http://lkml.kernel.org/r/20190329204930.21620-2-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoipc: allow boot time extension of IPCMNI from 32k to 16M
Waiman Long [Tue, 14 May 2019 22:46:29 +0000 (15:46 -0700)]
ipc: allow boot time extension of IPCMNI from 32k to 16M

The maximum number of unique System V IPC identifiers was limited to
32k.  That limit should be big enough for most use cases.

However, there are some users out there requesting for more, especially
those that are migrating from Solaris which uses 24 bits for unique
identifiers.  To satisfy the need of those users, a new boot time kernel
option "ipcmni_extend" is added to extend the IPCMNI value to 16M.  This
is a 512X increase which should be big enough for users out there that
need a large number of unique IPC identifier.

The use of this new option will change the pattern of the IPC
identifiers returned by functions like shmget(2).  An application that
depends on such pattern may not work properly.  So it should only be
used if the users really need more than 32k of unique IPC numbers.

This new option does have the side effect of reducing the maximum number
of unique sequence numbers from 64k down to 128.  So it is a trade-off.

The computation of a new IPC id is not done in the performance critical
path.  So a little bit of additional overhead shouldn't have any real
performance impact.

Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoipc/mqueue: optimize msg_get()
Davidlohr Bueso [Tue, 14 May 2019 22:46:26 +0000 (15:46 -0700)]
ipc/mqueue: optimize msg_get()

Our msg priorities became an rbtree as of d6629859b36d ("ipc/mqueue:
improve performance of send/recv").  However, consuming a msg in
msg_get() remains logarithmic (still being better than the case before
of course).  By applying well known techniques to cache pointers we can
have the node with the highest priority in O(1), which is specially nice
for the rt cases.  Furthermore, some callers can call msg_get() in a
loop.

A new msg_tree_erase() helper is also added to encapsulate the tree
removal and node_cache game.  Passes ltp mq testcases.

Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoipc/mqueue: remove redundant wq task assignment
Davidlohr Bueso [Tue, 14 May 2019 22:46:23 +0000 (15:46 -0700)]
ipc/mqueue: remove redundant wq task assignment

We already store the current task fo the new waiter before calling
wq_sleep() in both send and recv paths.  Trivially remove the redundant
assignment.

Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoipc: prevent lockup on alloc_msg and free_msg
Li Rongqing [Tue, 14 May 2019 22:46:20 +0000 (15:46 -0700)]
ipc: prevent lockup on alloc_msg and free_msg

msgctl10 of ltp triggers the following lockup When CONFIG_KASAN is
enabled on large memory SMP systems, the pages initialization can take a
long time, if msgctl10 requests a huge block memory, and it will block
rcu scheduler, so release cpu actively.

After adding schedule() in free_msg, free_msg can not be called when
holding spinlock, so adding msg to a tmp list, and free it out of
spinlock

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu:     Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
  rcu:     Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
  rcu:     (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
  msgctl10        R  running task    21608 32505   2794 0x00000082
  Call Trace:
   preempt_schedule_irq+0x4c/0xb0
   retint_kernel+0x1b/0x2d
  RIP: 0010:__is_insn_slot_addr+0xfb/0x250
  Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
  RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
  RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
  RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
  RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
  R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
  R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
   kernel_text_address+0xc1/0x100
   __kernel_text_address+0xe/0x30
   unwind_get_return_address+0x2f/0x50
   __save_stack_trace+0x92/0x100
   create_object+0x380/0x650
   __kmalloc+0x14c/0x2b0
   load_msg+0x38/0x1a0
   do_msgsnd+0x19e/0xcf0
   do_syscall_64+0x117/0x400
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
  rcu:     (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
  msgctl10        R  running task    21608 32170  32155 0x00000082
  Call Trace:
   preempt_schedule_irq+0x4c/0xb0
   retint_kernel+0x1b/0x2d
  RIP: 0010:lock_acquire+0x4d/0x340
  Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
  RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
  RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
  RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
  RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
  R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
   is_bpf_text_address+0x32/0xe0
   kernel_text_address+0xec/0x100
   __kernel_text_address+0xe/0x30
   unwind_get_return_address+0x2f/0x50
   __save_stack_trace+0x92/0x100
   save_stack+0x32/0xb0
   __kasan_slab_free+0x130/0x180
   kfree+0xfa/0x2d0
   free_msg+0x24/0x50
   do_msgrcv+0x508/0xe60
   do_syscall_64+0x117/0x400
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Davidlohr said:
 "So after releasing the lock, the msg rbtree/list is empty and new
  calls will not see those in the newly populated tmp_msg list, and
  therefore they cannot access the delayed msg freeing pointers, which
  is good. Also the fact that the node_cache is now freed before the
  actual messages seems to be harmless as this is wanted for
  msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
  info->lock the thing is freed anyway so it should not change things"

Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: print cached rate in lx-clk-summary
Leonard Crestez [Tue, 14 May 2019 22:46:17 +0000 (15:46 -0700)]
scripts/gdb: print cached rate in lx-clk-summary

The clk rate is always stored in clk_core but might be out of date and
require calls to update from hardware.

Deal with that case by printing a (c) suffix.

Link: http://lkml.kernel.org/r/1a474318982a5f0125f2360c4161029b17f56bd1.1556881728.git.leonard.crestez@nxp.com
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: clean up error handling in list helpers
Leonard Crestez [Tue, 14 May 2019 22:46:14 +0000 (15:46 -0700)]
scripts/gdb: clean up error handling in list helpers

An incorrect argument to list_for_each is an internal error in gdb
scripts so a TypeError should be raised.  The gdb.GdbError exception
type is intended for user errors such as incorrect invocation.

Drop the type assertion in list_for_each_entry because list_for_each
isn't going to suddenly yield something else.

Applies to both list and hlist

Link: http://lkml.kernel.org/r/c1d3fd4db13d999a3ba57f5bbc1924862d824f61.1556881728.git.leonard.crestez@nxp.com
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: add $lx_clk_core_lookup function
Leonard Crestez [Tue, 14 May 2019 22:46:11 +0000 (15:46 -0700)]
scripts/gdb: add $lx_clk_core_lookup function

Finding an individual clk_core requires walking the tree which can be
quite complicated so add a helper for easy access.

(gdb) print *(struct clk_scu*)$lx_clk_core_lookup("uart0_clk")->hw

Link: http://lkml.kernel.org/r/Message-ID:
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: initial clk support: lx-clk-summary
Leonard Crestez [Tue, 14 May 2019 22:46:08 +0000 (15:46 -0700)]
scripts/gdb: initial clk support: lx-clk-summary

Add an lx-clk-summary command which prints a subset of
/sys/kernel/debug/clk/clk_summary.

This can be used to examine hangs caused by clk not being enabled.

Link: http://lkml.kernel.org/r/Message-ID:
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: add hlist utilities
Leonard Crestez [Tue, 14 May 2019 22:46:05 +0000 (15:46 -0700)]
scripts/gdb: add hlist utilities

This allows easily examining kernel hlists in python.

Link: http://lkml.kernel.org/r/Message-ID:
Signed-off-by: Leonard Crestez <leonard.crestez@nxp.com>
Reviewed-by: Stephen Boyd <sboyd@kernel.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: silence pep8 checks
Stephen Boyd [Tue, 14 May 2019 22:46:02 +0000 (15:46 -0700)]
scripts/gdb: silence pep8 checks

These scripts have some pep8 style warnings.  Fix them up so that this
directory is all pep8 clean.

Link: http://lkml.kernel.org/r/20190329220844.38234-6-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jackie Liu <liuyun01@kylinos.cn>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: add a timer list command
Stephen Boyd [Tue, 14 May 2019 22:45:59 +0000 (15:45 -0700)]
scripts/gdb: add a timer list command

Implement a command to print the timer list, much like how
/proc/timer_list is implemented.  This can be used to look at the
pending timers on a crashed system.

[swboyd@chromium.org: v2]
Link: http://lkml.kernel.org/r/20190329220844.38234-5-swboyd@chromium.org
Link: http://lkml.kernel.org/r/20190325184522.260535-5-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jackie Liu <liuyun01@kylinos.cn>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: add rb tree iterating utilities
Stephen Boyd [Tue, 14 May 2019 22:45:56 +0000 (15:45 -0700)]
scripts/gdb: add rb tree iterating utilities

Implement gdb functions for rb_first(), rb_last(), rb_next(), and
rb_prev().  These can be useful to iterate through the kernel's
red-black trees.

[swboyd@chromium.org: v2]
Link: http://lkml.kernel.org/r/20190329220844.38234-4-swboyd@chromium.org
Link: http://lkml.kernel.org/r/20190325184522.260535-4-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jackie Liu <liuyun01@kylinos.cn>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: add kernel config dumping command
Stephen Boyd [Tue, 14 May 2019 22:45:53 +0000 (15:45 -0700)]
scripts/gdb: add kernel config dumping command

lx-configdump <file> dumps the contents of the gzipped .config to a text
file when the config is included in the kernel with CONFIG_IKCONFIG.  By
default, the file written is called config.txt, but it can be any user
supplied filename as well.  If the kernel config is in a module
(configs.ko), then it can be loaded along with symbols for the module
loaded with 'lx-symbols' and then this command will still work.

Obviously if you have the whole vmlinux then this can also be achieved
with scripts/extract-ikconfig, but this gdb script can be useful to
confirm that the memory contents of the config in memory and the vmlinux
contents on disk match what is expected.

[swboyd@chromium.org: v2]
Link: http://lkml.kernel.org/r/20190329220844.38234-3-swboyd@chromium.org
Link: http://lkml.kernel.org/r/20190325184522.260535-3-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jackie Liu <liuyun01@kylinos.cn>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoscripts/gdb: find vmlinux where it was before
Stephen Boyd [Tue, 14 May 2019 22:45:49 +0000 (15:45 -0700)]
scripts/gdb: find vmlinux where it was before

Patch series "gdb script for kconfig and timer list".

This is a handful of changes to the kernel's gdb scripts to do some more
debugging with kgdb.  The first patch allows the vmlinux to be reloaded
from where it was specified on the command line so that this set of
scripts can be used from anywhere.  The second patch adds a script to
dump the config.gz to a file on the host debugging machine.  The third
patch adds some rb tree utilities and the last patch uses those rb tree
walking utilities to dump out the contents of /proc/timer_list from a
system under debug.

This patch (of 5):

If I run 'gdb <path/to/vmlinux>' and there's the vmlinux-gdb.py file
there I can properly see symbols and use the lx commands provided by the
GDB scripts.  But once I run 'lx-symbols' at the command prompt, gdb
reloads the vmlinux symbols assuming that this script was run from the
directory that has vmlinux at the root.  That isn't always true, but we
could just look and see what symbols were already loaded and use that
instead.  Let's do that so this can work by being invoked anywhere.

Link: http://lkml.kernel.org/r/20190325184522.260535-2-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Douglas Anderson <dianders@chromium.org>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: Kieran Bingham <kbingham@kernel.org>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jackie Liu <liuyun01@kylinos.cn>
Cc: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopps: pps-gpio PPS ECHO implementation
Tom Burkart [Tue, 14 May 2019 22:45:46 +0000 (15:45 -0700)]
pps: pps-gpio PPS ECHO implementation

This patch implements the PPS ECHO functionality for pps-gpio, that
sysfs claims is available already.

Configuration is done via device tree bindings.

No changes are made to userspace interfaces.

This patch was originally written by Lukas Senger as part of a masters
thesis project and modified for inclusion into the linux kernel by Tom
Burkart.

Link: http://lkml.kernel.org/r/20190324043305.6627-4-tom@aussec.com
Signed-off-by: Tom Burkart <tom@aussec.com>
Acked-by: Rodolfo Giometti <giometti@enneenne.com>
Signed-off-by: Lukas Senger <lukas@fridolin.com>
Cc: Philipp Zabel <philipp.zabel@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agodt-bindings: pps: pps-gpio PPS ECHO implementation
Tom Burkart [Tue, 14 May 2019 22:45:43 +0000 (15:45 -0700)]
dt-bindings: pps: pps-gpio PPS ECHO implementation

This patch implements the device tree binding changes required for the
PPS ECHO functionality for pps-gpio, that sysfs claims is available
already.

It adds two DT properties for configuring the PPS ECHO functionality.

This patch is provided separated from the rest of the patch per
Documentation/devicetree/bindings/submitting-patches.txt.

This patch was originally written by Lukas Senger as part of a masters
thesis project and modified for inclusion into the linux kernel by Tom
Burkart.

Link: http://lkml.kernel.org/r/20190324043305.6627-3-tom@aussec.com
Signed-off-by: Tom Burkart <tom@aussec.com>
Signed-off-by: Lukas Senger <lukas@fridolin.com>
Acked-by: Rodolfo Giometti <giometti@enneenne.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Cc: Philipp Zabel <philipp.zabel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopps: descriptor-based gpio
Tom Burkart [Tue, 14 May 2019 22:45:40 +0000 (15:45 -0700)]
pps: descriptor-based gpio

This patch changes the GPIO access for the pps-gpio driver from the
integer based API to the descriptor based API.

The integer based API is considered deprecated and the descriptor based
API is the preferred way to access GPIOs as per
Documentation/driver-api/gpio/intro.rst

No changes are made to userspace interfaces.

Link: http://lkml.kernel.org/r/20190324043305.6627-2-tom@aussec.com
Signed-off-by: Tom Burkart <tom@aussec.com>
Acked-by: Rodolfo Giometti <giometti@enneenne.com>
Reviewed-by: Philipp Zabel <philipp.zabel@gmail.com>
Cc: Lukas Senger <lukas@fridolin.com>
Cc: Rob Herring <robh@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopanic/reboot: allow specifying reboot_mode for panic only
Aaro Koskinen [Tue, 14 May 2019 22:45:37 +0000 (15:45 -0700)]
panic/reboot: allow specifying reboot_mode for panic only

Allow specifying reboot_mode for panic only.  This is needed on systems
where ramoops is used to store panic logs, and user wants to use warm
reset to preserve those, while still having cold reset on normal
reboots.

Link: http://lkml.kernel.org/r/20190322004735.27702-1-aaro.koskinen@iki.fi
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopanic: avoid the extra noise dmesg
Feng Tang [Tue, 14 May 2019 22:45:34 +0000 (15:45 -0700)]
panic: avoid the extra noise dmesg

When kernel panic happens, it will first print the panic call stack,
then the ending msg like:

[   35.743249] ---[ end Kernel panic - not syncing: Fatal exception
[   35.749975] ------------[ cut here ]------------

The above message are very useful for debugging.

But if system is configured to not reboot on panic, say the
"panic_timeout" parameter equals 0, it will likely print out many noisy
message like WARN() call stack for each and every CPU except the panic
one, messages like below:

WARNING: CPU: 1 PID: 280 at kernel/sched/core.c:1198 set_task_cpu+0x183/0x190
Call Trace:
<IRQ>
try_to_wake_up
default_wake_function
autoremove_wake_function
__wake_up_common
__wake_up_common_lock
__wake_up
wake_up_klogd_work_func
irq_work_run_list
irq_work_tick
update_process_times
tick_sched_timer
__hrtimer_run_queues
hrtimer_interrupt
smp_apic_timer_interrupt
apic_timer_interrupt

For people working in console mode, the screen will first show the panic
call stack, but immediately overridden by these noisy extra messages,
which makes debugging much more difficult, as the original context gets
lost on screen.

Also these noisy messages will confuse some users, as I have seen many bug
reporters posted the noisy message into bugzilla, instead of the real
panic call stack and context.

Adding a flag "suppress_printk" which gets set in panic() to avoid those
noisy messages, without changing current kernel behavior that both panic
blinking and sysrq magic key can work as is, suggested by Petr Mladek.

To verify this, make sure kernel is not configured to reboot on panic and
in console
 # echo c > /proc/sysrq-trigger
to see if console only prints out the panic call stack.

Link: http://lkml.kernel.org/r/1551430186-24169-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agogcov: clang support
Greg Hackmann [Tue, 14 May 2019 22:45:31 +0000 (15:45 -0700)]
gcov: clang support

LLVM uses profiling data that's deliberately similar to GCC, but has a
very different way of exporting that data.  LLVM calls llvm_gcov_init()
once per module, and provides a couple of callbacks that we can use to
ask for more data.

We care about the "writeout" callback, which in turn calls back into
compiler-rt/this module to dump all the gathered coverage data to disk:

   llvm_gcda_start_file()
     llvm_gcda_emit_function()
     llvm_gcda_emit_arcs()
     llvm_gcda_emit_function()
     llvm_gcda_emit_arcs()
     [... repeats for each function ...]
   llvm_gcda_summary_info()
   llvm_gcda_end_file()

This design is much more stateless and unstructured than gcc's, and is
intended to run at process exit.  This forces us to keep some local
state about which module we're dealing with at the moment.  On the other
hand, it also means we don't depend as much on how LLVM represents
profiling data internally.

See LLVM's lib/Transforms/Instrumentation/GCOVProfiling.cpp for more
details on how this works, particularly GCOVProfiler::emitProfileArcs(),
GCOVProfiler::insertCounterWriteout(), and GCOVProfiler::insertFlush().

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190417225328.208129-1-trong@android.com
Signed-off-by: Greg Hackmann <ghackmann@android.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Tri Vo <trong@android.com>
Co-developed-by: Nick Desaulniers <ndesaulniers@google.com>
Co-developed-by: Tri Vo <trong@android.com>
Tested-by: Trilok Soni <tsoni@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Tri Vo <trong@android.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Tested-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agogcov: docs: add a note on GCC vs Clang differences
Tri Vo [Tue, 14 May 2019 22:45:28 +0000 (15:45 -0700)]
gcov: docs: add a note on GCC vs Clang differences

Document some things of note to gcov users:
1. GCC gcov and Clang llvm-cov tools are not compatible.
2. The use of GCC vs Clang is transparent at build-time.

Also adjust the documentation to account for the removal of config symbol
CONFIG_GCOV_FORMAT_AUTODETECT by commit 6a61b70b43c9 ("gcov: remove
CONFIG_GCOV_FORMAT_AUTODETECT").

Link: http://lkml.kernel.org/r/20190318025411.98014-4-trong@android.com
Signed-off-by: Tri Vo <trong@android.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: Greg Hackmann <ghackmann@android.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Petri Gynther <pgynther@google.com>
Cc: Prasad Sodagudi <psodagud@quicinc.com>
Cc: Trilok Soni <tsoni@quicinc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agogcov: clang: move common GCC code into gcc_base.c
Greg Hackmann [Tue, 14 May 2019 22:45:25 +0000 (15:45 -0700)]
gcov: clang: move common GCC code into gcc_base.c

Patch series "gcov: add Clang support", v4.

This patch (of 3):

base.c contains a few callbacks specific to GCC's gcov implementation.
Move these into their own module in preparation for Clang support.

Link: http://lkml.kernel.org/r/20190318025411.98014-2-trong@android.com
Signed-off-by: Greg Hackmann <ghackmann@android.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Trilok Soni <tsoni@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Tri Vo <trong@android.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: Petri Gynther <pgynther@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/eventfd.c: make eventfd_ida static
YueHaibing [Tue, 14 May 2019 22:45:22 +0000 (15:45 -0700)]
fs/eventfd.c: make eventfd_ida static

Fix sparse warning:

fs/eventfd.c:26:1: warning:
 symbol 'eventfd_ida' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20190413142348.34716-1-yuehaibing@huawei.com
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoeventfd: present id to userspace via fdinfo
Masatake YAMATO [Tue, 14 May 2019 22:45:19 +0000 (15:45 -0700)]
eventfd: present id to userspace via fdinfo

Finding endpoints of an IPC channel is one of essential task to
understand how a user program works.  Procfs and netlink socket provide
enough hints to find endpoints for IPC channels like pipes, unix
sockets, and pseudo terminals.  However, there is no simple way to find
endpoints for an eventfd file from userland.  An inode number doesn't
hint.  Unlike pipe, all eventfd files share the same inode object.

To provide the way to find endpoints of an eventfd file, this patch adds
"eventfd-id" field to /proc/PID/fdinfo of eventfd as identifier.
Integers managed by an IDA are used as ids.

A tool like lsof can utilize the information to print endpoints.

Link: http://lkml.kernel.org/r/20190327181823.20222-1-yamato@redhat.com
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/pid.c: remove unneeded hash header file
Timmy Li [Tue, 14 May 2019 22:45:16 +0000 (15:45 -0700)]
kernel/pid.c: remove unneeded hash header file

Hash functions are not needed since idr is used now.  Let's remove hash
header file for cleanup.

Link: http://lkml.kernel.org/r/20190430053319.95913-1-scuttimmy@gmail.com
Signed-off-by: Timmy Li <scuttimmy@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/sysctl.c: fix proc_do_large_bitmap for large input buffers
Eric Sandeen [Tue, 14 May 2019 22:45:13 +0000 (15:45 -0700)]
kernel/sysctl.c: fix proc_do_large_bitmap for large input buffers

Today, proc_do_large_bitmap() truncates a large write input buffer to
PAGE_SIZE - 1, which may result in misparsed numbers at the (truncated)
end of the buffer.  Further, it fails to notify the caller that the
buffer was truncated, so it doesn't get called iteratively to finish the
entire input buffer.

Tell the caller if there's more work to do by adding the skipped amount
back to left/*lenp before returning.

To fix the misparsing, reset the position if we have completely consumed
a truncated buffer (or if just one char is left, which may be a "-" in a
range), and ask the caller to come back for more.

Link: http://lkml.kernel.org/r/20190320222831.8243-7-mcgrof@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/sysctl/sysctl.sh: add proc_do_large_bitmap() test case
Eric Sandeen [Tue, 14 May 2019 22:45:10 +0000 (15:45 -0700)]
tools/testing/selftests/sysctl/sysctl.sh: add proc_do_large_bitmap() test case

The kernel has only two users of proc_do_large_bitmap(), the kernel CPU
watchdog, and the ip_local_reserved_ports.  Refer to watchdog_cpumask
and ip_local_reserved_ports in Documentation for further details on
these.  When you input a large buffer into these, when it is larger than
PAGE_SIZE- 1, the input data gets misparsed, and the user get
incorrectly informed that the desired input value was set.  This commit
implements a test which mimics and exploits that use case, it uses a
bitmap size, as in the watchdog case.  The bitmap is used to test the
bitmap proc handler, proc_do_large_bitmap().

The next commit fixes this issue.

[akpm@linux-foundation.org: move proc_do_large_bitmap() export to EOF]
[mcgrof@kernel.org: use new target description for backward compatibility]
[mcgrof@kernel.org: augment test number to 50, ran into issues with bash string comparisons when testing up to 50 cases.]
[mcgrof@kernel.org: introduce and use verify_diff_proc_file() to use diff]
[mcgrof@kernel.org: use mktemp for tmp file]
[mcgrof@kernel.org: merge shell test and C code]
[mcgrof@kernel.org: commit log love]
[mcgrof@kernel.org: export proc_do_large_bitmap() to allow for the test
[mcgrof@kernel.org: check for the return value when writing to the proc file]
Link: http://lkml.kernel.org/r/20190320222831.8243-6-mcgrof@kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/sysctl/sysctl.sh: allow graceful use on older kernels
Luis Chamberlain [Tue, 14 May 2019 22:45:07 +0000 (15:45 -0700)]
tools/testing/selftests/sysctl/sysctl.sh: allow graceful use on older kernels

On old kernels older new test knobs implemented on the test_sysctl
module may not be available.  This is expected, and the selftests test
scripts should be able to run without failures on older kernels.

Generalize a solution so that we test for each required test target file
for each test by requiring each test description to annotate their
respective test target file.  If the target file does not exist, we skip
the test gracefully.

Link: http://lkml.kernel.org/r/20190320222831.8243-5-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/sysctl/sysctl.sh: ignore diff output on verify_diff_w()
Luis Chamberlain [Tue, 14 May 2019 22:45:04 +0000 (15:45 -0700)]
tools/testing/selftests/sysctl/sysctl.sh: ignore diff output on verify_diff_w()

When verify_diff_w() is used we care about the result, not the verbose
output, and although we use -q, that still gives us a chatty message
about if the files differ or not.  Since verify_diff_w() uses stdinput
the chatty message says whether or not "-" matches the target file, and
this just seems rather odd.  Better to just ignore that messsage all
together, what we really care about i sthe results, the return value and
we check for that.

Link: http://lkml.kernel.org/r/20190320222831.8243-4-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/sysctl/sysctl.sh: load module before testing for it
Luis Chamberlain [Tue, 14 May 2019 22:45:01 +0000 (15:45 -0700)]
tools/testing/selftests/sysctl/sysctl.sh: load module before testing for it

Currently the test script checks for the existence of the sysctl test
module's directory path prior to loading it.  We must first try to load
the module prior to checking for that path.  This fixes the order for
the load / test.

Link: http://lkml.kernel.org/r/20190320222831.8243-3-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agotools/testing/selftests/sysctl/sysctl.sh: remove superfluous test_reqs()
Luis Chamberlain [Tue, 14 May 2019 22:44:58 +0000 (15:44 -0700)]
tools/testing/selftests/sysctl/sysctl.sh: remove superfluous test_reqs()

Patch series "sysctl: add pending proc_do_large_bitmap fix".

Eric sent a fix out for proc_do_large_bitmap() last month for when using
a large input buffer.  After patch review a test case for the issue was
built and submitted.  I noticed there were a few issues with the tests,
but instead of just asking Eric to address them I've taken care of them
and ammended the commit where necessary.  There's a few issues he
reported which I also address and fix in this series.

Since we *do* expect users of these scripts to also use them on older
kernels, I've also addressed not breaking calling the script for them,
and gives us an easy way to easily extend our tests cases for future
kernels as well.

Before anyone considers these for stable as minor fixes, I'd recommend
we also address the discrepancy on the read side of things: modify the
test script to use diff against the target file instead of using the
temp file.

This patch (of 6):

We already call test_reqs(), no need to call it twice.

Link: http://lkml.kernel.org/r/20190320222831.8243-2-mcgrof@kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agosysctl: return -EINVAL if val violates minmax
Christian Brauner [Tue, 14 May 2019 22:44:55 +0000 (15:44 -0700)]
sysctl: return -EINVAL if val violates minmax

Currently when userspace gives us a values that overflow e.g.  file-max
and other callers of __do_proc_doulongvec_minmax() we simply ignore the
new value and leave the current value untouched.

This can be problematic as it gives the illusion that the limit has
indeed be bumped when in fact it failed.  This commit makes sure to
return EINVAL when an overflow is detected.  Please note that this is a
userspace facing change.

Link: http://lkml.kernel.org/r/20190210203943.8227-4-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/sysctl.c: switch to bitmap_zalloc()
Andy Shevchenko [Tue, 14 May 2019 22:44:52 +0000 (15:44 -0700)]
kernel/sysctl.c: switch to bitmap_zalloc()

Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.

Link: http://lkml.kernel.org/r/20190304094037.57756-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agorapidio: fix a NULL pointer dereference when create_workqueue() fails
Kangjie Lu [Tue, 14 May 2019 22:44:49 +0000 (15:44 -0700)]
rapidio: fix a NULL pointer dereference when create_workqueue() fails

In case create_workqueue fails, the fix releases resources and returns
-ENOMEM to avoid NULL pointer dereference.

Signed-off-by: Kangjie Lu <kjlu@umn.edu>
Acked-by: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoinclude/linux/cpumask.h: fix double string traverse in cpumask_parse
Yury Norov [Tue, 14 May 2019 22:44:46 +0000 (15:44 -0700)]
include/linux/cpumask.h: fix double string traverse in cpumask_parse

cpumask_parse() finds first occurrence of either or strchr() and
strlen().  We can do it better with a single call of strchrnul().

[akpm@linux-foundation.org: remove unneeded cast]
Link: http://lkml.kernel.org/r/20190409204208.12190-1-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoexec selftests: test ->recursion_depth
Alexey Dobriyan [Tue, 14 May 2019 22:44:43 +0000 (15:44 -0700)]
exec selftests: test ->recursion_depth

Test that trivially recursing script onto itself doesn't work.

Note: this is different test from ELOOP tests in execveat.c Those test
that execveat(2) doesn't follow symlinks when told to do so.

Link: http://lkml.kernel.org/r/20190423192720.GA21433@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoexec: move struct linux_binprm::buf
Alexey Dobriyan [Tue, 14 May 2019 22:44:40 +0000 (15:44 -0700)]
exec: move struct linux_binprm::buf

struct linux_binprm::buf is the first field and it is exactly 128 bytes
in size.  It means that on x86_64 all accesses to other fields will go
though [r64 + disp32] addressing mode which is 3 bytes bloatier than
[r64 + disp8] addressing mode.  Given that accesses to other fields
outnumber accesses to ->buf, move it down.

Space savings (x86_64 defconfig):
more on distro configs because LSMs actively dereference "bprm"
but do not care about first 128 bytes of the executable itself.

add/remove: 0/0 grow/shrink: 0/24 up/down: 0/-492 (-492)
Function                                     old     new   delta
selinux_bprm_committing_creds                552     549      -3
finalize_exec                                 94      91      -3
__audit_log_bprm_fcaps                       283     280      -3
__audit_bprm                                  39      36      -3
perf_trace_sched_process_exec                347     341      -6
install_exec_creds                           105      99      -6
cap_bprm_set_creds.cold                       60      54      -6
would_dump                                   137     128      -9
load_script                                  637     628      -9
bprm_change_interp                            61      52      -9
trace_event_raw_event_sched_process_exec     260     250     -10
search_binary_handler                        255     240     -15
remove_arg_zero                              295     277     -18
free_bprm                                    119     101     -18
prepare_binprm                               379     360     -19
setup_new_exec                               336     315     -21
flush_old_exec                              1638    1617     -21
copy_strings.isra                            746     724     -22
setup_arg_pages                              559     530     -29
load_misc_binary                            1151    1118     -33
selinux_bprm_set_creds                       792     753     -39
load_elf_binary                            11111   11072     -39
cap_bprm_set_creds                          1496    1454     -42
__do_execve_file.isra                       2395    2286    -109

Link: http://lkml.kernel.org/r/20190421165025.GA26843@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/exec.c: move ->recursion_depth out of critical sections
Alexey Dobriyan [Tue, 14 May 2019 22:44:37 +0000 (15:44 -0700)]
fs/exec.c: move ->recursion_depth out of critical sections

->recursion_depth is changed only by current, therefore decrementing can
be done without taking any locks.

Link: http://lkml.kernel.org/r/20190417213150.GA26474@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/signal.c: annotate implicit fall through
Mathieu Malaterre [Tue, 14 May 2019 22:44:35 +0000 (15:44 -0700)]
kernel/signal.c: annotate implicit fall through

There is a plan to build the kernel with -Wimplicit-fallthrough and this
place in the code produced a warning (W=1).

This commit remove the following warning:

  kernel/signal.c:795:13: warning: this statement may fall through [-Wimplicit-fallthrough=]

Link: http://lkml.kernel.org/r/20190114203505.17875-1-malat@debian.org
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Acked-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/fat/file.c: issue flush after the writeback of FAT
Hou Tao [Tue, 14 May 2019 22:44:32 +0000 (15:44 -0700)]
fs/fat/file.c: issue flush after the writeback of FAT

fsync() needs to make sure the data & meta-data of file are persistent
after the return of fsync(), even when a power-failure occurs later.  In
the case of fat-fs, the FAT belongs to the meta-data of file, so we need
to issue a flush after the writeback of FAT instead before.

Also bail out early when any stage of fsync fails.

Link: http://lkml.kernel.org/r/20190409030158.136316-1-houtao1@huawei.com
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoreiserfs: add comment to explain endianness issue in xattr_hash
Bharath Vedartham [Tue, 14 May 2019 22:44:29 +0000 (15:44 -0700)]
reiserfs: add comment to explain endianness issue in xattr_hash

csum_partial() gives different results for little-endian and big-endian
hosts.  This causes images created on little-endian hosts and mounted on
big endian hosts to see csum mismatches.  This causes an endianness bug.
Sparse gives a warning as csum_partial returns a restricted integer type
__wsum_t and xattr_hash expects __u32.  This warning acts as a reminder
for this bug and should not be suppressed.

This comment aims to convey these endianness issues.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20190423161831.GA15387@bharath12345-Inspiron-5559
Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoautofs: add description of ignore pseudo mount option
Ian Kent [Tue, 14 May 2019 22:44:26 +0000 (15:44 -0700)]
autofs: add description of ignore pseudo mount option

Add a description of the "ignore" pseudo mount option that can be used
to provide a generic indicator to applications that the mount entry
should be ignored when displaying mount information.

Link: http://lkml.kernel.org/r/155287084617.12593.812733161112154904.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoautofs: update mount control expire desription with AUTOFS_EXP_FORCED
Ian Kent [Tue, 14 May 2019 22:44:23 +0000 (15:44 -0700)]
autofs: update mount control expire desription with AUTOFS_EXP_FORCED

Describe AUTOFS_EXP_FORCED in addition to AUTOFS_EXP_IMMEDIATE in the
description of the AUTOFS_DEV_IOCTL_EXPIRE_CMD ioctl.

Link: http://lkml.kernel.org/r/155287084078.12593.15000931045413195778.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoautofs: update AUTOFS_EXP_LEAVES description
Ian Kent [Tue, 14 May 2019 22:44:20 +0000 (15:44 -0700)]
autofs: update AUTOFS_EXP_LEAVES description

Update the description of AUTOFS_EXP_LEAVES to cover its possible future
use with amd format mount maps.

Link: http://lkml.kernel.org/r/155287083538.12593.18163159677020718048.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoautofs: update autofs.txt for strictexpire mount option
Ian Kent [Tue, 14 May 2019 22:44:17 +0000 (15:44 -0700)]
autofs: update autofs.txt for strictexpire mount option

A "strictexpire" mount option has been added to the autofs file system.

It is meant to be used in cases where a GUI continually accesses or an
application frquently scans an automount directory tree causing an
accumulation of otherwise unused mounts.

Link: http://lkml.kernel.org/r/155287083000.12593.2722713092537666885.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoautofs: fix some word usage oddities in autofs.txt
Ian Kent [Tue, 14 May 2019 22:44:14 +0000 (15:44 -0700)]
autofs: fix some word usage oddities in autofs.txt

Alter a few word usages in Documentation/filesystems/autofs.txt and
correct some spelling mistakes.

Link: http://lkml.kernel.org/r/155287082394.12593.6506084453911662450.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agonet: replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
Sinan Kaya [Tue, 14 May 2019 22:44:11 +0000 (15:44 -0700)]
net: replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC

CONFIG_DEBUG_KERNEL should not impact code generation.  Use the newly
defined CONFIG_DEBUG_MISC instead to keep the current code.

Link: http://lkml.kernel.org/r/20190413224438.10802-6-okaya@kernel.org
Signed-off-by: Sinan Kaya <okaya@kernel.org>
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Chris Zankel <chris@zankel.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoxtensa: replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
Sinan Kaya [Tue, 14 May 2019 22:44:07 +0000 (15:44 -0700)]
xtensa: replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC

CONFIG_DEBUG_KERNEL should not impact code generation.  Use the newly
defined CONFIG_DEBUG_MISC instead to keep the current code.

Link: http://lkml.kernel.org/r/20190413224438.10802-5-okaya@kernel.org
Signed-off-by: Sinan Kaya <okaya@kernel.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Florian Westphal <fw@strlen.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopowerpc: replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC
Sinan Kaya [Tue, 14 May 2019 22:44:04 +0000 (15:44 -0700)]
powerpc: replace CONFIG_DEBUG_KERNEL with CONFIG_DEBUG_MISC

CONFIG_DEBUG_KERNEL should not impact code generation.  Use the newly
defined CONFIG_DEBUG_MISC instead to keep the current code.

Link: http://lkml.kernel.org/r/20190413224438.10802-3-okaya@kernel.org
Signed-off-by: Sinan Kaya <okaya@kernel.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Florian Westphal <fw@strlen.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoinit: introduce DEBUG_MISC option
Sinan Kaya [Tue, 14 May 2019 22:44:00 +0000 (15:44 -0700)]
init: introduce DEBUG_MISC option

Patch series "init: Do not select DEBUG_KERNEL by default", v5.

CONFIG_DEBUG_KERNEL has been designed to just enable Kconfig options.
Kernel code generatoin should not depend on CONFIG_DEBUG_KERNEL.

Proposed alternative plan: let's add a new symbol, something like
DEBUG_MISC ("Miscellaneous debug code that should be under a more
specific debug option but isn't"), make it depend on DEBUG_KERNEL and be
"default DEBUG_KERNEL" but allow itself to be turned off, and then
mechanically change the small handful of "#ifdef CONFIG_DEBUG_KERNEL" to
"#ifdef CONFIG_DEBUG_MISC".

This patch (of 5):

Introduce DEBUG_MISC ("Miscellaneous debug code that should be under a
more specific debug option but isn't"), make it depend on DEBUG_KERNEL
and be "default DEBUG_KERNEL" but allow itself to be turned off, and
then mechanically change the small handful of "#ifdef
CONFIG_DEBUG_KERNEL" to "#ifdef CONFIG_DEBUG_MISC".

Link: http://lkml.kernel.org/r/20190413224438.10802-2-okaya@kernel.org
Signed-off-by: Sinan Kaya <okaya@kernel.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Florian Westphal <fw@strlen.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agobinfmt_elf: move brk out of mmap when doing direct loader exec
Kees Cook [Tue, 14 May 2019 22:43:57 +0000 (15:43 -0700)]
binfmt_elf: move brk out of mmap when doing direct loader exec

Commmit eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE"),
made changes in the rare case when the ELF loader was directly invoked
(e.g to set a non-inheritable LD_LIBRARY_PATH, testing new versions of
the loader), by moving into the mmap region to avoid both ET_EXEC and
PIE binaries.  This had the effect of also moving the brk region into
mmap, which could lead to the stack and brk being arbitrarily close to
each other.  An unlucky process wouldn't get its requested stack size
and stack allocations could end up scribbling on the heap.

This is illustrated here.  In the case of using the loader directly, brk
(so helpfully identified as "[heap]") is allocated with the _loader_ not
the binary.  For example, with ASLR entirely disabled, you can see this
more clearly:

$ /bin/cat /proc/self/maps
555555554000-55555555c000 r-xp 00000000 ... /bin/cat
55555575b000-55555575c000 r--p 00007000 ... /bin/cat
55555575c000-55555575d000 rw-p 00008000 ... /bin/cat
55555575d000-55555577e000 rw-p 00000000 ... [heap]
...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 ...
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]

$ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps
...
7ffff7bcc000-7ffff7bd4000 r-xp 00000000 ... /bin/cat
7ffff7bd4000-7ffff7dd3000 ---p 00008000 ... /bin/cat
7ffff7dd3000-7ffff7dd4000 r--p 00007000 ... /bin/cat
7ffff7dd4000-7ffff7dd5000 rw-p 00008000 ... /bin/cat
7ffff7dd5000-7ffff7dfc000 r-xp 00000000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7fb2000-7ffff7fd6000 rw-p 00000000 ...
7ffff7ff7000-7ffff7ffa000 r--p 00000000 ... [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 ... [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00027000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 ... /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff8020000 rw-p 00000000 ... [heap]
7ffffffde000-7ffffffff000 rw-p 00000000 ... [stack]

The solution is to move brk out of mmap and into ELF_ET_DYN_BASE since
nothing is there in the direct loader case (and ET_EXEC is still far
away at 0x400000).  Anything that ran before should still work (i.e.
the ultimately-launched binary already had the brk very far from its
text, so this should be no different from a COMPAT_BRK standpoint).  The
only risk I see here is that if someone started to suddenly depend on
the entire memory space lower than the mmap region being available when
launching binaries via a direct loader execs which seems highly
unlikely, I'd hope: this would mean a binary would _not_ work when
exec()ed normally.

(Note that this is only done under CONFIG_ARCH_HAS_ELF_RANDOMIZATION
when randomization is turned on.)

Link: http://lkml.kernel.org/r/20190422225727.GA21011@beast
Link: https://lkml.kernel.org/r/CAGXu5jJ5sj3emOT2QPxQkNQk0qbU6zEfu9=Omfhx_p0nCKPSjA@mail.gmail.com
Fixes: eab09532d400 ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Ali Saidi <alisaidi@amazon.com>
Cc: Ali Saidi <alisaidi@amazon.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoelf: init pt_regs pointer later
Alexey Dobriyan [Tue, 14 May 2019 22:43:54 +0000 (15:43 -0700)]
elf: init pt_regs pointer later

Get "current_pt_regs" pointer right before usage.

Space savings on x86_64:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-180 (-180)
Function                           old     new   delta
load_elf_binary                   5806    5626    -180 !!!

Looks like the compiler doesn't know that "current_pt_regs" is stable
pointer (because it doesn't know ->stack isn't) even though it knows
that "current" is stable pointer.  So it saves it in the very beginning
and then tries to carry it through a lot of code.

Here is what happens here:

load_elf_binary()
...
mov rax,QWORD PTR gs:0x14c00
mov r13,QWORD PTR [rax+0x18] r13 = current->stack
call kmem_cache_alloc # first kmalloc

[980 bytes later!]

# let's spill that sucker because we need a register
# for "load_bias" calculations at
#
# if (interpreter) {
# load_bias = ELF_ET_DYN_BASE;
# if (current->flags & PF_RANDOMIZE)
# load_bias += arch_mmap_rnd();
# elf_flags |= elf_fixed;
# }
mov QWORD PTR [rsp+0x68],r13

If this is not _the_ root cause it is still eeeeh.

After the patch things become much simpler:

mov rax, QWORD PTR gs:0x14c00 # current
mov rdx, QWORD PTR [rax+0x18] # current->stack
movq [rdx+0x3fb8], 0 # fill pt_regs
...
call finalize_exec

Link: http://lkml.kernel.org/r/20190419200343.GA19788@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Tested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/binfmt_elf.c: extract PROT_* calculations
Alexey Dobriyan [Tue, 14 May 2019 22:43:51 +0000 (15:43 -0700)]
fs/binfmt_elf.c: extract PROT_* calculations

There are two places where mapping protections are calculated: one for
executable, another one for interpreter -- take them out.

ELF read and execute permissions are interchanged with Linux PROT_READ
and PROT_EXEC, microoptimizations are welcome!

Link: http://lkml.kernel.org/r/20190417213413.GB26474@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs//binfmt_elf.c: move variables initialization closer to their usage
Alexey Dobriyan [Tue, 14 May 2019 22:43:48 +0000 (15:43 -0700)]
fs//binfmt_elf.c: move variables initialization closer to their usage

Link: http://lkml.kernel.org/r/20190416202002.GB24304@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/binfmt_elf.c: save 1 indent level
Alexey Dobriyan [Tue, 14 May 2019 22:43:45 +0000 (15:43 -0700)]
fs/binfmt_elf.c: save 1 indent level

Rewrite

for (...) {
if (->p_type == PT_INTERP) {
...
break;
}
}

loop into

for (...) {
if (->p_type != PT_INTERP)
continue;
...
break;
}

Link: http://lkml.kernel.org/r/20190416201906.GA24304@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/binfmt_elf.c: delete trailing "return;" in functions returning "void"
Alexey Dobriyan [Tue, 14 May 2019 22:43:42 +0000 (15:43 -0700)]
fs/binfmt_elf.c: delete trailing "return;" in functions returning "void"

Link: http://lkml.kernel.org/r/20190314205042.GE18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/binfmt_elf.c: free PT_INTERP filename ASAP
Alexey Dobriyan [Tue, 14 May 2019 22:43:39 +0000 (15:43 -0700)]
fs/binfmt_elf.c: free PT_INTERP filename ASAP

There is no reason for PT_INTERP filename to linger till the end of the
whole loading process.

Link: http://lkml.kernel.org/r/20190314204953.GD18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Nikitas Angelinas <nikitas.angelinas@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mukesh Ojha <mojha@codeaurora.org>
[nikitas.angelinas@gmail.com: fix GPF when dereferencing invalid interpreter]
Link: http://lkml.kernel.org/r/20190330140032.GA1527@vostro
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/binfmt_elf.c: make scope of "pos" variable smaller
Alexey Dobriyan [Tue, 14 May 2019 22:43:36 +0000 (15:43 -0700)]
fs/binfmt_elf.c: make scope of "pos" variable smaller

Link: http://lkml.kernel.org/r/20190314204707.GC18143@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agofs/binfmt_elf.c: remove unneeded initialization of mm->start_stack
Andrew Morton [Tue, 14 May 2019 22:43:34 +0000 (15:43 -0700)]
fs/binfmt_elf.c: remove unneeded initialization of mm->start_stack

As pointed out by zoujc@lenovo.com, setup_arg_pages() already
initialized current->mm->start_stack.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202881
Reported-by: <zoujc@lenovo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/test_vmalloc.c:test_func(): eliminate local `ret'
Andrew Morton [Tue, 14 May 2019 22:43:30 +0000 (15:43 -0700)]
lib/test_vmalloc.c:test_func(): eliminate local `ret'

Local 'ret' is unneeded and was poorly named: the variable `ret'
generally means the "the value which this function will return".

Cc: Roman Gushchin <guro@fb.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoinclude/linux/bitops.h: sanitize rotate primitives
Rasmus Villemoes [Tue, 14 May 2019 22:43:27 +0000 (15:43 -0700)]
include/linux/bitops.h: sanitize rotate primitives

The ror32 implementation (word >> shift) | (word << (32 - shift) has
undefined behaviour if shift is outside the [1, 31] range.  Similarly
for the 64 bit variants.  Most callers pass a compile-time constant
(naturally in that range), but there's an UBSAN report that these may
actually be called with a shift count of 0.

Instead of special-casing that, we can make them DTRT for all values of
shift while also avoiding UB.  For some reason, this was already partly
done for rol32 (which was well-defined for [0, 31]).  gcc 8 recognizes
these patterns as rotates, so for example

  __u32 rol32(__u32 word, unsigned int shift)
  {
return (word << (shift & 31)) | (word >> ((-shift) & 31));
  }

compiles to

0000000000000020 <rol32>:
  20:   89 f8                   mov    %edi,%eax
  22:   89 f1                   mov    %esi,%ecx
  24:   d3 c0                   rol    %cl,%eax
  26:   c3                      retq

Older compilers unfortunately do not do as well, but this only affects
the small minority of users that don't pass constants.

Due to integer promotions, ro[lr]8 were already well-defined for shifts
in [0, 8], and ro[lr]16 were mostly well-defined for shifts in [0, 16]
(only mostly - u16 gets promoted to _signed_ int, so if bit 15 is set,
word << 16 is undefined).  For consistency, update those as well.

Link: http://lkml.kernel.org/r/20190410211906.2190-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reported-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Cc: Vadim Pasternak <vadimp@mellanox.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/test_bitmap: add tests for bitmap_parselist_user()
Yury Norov [Tue, 14 May 2019 22:43:24 +0000 (15:43 -0700)]
lib/test_bitmap: add tests for bitmap_parselist_user()

Propagate existing bitmap_parselist() tests to bitmap_parselist_user().

Link: http://lkml.kernel.org/r/20190405173211.11373-6-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/test_bitmap: add testcases for bitmap_parselist()
Yury Norov [Tue, 14 May 2019 22:43:21 +0000 (15:43 -0700)]
lib/test_bitmap: add testcases for bitmap_parselist()

Add tests for non-number character, empty regions, integer overflow.

[ynorov@marvell.com: v5]
Link: http://lkml.kernel.org/r/20190416063801.20134-5-ynorov@marvell.com
Link: http://lkml.kernel.org/r/20190405173211.11373-5-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/test_bitmap: switch test_bitmap_parselist to ktime_get()
Yury Norov [Tue, 14 May 2019 22:43:18 +0000 (15:43 -0700)]
lib/test_bitmap: switch test_bitmap_parselist to ktime_get()

test_bitmap_parselist currently uses get_cycles which is not implemented
on some platforms, so use ktime_get() instead.

Link: http://lkml.kernel.org/r/20190405173211.11373-4-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib: rework bitmap_parselist
Yury Norov [Tue, 14 May 2019 22:43:14 +0000 (15:43 -0700)]
lib: rework bitmap_parselist

Remove __bitmap_parselist helper and split the function to logical
parts.

[ynorov@marvell.com: v5]
Link: http://lkml.kernel.org/r/20190416063801.20134-3-ynorov@marvell.com
Link: http://lkml.kernel.org/r/20190405173211.11373-3-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib: make bitmap_parselist_user() a wrapper on bitmap_parselist()
Yury Norov [Tue, 14 May 2019 22:43:11 +0000 (15:43 -0700)]
lib: make bitmap_parselist_user() a wrapper on bitmap_parselist()

Patch series "lib: rework bitmap_parselist and tests", v5.

bitmap_parselist has been evolved from a pretty simple idea for long and
now lacks for refactoring.  It is not structured, has nested loops and a
set of opaque-named variables.

Things are more complicated because bitmap_parselist() is a part of user
interface, and its behavior should not change.

In this patchset
 - bitmap_parselist_user() made a wrapper on bitmap_parselist();
 - bitmap_parselist() reworked (patch 2);
 - time measurement in test_bitmap_parselist switched to ktime_get
   (patch 3);
 - new tests introduced (patch 4), and
 - bitmap_parselist_user() testing enabled with the same testset as
   bitmap_parselist() (patch 5).

This patch (of 5):

Currently we parse user data byte after byte which leads to
overcomplification of parsing algorithm.  The only user of
bitmap_parselist_user() is not performance-critical, and so we can
duplicate user data to kernel buffer and simply call bitmap_parselist().
This rework lets us unify and simplify bitmap_parselist() and
bitmap_parselist_user(), which is done in the following patch.

Link: http://lkml.kernel.org/r/20190405173211.11373-2-ynorov@marvell.com
Signed-off-by: Yury Norov <ynorov@marvell.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Mike Travis <travis@sgi.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/math: move int_pow() from pwm_bl.c for wider use
Andy Shevchenko [Tue, 14 May 2019 22:43:08 +0000 (15:43 -0700)]
lib/math: move int_pow() from pwm_bl.c for wider use

The integer exponentiation is used in few places and might be used in
the future by other call sites.  Move it to wider use.

Link: http://lkml.kernel.org/r/20190323172531.80025-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Ray Jui <rjui@broadcom.com>
Cc: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib: Move mathematic helpers to separate folder
Andy Shevchenko [Tue, 14 May 2019 22:43:05 +0000 (15:43 -0700)]
lib: Move mathematic helpers to separate folder

For better maintenance and expansion move the mathematic helpers to the
separate folder.

No functional change intended.

Note, the int_sqrt() is not used as a part of lib, so, moved to regular
obj.

Link: http://lkml.kernel.org/r/20190323172531.80025-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thierry Reding <thierry.reding@gmail.com>
Cc: Lee Jones <lee.jones@linaro.org>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Ray Jui <rjui@broadcom.com>
[mchehab+samsung@kernel.org: fix broken doc references for div64.c and gcd.c]
Link: http://lkml.kernel.org/r/734f49bae5d4052b3c25691dfefad59bea2e5843.1555580999.git.mchehab+samsung@kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/list_sort: optimize number of calls to comparison function
George Spelvin [Tue, 14 May 2019 22:43:02 +0000 (15:43 -0700)]
lib/list_sort: optimize number of calls to comparison function

CONFIG_RETPOLINE has severely degraded indirect function call
performance, so it's worth putting some effort into reducing the number
of times cmp() is called.

This patch avoids badly unbalanced merges on unlucky input sizes.  It
slightly increases the code size, but saves an average of 0.2*n calls to
cmp().

x86-64 code size 739 -> 803 bytes (+64)

Unfortunately, there's not a lot of low-hanging fruit in a merge sort;
it already performs only n*log2(n) - K*n + O(1) compares.  The leading
coefficient is already at the theoretical limit (log2(n!) corresponds to
K=1.4427), so we're fighting over the linear term, and the best
mergesort can do is K=1.2645, achieved when n is a power of 2.

The differences between mergesort variants appear when n is *not* a
power of 2; K is a function of the fractional part of log2(n).  Top-down
mergesort does best of all, achieving a minimum K=1.2408, and an average
(over all sizes) K=1.248.  However, that requires knowing the number of
entries to be sorted ahead of time, and making a full pass over the
input to count it conflicts with a second performance goal, which is
cache blocking.

Obviously, we have to read the entire list into L1 cache at some point,
and performance is best if it fits.  But if it doesn't fit, each full
pass over the input causes a cache miss per element, which is
undesirable.

While textbooks explain bottom-up mergesort as a succession of merging
passes, practical implementations do merging in depth-first order: as
soon as two lists of the same size are available, they are merged.  This
allows as many merge passes as possible to fit into L1; only the final
few merges force cache misses.

This cache-friendly depth-first merge order depends on us merging the
beginning of the input as much as possible before we've even seen the
end of the input (and thus know its size).

The simple eager merge pattern causes bad performance when n is just
over a power of 2.  If n=1028, the final merge is between 1024- and
4-element lists, which is wasteful of comparisons.  (This is actually
worse on average than n=1025, because a 1204:1 merge will, on average,
end after 512 compares, while 1024:4 will walk 4/5 of the list.)

Because of this, bottom-up mergesort achieves K < 0.5 for such sizes,
and has an average (over all sizes) K of around 1.  (My experiments show
K=1.01, while theory predicts K=0.965.)

There are "worst-case optimal" variants of bottom-up mergesort which
avoid this bad performance, but the algorithms given in the literature,
such as queue-mergesort and boustrodephonic mergesort, depend on the
breadth-first multi-pass structure that we are trying to avoid.

This implementation is as eager as possible while ensuring that all
merge passes are at worst 1:2 unbalanced.  This achieves the same
average K=1.207 as queue-mergesort, which is 0.2*n better then
bottom-up, and only 0.04*n behind top-down mergesort.

Specifically, defers merging two lists of size 2^k until it is known
that there are 2^k additional inputs following.  This ensures that the
final uneven merges triggered by reaching the end of the input will be
at worst 2:1.  This will avoid cache misses as long as 3*2^k elements
fit into the cache.

(I confess to being more than a little bit proud of how clean this code
turned out.  It took a lot of thinking, but the resultant inner loop is
very simple and efficient.)

Refs:
  Bottom-up Mergesort: A Detailed Analysis
  Wolfgang Panny, Helmut Prodinger
  Algorithmica 14(4):340--354, October 1995
  https://doi.org/10.1007/BF01294131
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260

  The cost distribution of queue-mergesort, optimal mergesorts, and
  power-of-two rules
  Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen
  Journal of Algorithms 30(2); Pages 423--448, February 1999
  https://doi.org/10.1006/jagm.1998.0986
  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380

  Queue-Mergesort
  Mordecai J. Golin, Robert Sedgewick
  Information Processing Letters, 48(5):253--259, 10 December 1993
  https://doi.org/10.1016/0020-0190(93)90088-q
  https://sci-hub.tw/10.1016/0020-0190(93)90088-Q

Feedback from Rasmus Villemoes <linux@rasmusvillemoes.dk>.

Link: http://lkml.kernel.org/r/fd560853cc4dca0d0f02184ffa888b4c1be89abc.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/list_sort: simplify and remove MAX_LIST_LENGTH_BITS
George Spelvin [Tue, 14 May 2019 22:42:58 +0000 (15:42 -0700)]
lib/list_sort: simplify and remove MAX_LIST_LENGTH_BITS

Rather than a fixed-size array of pending sorted runs, use the ->prev
links to keep track of things.  This reduces stack usage, eliminates
some ugly overflow handling, and reduces the code size.

Also:
* merge() no longer needs to handle NULL inputs, so simplify.
* The same applies to merge_and_restore_back_links(), which is renamed
  to the less ponderous merge_final().  (It's a static helper function,
  so we don't need a super-descriptive name; comments will do.)
* Document the actual return value requirements on the (*cmp)()
  function; some callers are already using this feature.

x86-64 code size 1086 -> 739 bytes (-347)

(Yes, I see checkpatch complaining about no space after comma in
"__attribute__((nonnull(2,3,4,5)))".  Checkpatch is wrong.)

Feedback from Rasmus Villemoes, Andy Shevchenko and Geert Uytterhoeven.

[akpm@linux-foundation.org: remove __pure usage due to mysterious warning]
Link: http://lkml.kernel.org/r/f63c410e0ff76009c9b58e01027e751ff7fdb749.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Don Mullis <don.mullis@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/sort: avoid indirect calls to built-in swap
George Spelvin [Tue, 14 May 2019 22:42:55 +0000 (15:42 -0700)]
lib/sort: avoid indirect calls to built-in swap

Similar to what's being done in the net code, this takes advantage of
the fact that most invocations use only a few common swap functions, and
replaces indirect calls to them with (highly predictable) conditional
branches.  (The downside, of course, is that if you *do* use a custom
swap function, there are a few extra predicted branches on the code
path.)

This actually *shrinks* the x86-64 code, because it inlines the various
swap functions inside do_swap, eliding function prologues & epilogues.

x86-64 code size 767 -> 703 bytes (-64)

Link: http://lkml.kernel.org/r/d10c5d4b393a1847f32f5b26f4bbaa2857140e1e.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/sort: use more efficient bottom-up heapsort variant
George Spelvin [Tue, 14 May 2019 22:42:52 +0000 (15:42 -0700)]
lib/sort: use more efficient bottom-up heapsort variant

This uses fewer comparisons than the previous code (approaching half as
many for large random inputs), but produces identical results; it
actually performs the exact same series of swap operations.

Specifically, it reduces the average number of compares from
  2*n*log2(n) - 3*n + o(n)
to
    n*log2(n) + 0.37*n + o(n).

This is still 1.63*n worse than glibc qsort() which manages n*log2(n) -
1.26*n, but at least the leading coefficient is correct.

Standard heapsort, when sifting down, performs two comparisons per
level: one to find the greater child, and a second to see if the current
node should be exchanged with that child.

Bottom-up heapsort observes that it's better to postpone the second
comparison and search for the leaf where -infinity would be sent to,
then search back *up* for the current node's destination.

Since sifting down usually proceeds to the leaf level (that's where half
the nodes are), this does O(1) second comparisons rather than log2(n).
That saves a lot of (expensive since Spectre) indirect function calls.

The one time it's worse than the previous code is if there are large
numbers of duplicate keys, when the top-down algorithm is O(n) and
bottom-up is O(n log n).  For distinct keys, it's provably always
better, doing 1.5*n*log2(n) + O(n) in the worst case.

(The code is not significantly more complex.  This patch also merges the
heap-building and -extracting sift-down loops, resulting in a net code
size savings.)

x86-64 code size 885 -> 767 bytes (-118)

(I see the checkpatch complaint about "else if (n -= size)".  The
alternative is significantly uglier.)

Link: http://lkml.kernel.org/r/2de8348635a1a421a72620677898c7fd5bd4b19d.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/sort: make swap functions more generic
George Spelvin [Tue, 14 May 2019 22:42:49 +0000 (15:42 -0700)]
lib/sort: make swap functions more generic

Patch series "lib/sort & lib/list_sort: faster and smaller", v2.

Because CONFIG_RETPOLINE has made indirect calls much more expensive, I
thought I'd try to reduce the number made by the library sort functions.

The first three patches apply to lib/sort.c.

Patch #1 is a simple optimization.  The built-in swap has special cases
for aligned 4- and 8-byte objects.  But those are almost never used;
most calls to sort() work on larger structures, which fall back to the
byte-at-a-time loop.  This generalizes them to aligned *multiples* of 4
and 8 bytes.  (If nothing else, it saves an awful lot of energy by not
thrashing the store buffers as much.)

Patch #2 grabs a juicy piece of low-hanging fruit.  I agree that nice
simple solid heapsort is preferable to more complex algorithms (sorry,
Andrey), but it's possible to implement heapsort with far fewer
comparisons (50% asymptotically, 25-40% reduction for realistic sizes)
than the way it's been done up to now.  And with some care, the code
ends up smaller, as well.  This is the "big win" patch.

Patch #3 adds the same sort of indirect call bypass that has been added
to the net code of late.  The great majority of the callers use the
builtin swap functions, so replace the indirect call to sort_func with a
(highly preditable) series of if() statements.  Rather surprisingly,
this decreased code size, as the swap functions were inlined and their
prologue & epilogue code eliminated.

lib/list_sort.c is a bit trickier, as merge sort is already close to
optimal, and we don't want to introduce triumphs of theory over
practicality like the Ford-Johnson merge-insertion sort.

Patch #4, without changing the algorithm, chops 32% off the code size
and removes the part[MAX_LIST_LENGTH+1] pointer array (and the
corresponding upper limit on efficiently sortable input size).

Patch #5 improves the algorithm.  The previous code is already optimal
for power-of-two (or slightly smaller) size inputs, but when the input
size is just over a power of 2, there's a very unbalanced final merge.

There are, in the literature, several algorithms which solve this, but
they all depend on the "breadth-first" merge order which was replaced by
commit 835cc0c8477f with a more cache-friendly "depth-first" order.
Some hard thinking came up with a depth-first algorithm which defers
merges as little as possible while avoiding bad merges.  This saves
0.2*n compares, averaged over all sizes.

The code size increase is minimal (64 bytes on x86-64, reducing the net
savings to 26%), but the comments expanded significantly to document the
clever algorithm.

TESTING NOTES: I have some ugly user-space benchmarking code which I
used for testing before moving this code into the kernel.  Shout if you
want a copy.

I'm running this code right now, with CONFIG_TEST_SORT and
CONFIG_TEST_LIST_SORT, but I confess I haven't rebooted since the last
round of minor edits to quell checkpatch.  I figure there will be at
least one round of comments and final testing.

This patch (of 5):

Rather than having special-case swap functions for 4- and 8-byte
objects, special-case aligned multiples of 4 or 8 bytes.  This speeds up
most users of sort() by avoiding fallback to the byte copy loop.

Despite what ca96ab859ab4 ("lib/sort: Add 64 bit swap function") claims,
very few users of sort() sort pointers (or pointer-sized objects); most
sort structures containing at least two words.  (E.g.
drivers/acpi/fan.c:acpi_fan_get_fps() sorts an array of 40-byte struct
acpi_fan_fps.)

The functions also got renamed to reflect the fact that they support
multiple words.  In the great tradition of bikeshedding, the names were
by far the most contentious issue during review of this patch series.

x86-64 code size 872 -> 886 bytes (+14)

With feedback from Andy Shevchenko, Rasmus Villemoes and Geert
Uytterhoeven.

Link: http://lkml.kernel.org/r/f24f932df3a7fa1973c1084154f1cea596bcf341.1552704200.git.lkml@sdf.org
Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Andrey Abramov <st5pub@yandex.ru>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Daniel Wagner <daniel.wagner@siemens.com>
Cc: Don Mullis <don.mullis@gmail.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST
Davidlohr Bueso [Tue, 14 May 2019 22:42:46 +0000 (15:42 -0700)]
lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST

This is a lot more appropriate than PI_LIST, which in the kernel one
would assume that it has to do with priority-inheritance; which is not
-- furthermore futexes make use of plists so this can be even more
confusing, albeit the debug nature of the config option.

Link: http://lkml.kernel.org/r/20190317185434.1626-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/bitmap.c: guard exotic bitmap functions by CONFIG_NUMA
Rasmus Villemoes [Tue, 14 May 2019 22:42:43 +0000 (15:42 -0700)]
lib/bitmap.c: guard exotic bitmap functions by CONFIG_NUMA

The bitmap_remap, _bitremap, _onto and _fold functions are only used,
via their node_ wrappers, in mm/mempolicy.c, which is only built for
CONFIG_NUMA.  The helper bitmap_ord_to_pos used by these functions is
global, but its only external caller is node_random() in lib/nodemask.c,
which is also guarded by CONFIG_NUMA.

For !CONFIG_NUMA:

add/remove: 0/6 grow/shrink: 0/0 up/down: 0/-621 (-621)
Function                                     old     new   delta
bitmap_pos_to_ord                             20       -     -20
bitmap_ord_to_pos                             70       -     -70
bitmap_bitremap                               81       -     -81
bitmap_fold                                  113       -    -113
bitmap_onto                                  123       -    -123
bitmap_remap                                 214       -    -214
Total: Before=4776, After=4155, chg -13.00%

Link: http://lkml.kernel.org/r/20190329205353.6010-2-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agolib/bitmap.c: remove unused EXPORT_SYMBOLs
Rasmus Villemoes [Tue, 14 May 2019 22:42:40 +0000 (15:42 -0700)]
lib/bitmap.c: remove unused EXPORT_SYMBOLs

AFAICT, there have never been any callers of these functions outside
mm/mempolicy.c (via their nodemask.h wrappers).  In particular, no
modular code has ever used them, and given their somewhat exotic
semantics, I highly doubt they will ever find such a use.  In any case,
no need to export them currently.

Link: http://lkml.kernel.org/r/20190329205353.6010-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/user.c: clean up some leftover code
Rasmus Villemoes [Tue, 14 May 2019 22:42:37 +0000 (15:42 -0700)]
kernel/user.c: clean up some leftover code

The out_unlock label is misleading; no unlocking happens after it, so
just return NULL directly.

Also, nothing between the kmem_cache_zalloc() that creates new and the
two key_put() can initialize new->uid_keyring or new->session_keyring,
so those calls are no-ops.

Link: http://lkml.kernel.org/r/20190424200404.9114-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing
Lin Feng [Tue, 14 May 2019 22:42:34 +0000 (15:42 -0700)]
kernel/latencytop.c: rename clear_all_latency_tracing to clear_tsk_latency_tracing

The name clear_all_latency_tracing is misleading, in fact which only
clear per task's latency_record[], and we do have another function named
clear_global_latency_tracing which clear the global latency_record[]
buffer.

Link: http://lkml.kernel.org/r/20190226114602.16902-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/latencytop.c: remove unnecessary checks for latencytop_enabled
Lin Feng [Tue, 14 May 2019 22:42:31 +0000 (15:42 -0700)]
kernel/latencytop.c: remove unnecessary checks for latencytop_enabled

1. In latencytop source codes, we only have such calling chain:

account_scheduler_latency(struct task_struct *task, int usecs, int inter)
{
        if (unlikely(latencytop_enabled)) /* the outtermost check */
                __account_scheduler_latency(task, usecs, inter);
}
__account_scheduler_latency
    account_global_scheduler_latency
        if (!latencytop_enabled)

So, the inner check for latencytop_enabled is not necessary at all.

2. In clear_all_latency_tracing and now is called
   clear_tsk_latency_tracing the check for latencytop_enabled is redundant
   and buggy to some extent.

   We have no reason to refuse clearing the /proc/$pid/latency if
   latencytop_enabled is set to 0, considering that if we use latencytop
   manually by echo 0 > /proc/sys/kernel/latencytop, then we want to clear
   /proc/$pid/latency and failed.

   Also we don't have such check in brother function
   clear_global_latency_tracing.

Notes: These changes are only visible to users who set
   CONFIG_LATENCYTOP and won't change user tool latencytop's behavior.

Link: http://lkml.kernel.org/r/20190226114602.16902-2-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokernel/notifier.c: double register detection
Vasily Averin [Tue, 14 May 2019 22:42:28 +0000 (15:42 -0700)]
kernel/notifier.c: double register detection

By design notifiers can be registerd once only, 2nd register attempt
called by mistake silently corrupts notifiers list.

A few years ago I investigated described problem, the host was power
cycled because of notifier list corruption.  I've prepared this patch
and applied it to the OpenVZ kernel and sent this patch but nobody
commented on it.  Later it helped us to detect a similar problem in the
OpenVz kernel.

Mistakes with notifier registration can happen for example during
subsystem initialization from different namespaces, or because of a lost
unregister in the roll-back path on initialization failures.

The proposed check cannot prevent the described problem, however it
allows us to detect its reason quickly without coredump analysis.

Link: http://lkml.kernel.org/r/04127e71-4782-9bbb-fe5a-7c01e93a99b0@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agocompiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING
Masahiro Yamada [Tue, 14 May 2019 22:42:25 +0000 (15:42 -0700)]
compiler: allow all arches to enable CONFIG_OPTIMIZE_INLINING

Commit 60a3cdd06394 ("x86: add optimized inlining") introduced
CONFIG_OPTIMIZE_INLINING, but it has been available only for x86.

The idea is obviously arch-agnostic.  This commit moves the config entry
from arch/x86/Kconfig.debug to lib/Kconfig.debug so that all
architectures can benefit from it.

This can make a huge difference in kernel image size especially when
CONFIG_OPTIMIZE_FOR_SIZE is enabled.

For example, I got 3.5% smaller arm64 kernel for v5.1-rc1.

  dec       file
  18983424  arch/arm64/boot/Image.before
  18321920  arch/arm64/boot/Image.after

This also slightly improves the "Kernel hacking" Kconfig menu as
e61aca5158a8 ("Merge branch 'kconfig-diet' from Dave Hansen') suggested;
this config option would be a good fit in the "compiler option" menu.

Link: http://lkml.kernel.org/r/20190423034959.13525-12-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopowerpc/mm/radix: mark as __tlbie_pid() and friends as__always_inline
Masahiro Yamada [Tue, 14 May 2019 22:42:21 +0000 (15:42 -0700)]
powerpc/mm/radix: mark as __tlbie_pid() and friends as__always_inline

This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
place.  We need to eliminate potential issues beforehand.

If it is enabled for powerpc, the following errors are reported:

  arch/powerpc/mm/tlb-radix.c: In function '__tlbie_lpid':
  arch/powerpc/mm/tlb-radix.c:148:2: warning: asm operand 3 probably doesn't match constraints
    asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
    ^~~
  arch/powerpc/mm/tlb-radix.c:148:2: error: impossible constraint in 'asm'
  arch/powerpc/mm/tlb-radix.c: In function '__tlbie_pid':
  arch/powerpc/mm/tlb-radix.c:118:2: warning: asm operand 3 probably doesn't match constraints
    asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
    ^~~
  arch/powerpc/mm/tlb-radix.c: In function '__tlbiel_pid':
  arch/powerpc/mm/tlb-radix.c:104:2: warning: asm operand 3 probably doesn't match constraints
    asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
    ^~~

Link: http://lkml.kernel.org/r/20190423034959.13525-11-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopowerpc/mm/radix: mark __radix__flush_tlb_range_psize() as __always_inline
Masahiro Yamada [Tue, 14 May 2019 22:42:17 +0000 (15:42 -0700)]
powerpc/mm/radix: mark __radix__flush_tlb_range_psize() as __always_inline

This prepares to move CONFIG_OPTIMIZE_INLINING from x86 to a common
place.  We need to eliminate potential issues beforehand.

If it is enabled for powerpc, the following error is reported:

  arch/powerpc/mm/tlb-radix.c: In function '__radix__flush_tlb_range_psize':
  arch/powerpc/mm/tlb-radix.c:104:2: error: asm operand 3 probably doesn't match constraints [-Werror]
    asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
    ^~~
  arch/powerpc/mm/tlb-radix.c:104:2: error: impossible constraint in 'asm'

Link: http://lkml.kernel.org/r/20190423034959.13525-10-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boris Brezillon <bbrezillon@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Stefan Agner <stefan@agner.ch>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>