mm/vmscan: don't mess with pgdat->flags in memcg reclaim
memcg reclaim may alter pgdat->flags based on the state of LRU lists in
cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep
congested_wait(), PGDAT_DIRTY may force kswapd to writeback filesystem
pages. But the worst here is PGDAT_CONGESTED, since it may force all
direct reclaims to stall in wait_iff_congested(). Note that only kswapd
have powers to clear any of these bits. This might just never happen if
cgroup limits configured that way. So all direct reclaims will stall as
long as we have some congested bdi in the system.
Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
it's reasonable to leave all decisions about node state to kswapd.
Why only kswapd? Why not allow to global direct reclaim change these
flags? It is because currently only kswapd can clear these flags. I'm
less worried about the case when PGDAT_CONGESTED falsely not set, and
more worried about the case when it falsely set. If direct reclaimer
sets PGDAT_CONGESTED, do we have guarantee that after the congestion
problem is sorted out, kswapd will be woken up and clear the flag? It
seems like there is no such guarantee. E.g. direct reclaimers may
eventually balance pgdat and kswapd simply won't wake up (see
wakeup_kswapd()).
Moving pgdat->flags manipulation to kswapd, means that cgroup2 recalim
now loses its congestion throttling mechanism. Add per-cgroup
congestion state and throttle cgroup2 reclaimers if memcg is in
congestion state.
Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
bits since they alter only kswapd behavior.
The problem could be easily demonstrated by creating heavy congestion in
one cgroup:
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
mkdir -p /sys/fs/cgroup/congester
echo 512M > /sys/fs/cgroup/congester/memory.max
echo $$ > /sys/fs/cgroup/congester/cgroup.procs
/* generate a lot of diry data on slow HDD */
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
....
while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
mm/vmscan: don't change pgdat state on base of a single LRU list state
We have separate LRU list for each memory cgroup. Memory reclaim
iterates over cgroups and calls shrink_inactive_list() every inactive
LRU list. Based on the state of a single LRU shrink_inactive_list() may
flag the whole node as dirty,congested or under writeback. This is
obviously wrong and hurtful. It's especially hurtful when we have
possibly small congested cgroup in system. Than *all* direct reclaims
waste time by sleeping in wait_iff_congested(). And the more memcgs in
the system we have the longer memory allocation stall is, because
wait_iff_congested() called on each lru-list scan.
Sum reclaim stats across all visited LRUs on node and flag node as
dirty, congested or under writeback based on that sum. Also call
congestion_wait(), wait_iff_congested() once per pgdat scan, instead of
once per lru-list scan.
This only fixes the problem for global reclaim case. Per-cgroup reclaim
may alter global pgdat flags too, which is wrong. But that is separate
issue and will be addressed in the next patch.
This change will not have any effect on a systems with all workload
concentrated in a single cgroup.
[aryabinin@virtuozzo.com: check nr_writeback against all nr_taken, not just file] Link: http://lkml.kernel.org/r/20180406180254.8970-1-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/20180323152029.11084-4-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Only kswapd can have non-zero nr_immediate, and current_may_throttle()
is always true for kswapd (PF_LESS_THROTTLE bit is never set) thus it's
enough to check stat.nr_immediate only.
Roman Gushchin [Tue, 10 Apr 2018 23:27:47 +0000 (16:27 -0700)]
mm: treat indirectly reclaimable memory as free in overcommit logic
Indirectly reclaimable memory can consume a significant part of total
memory and it's actually reclaimable (it will be released under actual
memory pressure).
So, the overcommit logic should treat it as free.
Otherwise, it's possible to cause random system-wide memory allocation
failures by consuming a significant amount of memory by indirectly
reclaimable memory, e.g. dentry external names.
If overcommit policy GUESS is used, it might be used for denial of
service attack under some conditions.
The following program illustrates the approach. It causes the kernel to
allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
that at some point the overcommit logic may start blocking large
allocation system-wide.
int main()
{
char buf[256];
unsigned long i;
struct stat statbuf;
buf[0] = '/';
for (i = 1; i < sizeof(buf); i++)
buf[i] = '_';
for (i = 0; 1; i++) {
sprintf(&buf[248], "%8lu", i);
stat(buf, &statbuf);
}
return 0;
}
This patch in combination with related indirectly reclaimable memory
patches closes this issue.
Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Tue, 10 Apr 2018 23:27:44 +0000 (16:27 -0700)]
dcache: account external names as indirectly reclaimable memory
I received a report about suspicious growth of unreclaimable slabs on
some machines. I've found that it happens on machines with low memory
pressure, and these unreclaimable slabs are external names attached to
dentries.
External names are allocated using generic kmalloc() function, so they
are accounted as unreclaimable. But they are held by dentries, which
are reclaimable, and they will be reclaimed under the memory pressure.
In particular, this breaks MemAvailable calculation, as it doesn't take
unreclaimable slabs into account. This leads to a silly situation, when
a machine is almost idle, has no memory pressure and therefore has a big
dentry cache. And the resulting MemAvailable is too low to start a new
workload.
To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter is
used to track the amount of memory, consumed by external names. The
counter is increased in the dentry allocation path, if an external name
structure is allocated; and it's decreased in the dentry freeing path.
To reproduce the problem I've used the following Python script:
import os
for iter in range (0, 10000000):
try:
name = ("/some_long_name_%d" % iter) + "_" * 220
os.stat(name)
except Exception:
pass
Roman Gushchin [Tue, 10 Apr 2018 23:27:36 +0000 (16:27 -0700)]
mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES
Patch series "indirectly reclaimable memory", v2.
This patchset introduces the concept of indirectly reclaimable memory
and applies it to fix the issue of when a big number of dentries with
external names can significantly affect the MemAvailable value.
This patch (of 3):
Introduce a concept of indirectly reclaimable memory and adds the
corresponding memory counter and /proc/vmstat item.
Indirectly reclaimable memory is any sort of memory, used by the kernel
(except of reclaimable slabs), which is actually reclaimable, i.e. will
be released under memory pressure.
The counter is in bytes, as it's not always possible to count such
objects in pages. The name contains BYTES by analogy to
NR_KERNEL_STACK_KB.
Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips
Pull MIPS updates from James Hogan:
"These are the main MIPS changes for 4.17. Rough overview:
(1) generic platform: Add support for Microsemi Ocelot SoCs
(2) crypto: Add CRC32 and CRC32C HW acceleration module
(3) Various cleanups and misc improvements
More detailed summary:
Miscellaneous:
- hang more efficiently on halt/powerdown/restart
- pm-cps: Block system suspend when a JTAG probe is present
- expand make help text for generic defconfigs
- refactor handling of legacy defconfigs
- determine the entry point from the ELF file header to fix microMIPS
for certain toolchains
- introduce isa-rev.h for MIPS_ISA_REV and use to simplify other code
Minor cleanups:
- DTS: boston/ci20: Unit name cleanups and correction
- kdump: Make the default for PHYSICAL_START always 64-bit
- constify gpio_led in Alchemy, AR7, and TXX9
- silence a couple of W=1 warnings
- remove duplicate includes
Platform support:
Generic platform:
- add support for Microsemi Ocelot
- dt-bindings: Add vendor prefix for Microsemi Corporation
- dt-bindings: Add bindings for Microsemi SoCs
- add ocelot SoC & PCB123 board DTS files
- MAINTAINERS: Add entry for Microsemi MIPS SoCs
- enable crc32-mips on r6 configs
ath79:
- fix AR724X_PLL_REG_PCIE_CONFIG offset
BCM47xx:
- firmware: Use mac_pton() for MAC address parsing
- add Luxul XAP1500/XWR1750 WiFi LEDs
- use standard reset button for Luxul XWR-1750
BMIPS:
- enable CONFIG_BRCMSTB_PM in bmips_stb_defconfig for build coverage
- add STB PM, wake-up timer, watchdog DT nodes
Octeon:
- drop '.' after newlines in printk calls
ralink:
- pci-mt7621: Enable PCIe on MT7688"
* tag 'mips_4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips: (37 commits)
MIPS: BCM47XX: Use standard reset button for Luxul XWR-1750
MIPS: BCM47XX: Add Luxul XAP1500/XWR1750 WiFi LEDs
MIPS: Make the default for PHYSICAL_START always 64-bit
MIPS: Use the entry point from the ELF file header
MAINTAINERS: Add entry for Microsemi MIPS SoCs
MIPS: generic: Add support for Microsemi Ocelot
MIPS: mscc: Add ocelot PCB123 device tree
MIPS: mscc: Add ocelot dtsi
dt-bindings: mips: Add bindings for Microsemi SoCs
dt-bindings: Add vendor prefix for Microsemi Corporation
MIPS: ath79: Fix AR724X_PLL_REG_PCIE_CONFIG offset
MIPS: pci-mt7620: Enable PCIe on MT7688
MIPS: pm-cps: Block system suspend when a JTAG probe is present
MIPS: VDSO: Replace __mips_isa_rev with MIPS_ISA_REV
MIPS: BPF: Replace __mips_isa_rev with MIPS_ISA_REV
MIPS: cpu-features.h: Replace __mips_isa_rev with MIPS_ISA_REV
MIPS: Introduce isa-rev.h to define MIPS_ISA_REV
MIPS: Hang more efficiently on halt/powerdown/restart
FIRMWARE: bcm47xx_nvram: Replace mac address parsing
MIPS: BMIPS: Add Broadcom STB watchdog nodes
...
Merge tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"New features:
- Tom Zanussi's extended histogram work.
This adds the synthetic events to have histograms from multiple
event data Adds triggers "onmatch" and "onmax" to call the
synthetic events Several updates to the histogram code from this
- Allow way to nest ring buffer calls in the same context
- Allow absolute time stamps in ring buffer
- Rewrite of filter code parsing based on Al Viro's suggestions
- Setting of trace_clock to global if TSC is unstable (on boot)
- Better OOM handling when allocating large ring buffers
- Added initcall tracepoints (consolidated initcall_debug code with
them)
And other various fixes and clean ups"
* tag 'trace-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (68 commits)
init: Have initcall_debug still work without CONFIG_TRACEPOINTS
init, tracing: Have printk come through the trace events for initcall_debug
init, tracing: instrument security and console initcall trace events
init, tracing: Add initcall trace events
tracing: Add rcu dereference annotation for test func that touches filter->prog
tracing: Add rcu dereference annotation for filter->prog
tracing: Fixup logic inversion on setting trace_global_clock defaults
tracing: Hide global trace clock from lockdep
ring-buffer: Add set/clear_current_oom_origin() during allocations
ring-buffer: Check if memory is available before allocation
lockdep: Add print_irqtrace_events() to __warn
vsprintf: Do not preprocess non-dereferenced pointers for bprintf (%px and %pK)
tracing: Uninitialized variable in create_tracing_map_fields()
tracing: Make sure variable string fields are NULL-terminated
tracing: Add action comparisons when testing matching hist triggers
tracing: Don't add flag strings when displaying variable references
tracing: Fix display of hist trigger expressions containing timestamps
ftrace: Drop a VLA in module_exists()
tracing: Mention trace_clock=global when warning about unstable clocks
tracing: Default to using trace_global_clock if sched_clock is unstable
...
Merge tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This cycle was was not something I ever want to repeat as there were
several late changes that have only now just settled.
Half of the branch up to commit d2c997c0f145 ("fs, dax: use
page->mapping to warn...") have been in -next for several releases.
The of_pmem driver and the address range scrub rework were late
arrivals, and the dax work was scaled back at the last moment.
The of_pmem driver missed a previous merge window due to an oversight.
A sense of obligation to rectify that miss is why it is included for
4.17. It has acks from PowerPC folks. Stephen reported a build failure
that only occurs when merging it with your latest tree, for now I have
fixed that up by disabling modular builds of of_pmem. A test merge
with your tree has received a build success report from the 0day robot
over 156 configs.
An initial version of the ARS rework was submitted before the merge
window. It is self contained to libnvdimm, a net code reduction, and
passing all unit tests.
The filesystem-dax changes are based on the wait_var_event()
functionality from tip/sched/core. However, late review feedback
showed that those changes regressed truncate performance to a large
degree. The branch was rewound to drop the truncate behavior change
and now only includes preparation patches and cleanups (with full acks
and reviews). The finalization of this dax-dma-vs-trnucate work will
need to wait for 4.18.
Summary:
- A rework of the filesytem-dax implementation provides for detection
of unmap operations (truncate / hole punch) colliding with
in-progress device-DMA. A fix for these collisions remains a
work-in-progress pending resolution of truncate latency and
starvation regressions.
- The of_pmem driver expands the users of libnvdimm outside of x86
and ACPI to describe an implementation of persistent memory on
PowerPC with Open Firmware / Device tree.
- Address Range Scrub (ARS) handling is completely rewritten to
account for the fact that ARS may run for 100s of seconds and there
is no platform defined way to cancel it. ARS will now no longer
block namespace initialization.
- The NVDIMM Namespace Label implementation is updated to handle
label areas as small as 1K, down from 128K.
- Miscellaneous cleanups and updates to unit test infrastructure"
* tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
libnvdimm, of_pmem: workaround OF_NUMA=n build error
nfit, address-range-scrub: add module option to skip initial ars
nfit, address-range-scrub: rework and simplify ARS state machine
nfit, address-range-scrub: determine one platform max_ars value
powerpc/powernv: Create platform devs for nvdimm buses
doc/devicetree: Persistent memory region bindings
libnvdimm: Add device-tree based driver
libnvdimm: Add of_node to region and bus descriptors
libnvdimm, region: quiet region probe
libnvdimm, namespace: use a safe lookup for dimm device name
libnvdimm, dimm: fix dpa reservation vs uninitialized label area
libnvdimm, testing: update the default smart ctrl_temperature
libnvdimm, testing: Add emulation for smart injection commands
nfit, address-range-scrub: introduce nfit_spa->ars_state
libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
nfit, address-range-scrub: fix scrub in-progress reporting
dax, dm: allow device-mapper to operate without dax support
dax: introduce CONFIG_DAX_DRIVER
fs, dax: use page->mapping to warn if truncate collides with a busy page
ext2, dax: introduce ext2_dax_aops
...
Merge tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"This contains a few series that have been in preparation for a while
and that will help systems with RTCs that will fail in 2038, 2069 or
2100.
Subsystem:
- Add tracepoints
- Rework of the RTC/nvmem API to allow drivers to discard struct
nvmem_config after registration
- New range API, drivers can now expose the useful range of the RTC
- New offset API the core is now able to add an offset to the RTC
time, modifying the supported range.
- Multiple rtc_time64_to_tm fixes
- Handle time_t overflow on 32 bit platforms in the core instead of
letting drivers do crazy things.
- remove rtc_control API
New driver:
- Intersil ISL12026
Drivers:
- Drivers exposing the RTC non volatile memory have been converted to
use nvmem
- Removed useless time and date validation
- Removed an indirection pattern that was a cargo cult from ancient
drivers
- Removed VLA usage
- Fixed a possible race condition in probe functions
- AB8540 support is dropped from ab8500
- pcf85363 now has alarm support"
* tag 'rtc-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (128 commits)
rtc: snvs: Fix usage of snvs_rtc_enable
rtc: mt7622: fix module autoloading for OF platform drivers
rtc: isl12022: use true and false for boolean values
rtc: ab8500: Drop AB8540 support
rtc: remove a warning during scripts/kernel-doc step
rtc: 88pm860x: remove artificial limitation
rtc: 88pm80x: remove artificial limitation
rtc: st-lpc: remove artificial limitation
rtc: mrst: remove artificial limitation
rtc: mv: remove artificial limitation
rtc: hctosys: Ensure system time doesn't overflow time_t
parisc: time: stop validating rtc_time in .read_time
rtc: pcf85063: fix clearing bits in pcf85063_start_clock
rtc: at91sam9: Set name of regmap_config
rtc: s5m: Remove VLA usage
rtc: s5m: Move enum from rtc.h to rtc-s5m.c
rtc: remove VLA usage
rtc: Add useful timestamp definitions
rtc: Add one offset seconds to expand RTC range
rtc: Factor out the RTC range validation into rtc_valid_range()
...
Merge tag 'fbdev-v4.17' of git://github.com/bzolnier/linux
Pull fbdev updates from Bartlomiej Zolnierkiewicz:
"There is nothing really major here, just a couple of small bugfixes,
improvements and cleanups:
- make it possible to load radeonfb driver when offb driver is loaded
first (Mathieu Malaterre)
- fix memory leak in offb driver (Mathieu Malaterre)
- fix unaligned access in udlfb driver (Ladislav Michl)
- convert atmel_lcdfb driver to use GPIO descriptors (Ludovic
Desroches)
- avoid mismatched prototypes in sisfb driver (Arnd Bergmann)
- remove VLA usage from viafb driver (Gustavo A. R. Silva)
- add missing help text to FB_I810_I2 config option (Ulf Magnusson)
- misc fixes (Gustavo A. R. Silva, Colin Ian King, Markus Elfring)
- remove dead code from s3c-fb driver for Exynos and S5PV210
platforms
- misc cleanups (Corentin Labbe, Ladislav Michl, Ulf Magnusson,
Vladimir Zapolskiy, Markus Elfring)"
* tag 'fbdev-v4.17' of git://github.com/bzolnier/linux: (32 commits)
video: fbdev: s3c-fb: remove dead platform code for Exynos and S5PV210 platforms
video: au1100fb: Delete an unnecessary variable initialisation in au1100fb_drv_probe()
video: au1100fb: Improve a size determination in au1100fb_drv_probe()
video: au1100fb: Delete an error message for a failed memory allocation in au1100fb_drv_probe()
video/console/sticore: Delete an error message for a failed memory allocation in sti_try_rom_generic()
video: ARM CLCD: Improve a size determination in clcdfb_probe()
video: ARM CLCD: Delete an error message for a failed memory allocation in clcdfb_probe()
video: matroxfb: Delete an error message for a failed memory allocation in matroxfb_crtc2_probe()
video: s3c-fb: Improve a size determination in s3c_fb_probe()
video: s3c-fb: Delete an error message for a failed memory allocation in s3c_fb_probe()
video: fsl-diu-fb: Delete an error message for a failed memory allocation in fsl_diu_init()
video: ssd1307fb: Improve a size determination in ssd1307fb_probe()
video: smscufx: Delete an error message for a failed memory allocation in ufx_realloc_framebuffer()
video: smscufx: Return an error code only as a constant in ufx_realloc_framebuffer()
video: smscufx: Less checks in ufx_usb_probe() after error detection
video: udlfb: Return an error code only as a constant in dlfb_realloc_framebuffer()
video/fbdev/stifb: Delete an error message for a failed memory allocation in stifb_init_fb()
video/fbdev/stifb: Return -ENOMEM after a failed kzalloc() in stifb_init_fb()
video: fbdev: aty128fb: use true and false for boolean values
fbdev: aty: fix missing indentation in if statement
...
Merge tag 'sound-fix-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"The main purpose of this pull request is a fix for a regression in the
recent PCM OSS emulation code that may lead to RCU stall. Since
syzkaller hits this too often, I send the pull request now with a
minimal collection. Possibly another pull request may follow before
RC1.
The other fixes here are for USB-audio class 2 and 3 to improve the
parser for the clock descriptors. These are rather cleanups but good
for security, too.
Last but not least, another included fix is the trivial one to remove
superfluous WARN_ON() that annoyed syzbot"
* tag 'sound-fix-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: pcm: Remove WARN_ON() at snd_pcm_hw_params() error
ALSA: pcm: Fix endless loop for XRUN recovery in OSS emulation
ALSA: usb-audio: Add sanity checks in UAC3 clock parsers
ALSA: usb-audio: More strict sanity checks for clock parsers
ALSA: usb-audio: Refactor clock finder helpers
Merge tag 'media/v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"A series of media updates/fixes for 4.17.
There are two important core fix patches in this series:
- A regression fix on Kernel 4.16 with causes it to not work with
some input devices that depend on media core
- A fix at compat32 bits with causes it to OOPS on overlay, and
affects the Kernels where the CVE-2017-13166 was backported
The remaining ones are other random fixes at the documentation and on
drivers.
The biggest part of this series is a set of 18 patches for the Intel
atomisp driver. Currently, it produces hundreds of warnings/errors on
sparse/smatch, causing me to sometimes ignore new warnings on other
drivers that are not so broken. This driver is on really poor state,
even for staging standards: it has several layers of abstraction on
it, and it supports two different hardware. Selecting between them
require to add a define (there isn't even a Kconfig option for such
purpose). Just on this smatch cleanup, I could easily get rid of 8
"do-nothing" files. So, I'm seriously considering its removal from
upstream, if I don't see any real work on addressing the problems
there along this year"
* tag 'media/v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (48 commits)
media: v4l2-core: fix size of devnode_nums[] bitarray
media: v4l2-compat-ioctl32: don't oops on overlay
media: i2c: adv748x: afe: fix sparse warning
media: extended-controls.rst: transmitter -> receiver
media: staging: atomisp: stop duplicating input format types
media: staging: atomisp: get rid of an unused var
media: staging: atomisp: stop mixing enum types
media: staging: atomisp: get rid of some static warnings
media: staging: atomisp: use %p to print pointers
media: staging: atomisp: remove an useless check
media: staging: atomisp: avoid a warning if 32 bits build
media: staging: atomisp: don't access a NULL var
media: staging: atomisp: Get rid of *default.host.[ch]
media: staging: atomisp: get rid of an unused function
media: staging: atomisp: remove unused set_pd_base()
media: staging: atomisp: fix endianess issues
media: staging: atomisp: add a missing include
media: staging: atomisp: get rid of stupid statements
media: staging: atomisp: declare static vars as such
media: staging: atomisp: ia_css_output.host: don't use var before check
...
c6x depends on the macro '_BIG_ENDIAN' being defined or not
to correctly select or define endian-specific macros, structures
or pieces of code.
This macro is predefined by the compiler but sparse knows nothing
about it and thus may pre-process files differently from what
gcc would.
Fix this by passing '-D_BIG_ENDIAN' when compiling a big-endian
kernel, like GCC would have done.
To: Mark Salter <msalter@redhat.com>
To: Aurelien Jacquiot <a-jacquiot@ti.com> CC: linux-c6x-dev@linux-c6x.org Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Mark Salter <msalter@redhat.com>
Fix build error reported by the 0day bot by including the header
file for that macro.
Fixes this build error: (should fix; not tested)
arch/c6x/platforms/plldata.c: In function 'c6472_setup_clocks':
arch/c6x/platforms/plldata.c:279:33: error: implicit declaration of function 'get_coreid'; did you mean 'get_order'? [-Werror=implicit-function-declaration]
c6x_core_clk.parent = &sysclks[get_coreid() + 1];
Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <jacquiot.aurelien@gmail.com> Cc: linux-c6x-dev@linux-c6x.org Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Mark Salter <msalter@redhat.com>
1) The sockmap code has to free socket memory on close if there is
corked data, from John Fastabend.
2) Tunnel names coming from userspace need to be length validated. From
Eric Dumazet.
3) arp_filter() has to take VRFs properly into account, from Miguel
Fadon Perlines.
4) Fix oops in error path of tcf_bpf_init(), from Davide Caratti.
5) Missing idr_remove() in u32_delete_key(), from Cong Wang.
6) More syzbot stuff. Several use of uninitialized value fixes all
over, from Eric Dumazet.
7) Do not leak kernel memory to userspace in sctp, also from Eric
Dumazet.
8) Discard frames from unused ports in DSA, from Andrew Lunn.
9) Fix DMA mapping and reset/failover problems in ibmvnic, from Thomas
Falcon.
10) Do not access dp83640 PHY registers prematurely after reset, from
Esben Haabendal.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (46 commits)
vhost-net: set packet weight of tx polling to 2 * vq size
net: thunderx: rework mac addresses list to u64 array
inetpeer: fix uninit-value in inet_getpeer
dp83640: Ensure against premature access to PHY registers after reset
devlink: convert occ_get op to separate registration
ARM: dts: ls1021a: Specify TBIPA register address
net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
ibmvnic: Do not reset CRQ for Mobility driver resets
ibmvnic: Fix failover case for non-redundant configuration
ibmvnic: Fix reset scheduler error handling
ibmvnic: Zero used TX descriptor counter on reset
ibmvnic: Fix DMA mapping mistakes
tipc: use the right skb in tipc_sk_fill_sock_diag()
sctp: sctp_sockaddr_af must check minimal addr length for AF_INET6
net: dsa: Discard frames from unused ports
sctp: do not leak kernel memory to user space
soreuseport: initialise timewait reuseport field
ipv4: fix uninit-value in ip_route_output_key_hash_rcu()
dccp: initialize ireq->ir_mark
net: fix uninit-value in __hw_addr_add_ex()
...
Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs namei updates from Al Viro:
- make lookup_one_len() safe with parent locked only shared(incoming
afs series wants that)
- fix of getname_kernel() regression from 2015 (-stable fodder, that
one).
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
getname_kernel() needs to make sure that ->name != ->iname in long case
make lookup_one_len() safe to use with directory locked shared
new helper: __lookup_slow()
merge common parts of lookup_one_len{,_unlocked} into common helper
Merge tag 'for-linus-4.17-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
"Fixes and cleanups:
- Documentation cleanups
- removal of unused code
- make some structs static
- implement Orangefs vm_operations fault callout
- eliminate two single-use functions and put their cleaned up code in
line.
- replace a vmalloc/memset instance with vzalloc
- fix a race condition bug in wait code"
* tag 'for-linus-4.17-ofs' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
Orangefs: documentation updates
orangefs: document package install and xfstests procedure
orangefs: remove unused code
orangefs: make several *_operations structs static
orangefs: implement vm_ops->fault
orangefs: open code short single-use functions
orangefs: replace vmalloc and memset with vzalloc
orangefs: bug fix for a race condition when getting a slot
Stephen Smalley [Mon, 9 Apr 2018 18:36:05 +0000 (14:36 -0400)]
selinux: fix missing dput() before selinuxfs unmount
Commit 0619f0f5e36f ("selinux: wrap selinuxfs state") triggers a BUG
when SELinux is runtime-disabled (i.e. systemd or equivalent disables
SELinux before initial policy load via /sys/fs/selinux/disable based on
/etc/selinux/config SELINUX=disabled).
This does not manifest if SELinux is disabled via kernel command line
argument or if SELinux is enabled (permissive or enforcing).
Before:
SELinux: Disabled at runtime.
BUG: Dentry 000000006d77e5c7{i=17,n=null} still in use (1) [unmount of selinuxfs selinuxfs]
PPC:
- improvements for the radix page fault handler for HV KVM on POWER9
s390:
- more kvm stat counters
- virtio gpu plumbing
- documentation
- facilities improvements
x86:
- support for VMware magic I/O port and pseudo-PMCs
- AMD pause loop exiting
- support for AMD core performance extensions
- support for synchronous register access
- expose nVMX capabilities to userspace
- support for Hyper-V signaling via eventfd
- use Enlightened VMCS when running on Hyper-V
- allow userspace to disable MWAIT/HLT/PAUSE vmexits
- usual roundup of optimizations and nested virtualization bugfixes
Generic:
- API selftest infrastructure (though the only tests are for x86 as
of now)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (174 commits)
kvm: x86: fix a prototype warning
kvm: selftests: add sync_regs_test
kvm: selftests: add API testing infrastructure
kvm: x86: fix a compile warning
KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
KVM: X86: Introduce handle_ud()
KVM: vmx: unify adjacent #ifdefs
x86: kvm: hide the unused 'cpu' variable
KVM: VMX: remove bogus WARN_ON in handle_ept_misconfig
Revert "KVM: X86: Fix SMRAM accessing even if VM is shutdown"
kvm: Add emulation for movups/movupd
KVM: VMX: raise internal error for exception during invalid protected mode state
KVM: nVMX: Optimization: Dont set KVM_REQ_EVENT when VMExit with nested_run_pending
KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending
KVM: x86: Fix misleading comments on handling pending exceptions
KVM: x86: Rename interrupt.pending to interrupt.injected
KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt
x86/kvm: use Enlightened VMCS when running on Hyper-V
x86/hyper-v: detect nested features
x86/hyper-v: define struct hv_enlightened_vmcs and clean field bits
...
Fix subtle macro variable shadowing in min_not_zero()
Commit 3c8ba0d61d04 ("kernel.h: Retain constant expression output for
max()/min()") rewrote our min/max macros to be very clever, but in the
meantime resurrected a variable name shadow issue that we had had
previously fixed in commit 589a9785ee3a ("min/max: remove sparse
warnings when they're nested").
That commit talks about the sparse warnings that this shadowing causes,
which we ignored as just a minor annoyance. But it turns out that the
sparse warning is the least of our problems. We actually have a real
bug due to the shadowing through the interaction with "min_not_zero()",
which ends up doing
min(__x, __y)
internally, and then the new declaration of "__x" and "__y" as new
variables in __cmp_once() results in a complete mess of an expression,
and "min_not_zero()" doesn't work at all.
For some odd reason, this only ever caused (reported) problems on s390,
even though it is a generic issue and most of the (obviously successful)
testing of the problematic commit had happened on other architectures.
Quoting Sebastian Ott:
"What happened is that the bio build by the partition detection code
was attempted to be split by the block layer because the block queue
had a max_sector setting of 0. blk_queue_max_hw_sectors uses
min_not_zero."
So re-introduce the use of __UNIQUE_ID() to make sure that the min/max
macros do not have these kinds of clashes.
[ That said, __UNIQUE_ID() itself has several issues that make it less
than wonderful.
In particular, the "uniqueness" has a fallback on the line number,
which means that it's not actually unique in more complex cases if you
don't build with gcc or clang (which have working unique counters that
aren't tied to line numbers).
That historical broken fallback also means that we have that pointless
"prefix" argument that doesn't actually make much sense _except_ for
the known-broken case. Oh well. ]
Fixes: 3c8ba0d61d04 ("kernel.h: Retain constant expression output for max()/min()") Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'for-linus-sa1100' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM SA1100 updates from Russell King:
"We have support for arbitary MMIO registers providing platform GPIOs,
which allows us to abstract some of the SA11x0 CF support.
This set of updates makes that change"
* 'for-linus-sa1100' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: sa1100/simpad: switch simpad CF to use gpiod APIs
ARM: sa1100/shannon: convert to generic CF sockets
ARM: sa1100/nanoengine: convert to generic CF sockets
ARM: sa1100/h3xxx: switch h3xxx PCMCIA to use gpiod APIs
ARM: sa1100/cerf: convert to generic CF sockets
ARM: sa1100/assabet: convert to generic CF sockets
ARM: sa1100: provide infrastructure to support generic CF sockets
pcmcia: sa1100: provide generic CF support
Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:
"A number of core ARM changes:
- Refactoring linker script by Nicolas Pitre
- Enable source fortification
- Add support for Cortex R8"
* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: decompressor: fix warning introduced in fortify patch
ARM: 8751/1: Add support for Cortex-R8 processor
ARM: 8749/1: Kconfig: Add ARCH_HAS_FORTIFY_SOURCE
ARM: simplify and fix linker script for TCM
ARM: linker script: factor out TCM bits
ARM: linker script: factor out vectors and stubs
ARM: linker script: factor out unwinding table sections
ARM: linker script: factor out stuff for the .text section
ARM: linker script: factor out stuff for the DISCARD section
ARM: linker script: factor out some common definitions between XIP and non-XIP
Stephen reports that an x86 allmodconfig build fails to build the
of_pmem driver due to a missing definition of of_node_to_nid(). That
helper is currently only exported in the OF_NUMA=y case. In other cases,
ppc and sparc, it is a weak symbol, and outside of those platforms it is
a static inline.
Until an OF_NUMA=n configuration can reliably support usage of
of_node_to_nid() in modules across architectures, mark this driver as
'bool' instead of 'tristate'.
Cc: Rob Herring <robh@kernel.org> Cc: Oliver O'Halloran <oohall@gmail.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
- Improvements for the spectre defense:
* The spectre related code is consolidated to a single file
nospec-branch.c
* Automatic enable/disable for the spectre v2 defenses (expoline vs.
nobp)
* Syslog messages for specve v2 are added
* Enable CONFIG_GENERIC_CPU_VULNERABILITIES and define the attribute
functions for spectre v1 and v2
- Add helper macros for assembler alternatives and use them to shorten
the code in entry.S.
- Add support for persistent configuration data via the SCLP Store Data
interface. The H/W interface requires a page table that uses 4K pages
only, the code to setup such an address space is added as well.
- Enable virtio GPU emulation in QEMU. To do this the depends
statements for a few common Kconfig options are modified.
- Add support for format-3 channel path descriptors and add a binary
sysfs interface to export the associated utility strings.
- Add a sysfs attribute to control the IFCC handling in case of
constant channel errors.
- The vfio-ccw changes from Cornelia.
- Bug fixes and cleanups.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (40 commits)
s390/kvm: improve stack frame constants in entry.S
s390/lpp: use assembler alternatives for the LPP instruction
s390/entry.S: use assembler alternatives
s390: add assembler macros for CPU alternatives
s390: add sysfs attributes for spectre
s390: report spectre mitigation via syslog
s390: add automatic detection of the spectre defense
s390: move nobp parameter functions to nospec-branch.c
s390/cio: add util_string sysfs attribute
s390/chsc: query utility strings via fmt3 channel path descriptor
s390/cio: rename struct channel_path_desc
s390/cio: fix unbind of io_subchannel_driver
s390/qdio: split up CCQ handling for EQBS / SQBS
s390/qdio: don't retry EQBS after CCQ 96
s390/qdio: restrict buffer merging to eligible devices
s390/qdio: don't merge ERROR output buffers
s390/qdio: simplify math in get_*_buffer_frontier()
s390/decompressor: trim uncompressed image head during the build
s390/crypto: Fix kernel crash on aes_s390 module remove.
s390/defkeymap: fix global init to zero
...
ALSA: pcm: Remove WARN_ON() at snd_pcm_hw_params() error
snd_pcm_hw_params() (more exactly snd_pcm_hw_params_choose()) contains
a check of the return error from snd_pcm_hw_param_first() and _last()
with snd_BUG_ON() -- i.e. it may trigger WARN_ON() depending on the
kconfig.
This was a valid check in the past, as these functions shouldn't
return any error if the parameters have been already refined via
snd_pcm_hw_refine() beforehand. However, the recent rewrite
introduced a kmalloc() in snd_pcm_hw_refine() for removing VLA, and
this brought a possibility to trigger an error. As a result, syzbot
caught lots of superfluous kernel WARN_ON() and paniced via fault
injection.
As the WARN_ON() is no longer valid with the introduction of
kmalloc(), let's drop snd_BUG_ON() check, in order to make the world
peaceful place again.
vhost-net: set packet weight of tx polling to 2 * vq size
handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy
polling udp packets with small length(e.g. 1byte udp payload), because setting
VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length.
Ping-Latencies shown below were tested between two Virtual Machines using
netperf (UDP_STREAM, len=1), and then another machine pinged the client:
Seems pretty consistent, a small dip at 2 VQ sizes.
Ring size is a hint from device about a burst size it can tolerate. Based on
benchmarks, set the weight to 2 * vq size.
To evaluate this change, another tests were done using netperf(RR, TX) between
two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
tweaked through qemu. Results shown below does not show obvious changes.
Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com> Signed-off-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: thunderx: rework mac addresses list to u64 array
It is too expensive to pass u64 values via linked list, instead
allocate array for them by overall number of mac addresses from netdev.
This eventually removes multiple kmalloc() calls, aviod memory
fragmentation and allow to put single null check on kmalloc
return value in order to prevent a potential null pointer dereference.
Addresses-Coverity-ID: 1467429 ("Dereference null return value") Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This is a false positive, because the inetpeer wont be erased
from rb-tree if the refcount_dec_if_one(&p->refcnt) does not
succeed. And this wont happen before first inet_putpeer() call
for this inetpeer has been done, and ->dtime field is written
exactly before the refcount_dec_and_test(&p->refcnt).
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
dp83640: Ensure against premature access to PHY registers after reset
The datasheet specifies a 3uS pause after performing a software
reset. The default implementation of genphy_soft_reset() does not
provide this, so implement soft_reset with the needed pause.
Signed-off-by: Esben Haabendal <eha@deif.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Two sockmap fixes: i) fix a potential warning when a socket with
pending cork data is closed by freeing the memory right when the
socket is closed instead of seeing still outstanding memory at
garbage collector time, ii) fix a NULL pointer deref in case of
duplicates release calls, so make sure to only reset the sk_prot
pointer when it's in a valid state to do so, both from John.
2) Fix a compilation warning in bpf_prog_attach_check_attach_type()
by moving the function under CONFIG_CGROUP_BPF ifdef since only
used there, from Anders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
devlink: convert occ_get op to separate registration
This resolves race during initialization where the resources with
ops are registered before driver and the structures used by occ_get
op is initialized. So keep occ_get callbacks registered only when
all structs are initialized.
The example flows, as it is in mlxsw:
1) driver load/asic probe:
mlxsw_core
-> mlxsw_sp_resources_register
-> mlxsw_sp_kvdl_resources_register
-> devlink_resource_register IDX
mlxsw_spectrum
-> mlxsw_sp_kvdl_init
-> mlxsw_sp_kvdl_parts_init
-> mlxsw_sp_kvdl_part_init
-> devlink_resource_size_get IDX (to get the current setup
size from devlink)
-> devlink_resource_occ_get_register IDX (register current
occupancy getter)
2) reload triggered by devlink command:
-> mlxsw_devlink_core_bus_device_reload
-> mlxsw_sp_fini
-> mlxsw_sp_kvdl_fini
-> devlink_resource_occ_get_unregister IDX
(struct mlxsw_sp *mlxsw_sp is freed at this point, call to occ get
which is using mlxsw_sp would cause use-after free)
-> mlxsw_sp_init
-> mlxsw_sp_kvdl_init
-> mlxsw_sp_kvdl_parts_init
-> mlxsw_sp_kvdl_part_init
-> devlink_resource_size_get IDX (to get the current setup
size from devlink)
-> devlink_resource_occ_get_register IDX (register current
occupancy getter)
Fixes: d9f9b9a4d05f ("devlink: Add support for resource abstraction") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The current (mildly evil) fsl_pq_mdio code uses an undocumented shadow of
the TBIPA register on LS1021A, which happens to be read-only.
Changing TBI PHY address therefore does not work on LS1021A.
The real (and documented) address of the TBIPA registere lies in the eTSEC
block and not in MDIO/MII, which is read/write, so using that fixes
the problem.
Signed-off-by: Esben Haabendal <eha@deif.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/fsl_pq_mdio: Allow explicit speficition of TBIPA address
This introduces a simpler and generic method for for finding (and mapping)
the TBIPA register.
Instead of relying of complicated logic for finding the TBIPA register
address based on the MDIO or MII register block base
address, which even in some cases relies on undocumented shadow registers,
a second "reg" entry for the mdio bus devicetree node specifies the TBIPA
register.
Backwards compatibility is kept, as the existing logic is applied when
only a single "reg" mapping is specified.
Signed-off-by: Esben Haabendal <eha@deif.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
ibmvnic: Fix driver reset and DMA bugs
This patch series introduces some fixes to the driver reset
routines and a patch that fixes mistakes caught by the kernel
DMA debugger.
The reset fixes include a fix to reset TX queue counters properly
after a reset as well as updates to driver reset error-handling code.
It also provides updates to the reset handling routine for redundant
backing VF failover and partition migration cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ibmvnic: Do not reset CRQ for Mobility driver resets
When resetting the ibmvnic driver after a partition migration occurs
there is no requirement to do a reset of the main CRQ. The current
driver code does the required re-enable of the main CRQ, then does
a reset of the main CRQ later.
What we should be doing for a driver reset after a migration is to
re-enable the main CRQ, release all the sub-CRQs, and then allocate
new sub-CRQs after capability negotiation.
This patch updates the handling of mobility resets to do the proper
work and not reset the main CRQ. To do this the initialization/reset
of the main CRQ had to be moved out of the ibmvnic_init routine
and in to the ibmvnic_probe and do_reset routines.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Fri, 6 Apr 2018 23:37:05 +0000 (18:37 -0500)]
ibmvnic: Fix failover case for non-redundant configuration
There is a failover case for a non-redundant pseries VNIC
configuration that was not being handled properly. The current
implementation assumes that the driver will always have a redandant
device to communicate with following a failover notification. There
are cases, however, when a non-redundant configuration can receive
a failover request. If that happens, the driver should wait until
it receives a signal that the device is ready for operation.
The driver is agnostic of its backing hardware configuration,
so this fix necessarily affects all device failover management.
The driver needs to wait until it receives a signal that the device
is ready for resetting. A flag is introduced to track this intermediary
state where the driver is waiting for an active device.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Fri, 6 Apr 2018 23:37:04 +0000 (18:37 -0500)]
ibmvnic: Fix reset scheduler error handling
In some cases, if the driver is waiting for a reset following
a device parameter change, failure to schedule a reset can result
in a hang since a completion signal is never sent.
If the device configuration is being altered by a tool such
as ethtool or ifconfig, it could cause the console to hang
if the reset request does not get scheduled. Add some additional
error handling code to exit the wait_for_completion if there is
one in progress.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Fri, 6 Apr 2018 23:37:03 +0000 (18:37 -0500)]
ibmvnic: Zero used TX descriptor counter on reset
The counter that tracks used TX descriptors pending completion
needs to be zeroed as part of a device reset. This change fixes
a bug causing transmit queues to be stopped unnecessarily and in
some cases a transmit queue stall and timeout reset. If the counter
is not reset, the remaining descriptors will not be "removed",
effectively reducing queue capacity. If the queue is over half full,
it will cause the queue to stall if stopped.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Fri, 6 Apr 2018 23:37:02 +0000 (18:37 -0500)]
ibmvnic: Fix DMA mapping mistakes
Fix some mistakes caught by the DMA debugger. The first change
fixes a unnecessary unmap that should have been removed in an
earlier update. The next hunk fixes another bad unmap by zeroing
the bit checked to determine that an unmap is needed. The final
change fixes some buffers that are unmapped with the wrong
direction specified.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Sat, 7 Apr 2018 01:54:52 +0000 (18:54 -0700)]
tipc: use the right skb in tipc_sk_fill_sock_diag()
Commit 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
tried to fix the crash but failed, the crash is still 100% reproducible
with it.
In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, it is not
correct to retrieve its NETLINK_CB(), instead, like other protocol diag,
we should use NETLINK_CB(cb->skb).sk here.
Reported-by: <syzbot+326e587eff1074657718@syzkaller.appspotmail.com> Fixes: 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag") Fixes: c30b70deb5f4 (tipc: implement socket diagnostics for AF_TIPC) Cc: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Local variable description: ----address@SYSC_bind
Variable was created at:
SYSC_bind+0x6f/0x4b0 net/socket.c:1461
SyS_bind+0x54/0x80 net/socket.c:1460
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 7 Apr 2018 18:37:40 +0000 (20:37 +0200)]
net: dsa: Discard frames from unused ports
The Marvell switches under some conditions will pass a frame to the
host with the port being the CPU port. Such frames are invalid, and
should be dropped. Not dropping them can result in a crash when
incrementing the receive statistics for an invalid port.
Reported-by: Chris Healy <cphealy@gmail.com> Fixes: 91da11f870f0 ("net: Distributed Switch Architecture protocol support") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Local variable description: ----addr@___sys_recvmsg
Variable was created at:
___sys_recvmsg+0xd5/0x810 net/socket.c:2172
__sys_recvmmsg+0x54e/0xdb0 net/socket.c:2313
Bytes 8-15 of 16 are uninitialized
==================================================================
Kernel panic - not syncing: panic_on_warn set ...
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: da5e36308d9f ("soreuseport: TCP/IPv4 implementation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Local variable description: ----res.i.i@ip_route_output_flow
Variable was created at:
ip_route_output_flow+0x75/0x3c0 net/ipv4/route.c:2576
raw_sendmsg+0x1861/0x3ed0 net/ipv4/raw.c:653
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: f001fde5eadd ("net: introduce a list of device addresses dev_addr_list (v6)") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 7 Apr 2018 20:42:39 +0000 (13:42 -0700)]
net: initialize skb->peeked when cloning
syzbot reported __skb_try_recv_from_queue() was using skb->peeked
while it was potentially unitialized.
We need to clear it in __skb_clone()
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 7 Apr 2018 20:42:38 +0000 (13:42 -0700)]
net: fix rtnh_ok()
syzbot reported :
BUG: KMSAN: uninit-value in rtnh_ok include/net/nexthop.h:11 [inline]
BUG: KMSAN: uninit-value in fib_count_nexthops net/ipv4/fib_semantics.c:469 [inline]
BUG: KMSAN: uninit-value in fib_create_info+0x554/0x8d20 net/ipv4/fib_semantics.c:1091
@remaining is an integer, coming from user space.
If it is negative we want rtnh_ok() to return false.
Fixes: 4e902c57417c ("[IPv4]: FIB configuration using struct fib_config") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 7 Apr 2018 20:42:37 +0000 (13:42 -0700)]
netlink: fix uninit-value in netlink_sendmsg
syzbot reported :
BUG: KMSAN: uninit-value in ffs arch/x86/include/asm/bitops.h:432 [inline]
BUG: KMSAN: uninit-value in netlink_sendmsg+0xb26/0x1310 net/netlink/af_netlink.c:1851
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 7 Apr 2018 20:42:36 +0000 (13:42 -0700)]
crypto: af_alg - fix possible uninit-value in alg_bind()
syzbot reported :
BUG: KMSAN: uninit-value in alg_bind+0xe3/0xd90 crypto/af_alg.c:162
We need to check addr_len before dereferencing sa (or uaddr)
Fixes: bb30b8848c85 ("crypto: af_alg - whitelist mask and type") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Stephan Mueller <smueller@chronox.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge branch 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull integrity updates from James Morris:
"A mixture of bug fixes, code cleanup, and continues to close
IMA-measurement, IMA-appraisal, and IMA-audit gaps.
Also note the addition of a new cred_getsecid LSM hook by Matthew
Garrett:
For IMA purposes, we want to be able to obtain the prepared secid
in the bprm structure before the credentials are committed. Add a
cred_getsecid hook that makes this possible.
which is used by a new CREDS_CHECK target in IMA:
In ima_bprm_check(), check with both the existing process
credentials and the credentials that will be committed when the new
process is started. This will not change behaviour unless the
system policy is extended to include CREDS_CHECK targets -
BPRM_CHECK will continue to check the same credentials that it did
previously"
* 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
ima: Fallback to the builtin hash algorithm
ima: Add smackfs to the default appraise/measure list
evm: check for remount ro in progress before writing
ima: Improvements in ima_appraise_measurement()
ima: Simplify ima_eventsig_init()
integrity: Remove unused macro IMA_ACTION_RULE_FLAGS
ima: drop vla in ima_audit_measurement()
ima: Fix Kconfig to select TPM 2.0 CRB interface
evm: Constify *integrity_status_msg[]
evm: Move evm_hmac and evm_hash from evm_main.c to evm_crypto.c
fuse: define the filesystem as untrusted
ima: fail signature verification based on policy
ima: clear IMA_HASH
ima: re-evaluate files on privileged mounted filesystems
ima: fail file signature verification on non-init mounted filesystems
IMA: Support using new creds in appraisal policy
security: Add a cred_getsecid hook
Merge branch 'next-tpm' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull TPM updates from James Morris:
"This release contains only bug fixes. There are no new major features
added"
* 'next-tpm' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
tpm: fix intermittent failure with self tests
tpm: add retry logic
tpm: self test failure should not cause suspend to fail
tpm2: add longer timeouts for creation commands.
tpm_crb: use __le64 annotated variable for response buffer address
tpm: fix buffer type in tpm_transmit_cmd
tpm: tpm-interface: fix tpm_transmit/_cmd kdoc
tpm: cmd_ready command can be issued only after granting locality
Sinan Kaya [Mon, 2 Apr 2018 17:48:00 +0000 (13:48 -0400)]
alpha: io: reorder barriers to guarantee writeX() and iowriteX() ordering
memory-barriers.txt has been updated with the following requirement.
"When using writel(), a prior wmb() is not needed to guarantee that the
cache coherent memory writes have completed before writing to the MMIO
region."
Current writeX() and iowriteX() implementations on alpha are not
satisfying this requirement as the barrier is after the register write.
Move mb() in writeX() and iowriteX() functions to guarantee that HW
observes memory changes before performing register operations.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Matt Turner <mattst88@gmail.com>
Michael Cree [Mon, 26 Feb 2018 09:02:12 +0000 (22:02 +1300)]
alpha: Implement CPU vulnerabilities sysfs functions.
Implement the CPU vulnerabilty show functions for meltdown, spectre_v1
and spectre_v2 on Alpha.
Tests on XP1000 (EV67/667MHz) and ES45 (EV68CB/1.25GHz) show them
to be vulnerable to Meltdown and Spectre V1. In the case of
Meltdown I saw a 1 to 2% success rate in reading bytes on the
XP1000 and 50 to 60% success rate on the ES45. (This compares to
99.97% success reported for Intel CPUs.) Report EV6 and later
CPUs as vulnerable.
Tests on PWS600au (EV56/600MHz) for Spectre V1 attack were
unsuccessful (though I did not try particularly hard) so mark EV4
through to EV56 as not vulnerable.
Signed-off-by: Michael Cree <mcree@orcon.net.nz> Signed-off-by: Matt Turner <mattst88@gmail.com>
Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull alpha syscall cleanups from Al Viro:
"A couple of SYSCALL_DEFINE conversions and removal of pointless (and
bitrotted) piece stuck in ret_from_kernel_thread since the
kernel_exceve/kernel_thread conversions six years ago"
* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
alpha: get rid of pointless insn in ret_from_kernel_thread
alpha: switch pci syscalls to SYSCALL_DEFINE
Merge branch 'misc.sparc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sparc syscall cleanups from Al Viro:
"sparc syscall stuff - killing pointless wrappers, conversions to
{COMPAT_,}SYSCALL_DEFINE"
* 'misc.sparc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sparc: get rid of asm wrapper for nis_syscall()
sparc: switch compat {f,}truncate64() to COMPAT_SYSCALL_DEFINE
sparc: switch compat pread64 and pwrite64 to COMPAT_SYSCALL_DEFINE
convert compat sync_file_range() to COMPAT_SYSCALL_DEFINE
switch sparc_remap_file_pages() to SYSCALL_DEFINE
sparc: get rid of memory_ordering(2) wrapper
sparc: trivial conversions to {COMPAT_,}SYSCALL_DEFINE()
sparc: bury a zombie extern that had been that way for twenty years
sparc: get rid of remaining SIGN... wrappers
sparc: kill useless SIGN... wrappers
sparc: get rid of sys_sparc_pipe() wrappers
treewide: fix up files incorrectly marked executable
Joe Perches noted that we have a few source files that for some
inexplicable reason (read: I'm too lazy to even go look at the history)
are marked executable:
Merge branch 'i2c/for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
-I2C core now reports proper OF style module alias. I'd like to repeat
the note from the commit msg here (Thanks, Javier!):
NOTE: This patch may break out-of-tree drivers that were relying
on this behavior, and only had an I2C device ID table even
when the device was registered via OF.
There are no remaining drivers in mainline that do this, but
out-of-tree drivers have to be fixed and define a proper OF
device ID table to have module auto-loading working.
- new driver for the SynQuacer I2C controller
- major refactoring of the QUP driver
- the piix4 driver now uses request_muxed_region which should fix a
long standing resource conflict with the sp5100_tco watchdog
- a bunch of small core & driver improvements
* 'i2c/for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (53 commits)
i2c: add support for Socionext SynQuacer I2C controller
dt-bindings: i2c: add binding for Socionext SynQuacer I2C
i2c: Update i2c_trace_msg static key to modern api
i2c: fix parameter of trace_i2c_result
i2c: imx: avoid taking clk_prepare mutex in PM callbacks
i2c: imx: use clk notifier for rate changes
i2c: make i2c_check_addr_validity() static
i2c: rcar: fix mask value of prohibited bit
dt-bindings: i2c: document R8A77965 bindings
i2c: pca-platform: drop gpio from platform data
i2c: pca-platform: use device_property_read_u32
i2c: pca-platform: unconditionally use devm_gpiod_get_optional
sh: sh7785lcr: add GPIO lookup table for i2c controller reset
i2c: qup: reorganization of driver code to remove polling for qup v2
i2c: qup: reorganization of driver code to remove polling for qup v1
i2c: qup: send NACK for last read sub transfers
i2c: qup: fix buffer overflow for multiple msg of maximum xfer len
i2c: qup: change completion timeout according to transfer length
i2c: qup: use the complete transfer length to choose DMA mode
i2c: qup: proper error handling for i2c error in BAM mode
...
Merge tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for 4PB user address space on 64-bit, opt-in via mmap().
- Removal of POWER4 support, which was accidentally broken in 2016
and no one noticed, and blocked use of some modern instructions.
- Workarounds so that the hypervisor can enable Transactional Memory
on Power9.
- A series to disable the DAWR (Data Address Watchpoint Register) on
Power9.
- More information displayed in the meltdown/spectre_v1/v2 sysfs
files.
- A vpermxor (Power8 Altivec) implementation for the raid6 Q
Syndrome.
- A big series to make the allocation of our pacas (per cpu area),
kernel page tables, and per-cpu stacks NUMA aware when using the
Radix MMU on Power9.
And as usual many fixes, reworks and cleanups.
Thanks to: Aaro Koskinen, Alexandre Belloni, Alexey Kardashevskiy,
Alistair Popple, Andy Shevchenko, Aneesh Kumar K.V, Anshuman Khandual,
Balbir Singh, Benjamin Herrenschmidt, Christophe Leroy, Christophe
Lombard, Cyril Bur, Daniel Axtens, Dave Young, Finn Thain, Frederic
Barrat, Gustavo Romero, Horia Geantă, Jonathan Neuschäfer, Kees Cook,
Larry Finger, Laurent Dufour, Laurent Vivier, Logan Gunthorpe,
Madhavan Srinivasan, Mark Greer, Mark Hairgrove, Markus Elfring,
Mathieu Malaterre, Matt Brown, Matt Evans, Mauricio Faria de Oliveira,
Michael Neuling, Naveen N. Rao, Nicholas Piggin, Paul Mackerras,
Philippe Bergheaud, Ram Pai, Rob Herring, Sam Bobroff, Segher
Boessenkool, Simon Guo, Simon Horman, Stewart Smith, Sukadev
Bhattiprolu, Suraj Jitindar Singh, Thiago Jung Bauermann, Vaibhav
Jain, Vaidyanathan Srinivasan, Vasant Hegde, Wei Yongjun"
* tag 'powerpc-4.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (207 commits)
powerpc/64s/idle: Fix restore of AMOR on POWER9 after deep sleep
powerpc/64s: Fix POWER9 DD2.2 and above in cputable features
powerpc/64s: Fix pkey support in dt_cpu_ftrs, add CPU_FTR_PKEY bit
powerpc/64s: Fix dt_cpu_ftrs to have restore_cpu clear unwanted LPCR bits
Revert "powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead"
powerpc: iomap.c: introduce io{read|write}64_{lo_hi|hi_lo}
powerpc: io.h: move iomap.h include so that it can use readq/writeq defs
cxl: Fix possible deadlock when processing page faults from cxllib
powerpc/hw_breakpoint: Only disable hw breakpoint if cpu supports it
powerpc/mm/radix: Update command line parsing for disable_radix
powerpc/mm/radix: Parse disable_radix commandline correctly.
powerpc/mm/hugetlb: initialize the pagetable cache correctly for hugetlb
powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix
powerpc/mm/keys: Update documentation and remove unnecessary check
powerpc/64s/idle: POWER9 ESL=0 stop avoid save/restore overhead
powerpc/64s/idle: Consolidate power9_offline_stop()/power9_idle_stop()
powerpc/powernv: Always stop secondaries before reboot/shutdown
powerpc: hard disable irqs in smp_send_stop loop
powerpc: use NMI IPI for smp_send_stop
powerpc/powernv: Fix SMT4 forcing idle code
...
Merge tag 'leaks-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tobin/leaks
Pull leaking-addresses updates from Tobin Harding:
"This set represents improvements to the scripts/leaking_addresses.pl
script.
The major improvement is that with this set applied the script
actually runs in a reasonable amount of time (less than a minute on a
standard stock Ubuntu user desktop). Also, we have a second maintainer
now and a tree hosted on kernel.org
We do a few code clean ups. We fix the command help output. Handling
of the vsyscall address range is fixed to check the whole range
instead of just the start/end addresses. We add support for 5 page
table levels (suggested on LKML). We use a system command to get the
machine architecture instead of using Perl. Calling this command for
every regex comparison is what previously choked the script, caching
the result of this call gave the major speed improvement. We add
support for scanning 32-bit kernels using the user/kernel memory
split. Path skipping code refactored and simplified (meaning easier
script configuration). We remove version numbering. We add a variable
name to improve readability of a regex and finally we check filenames
for leaking addresses.
Currently script scans /proc/PID for all PID. With this set applied we
only scan for PID==1. It was observed that on an idle system files
under /proc/PID are predominantly the same for all processes. Also it
was noted that the script does not scan _all_ the kernel since it only
scans active processes. Scanning only for PID==1 makes explicit the
inherent flaw in the script that the scan is only partial and also
speeds things up"
* tag 'leaks-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tobin/leaks:
MAINTAINERS: Update LEAKING_ADDRESSES
leaking_addresses: check if file name contains address
leaking_addresses: explicitly name variable used in regex
leaking_addresses: remove version number
leaking_addresses: skip '/proc/1/syscall'
leaking_addresses: skip all /proc/PID except /proc/1
leaking_addresses: cache architecture name
leaking_addresses: simplify path skipping
leaking_addresses: do not parse binary files
leaking_addresses: add 32-bit support
leaking_addresses: add is_arch() wrapper subroutine
leaking_addresses: use system command to get arch
leaking_addresses: add support for 5 page table levels
leaking_addresses: add support for kernel config file
leaking_addresses: add range check for vsyscall memory
leaking_addresses: indent dependant options
leaking_addresses: remove command examples
leaking_addresses: remove mention of kptr_restrict
leaking_addresses: fix typo function not called
Merge tag 'linux-kselftest-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest update from Shuah Khan:
"This Kselftest update for 4.17-rc1 consists of:
- Test build error fixes
- Fixes to prevent intel_pstate from building on non-x86 systems.
- New test for ion with vgem driver.
- Change to print the test name to /dev/kmsg to add context to kernel
failures if any uncovered from running the test.
- Kselftest framework enhancements to add KSFT_TAP_LEVEL environment
variable to prevent nested TAP headers being printed in the
Kselftest output.
Nested TAP13 headers could cause problems for some parsers. This
change suppresses the nested headers from test programs and test
shell scripts with changes to framework and Makefiles without
changing the tests"
* tag 'linux-kselftest-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/intel_pstate: Fix build rule for x86
selftests: Print the test we're running to /dev/kmsg
selftests/seccomp: Allow get_metadata to XFAIL
selftests/android/ion: Makefile: fix build error
selftests: futex Makefile add top level TAP header echo to RUN_TESTS
selftests: Makefile set KSFT_TAP_LEVEL to prevent nested TAP headers
selftests: lib.mk set KSFT_TAP_LEVEL to prevent nested TAP headers
selftests: kselftest framework: add handling for TAP header level
selftests: ion: Add simple test with the vgem driver
selftests: ion: Remove some prints
Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull general security layer updates from James Morris:
- Convert security hooks from list to hlist, a nice cleanup, saving
about 50% of space, from Sargun Dhillon.
- Only pass the cred, not the secid, to kill_pid_info_as_cred and
security_task_kill (as the secid can be determined from the cred),
from Stephen Smalley.
- Close a potential race in kernel_read_file(), by making the file
unwritable before calling the LSM check (vs after), from Kees Cook.
* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
security: convert security hooks to use hlist
exec: Set file unwritable before LSM check
usb, signal, security: only pass the cred, not the secid, to kill_pid_info_as_cred and security_task_kill
Cong Wang [Sat, 7 Apr 2018 00:19:41 +0000 (17:19 -0700)]
net_sched: fix a missing idr_remove() in u32_delete_key()
When we delete a u32 key via u32_delete_key(), we forget to
call idr_remove() to remove its handle from IDR.
Fixes: e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles") Reported-by: Marcin Kabiesz <admin@hostcenter.eu> Tested-by: Marcin Kabiesz <admin@hostcenter.eu> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'fscache-next-20180406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull fscache updates from David Howells:
"Three patches that fix some of AFS's usage of fscache:
(1) Need to invalidate the cache if a foreign data change is detected
on the server.
(2) Move the vnode ID uniquifier (equivalent to i_generation) from
the auxiliary data to the index key to prevent a race between
file delete and a subsequent file create seeing the same index
key.
(3) Need to retire cookies that correspond to files that we think got
deleted on the server.
Four patches to fix some things in fscache and cachefiles:
(4) Fix a couple of checker warnings.
(5) Correctly indicate to the end-of-operation callback whether an
operation completed or was cancelled.
(6) Add a check for multiple cookie relinquishment.
(7) Fix a path through the asynchronous write that doesn't wake up a
waiter for a page if the cache decides not to write that page,
but discards it instead.
A couple of patches to add tracepoints to fscache and cachefiles:
(8) Add tracepoints for cookie operators, object state machine
execution, cachefiles object management and cachefiles VFS
operations.
(9) Add tracepoints for fscache operation management and page
wrangling.
And then three development patches:
(10) Attach the index key and auxiliary data to the cookie, pass this
information through various fscache-netfs API functions and get
rid of the callbacks to the netfs to get it.
This means that the cache can get at this information, even if
the netfs goes away. It also means that the cache can be lazy in
updating the coherency data.
(11) Pass the object data size through various fscache-netfs API
rather than calling back to the netfs for it, and store the value
in the object.
This makes it easier to correctly resize the object, as the size
is updated on writes to the cache, rather than calling back out
to the netfs.
(12) Maintain a catalogue of allocated cookies. This makes it possible
to catch cookie collision up front rather than down in the bowels
of the cache being run from a service thread from the object
state machine.
This will also make it possible in the future to reconnect to a
cookie that's not gone dead yet because it's waiting for
finalisation of the storage and also make it possible to bring
cookies online if the cache is added after the cookie has been
obtained"
* tag 'fscache-next-20180406' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
fscache: Maintain a catalogue of allocated cookies
fscache: Pass object size in rather than calling back for it
fscache: Attach the index key and aux data to the cookie
fscache: Add more tracepoints
fscache: Add tracepoints
fscache: Fix hanging wait on page discarded by writeback
fscache: Detect multiple relinquishment of a cookie
fscache: Pass the correct cancelled indications to fscache_op_complete()
fscache, cachefiles: Fix checker warnings
afs: Be more aggressive in retiring cached vnodes
afs: Use the vnode ID uniquifier in the cache key not the aux data
afs: Invalidate cache on server data change
Dan Williams [Mon, 2 Apr 2018 22:28:03 +0000 (15:28 -0700)]
nfit, address-range-scrub: add module option to skip initial ars
After attempting to quickly retrieve known errors the kernel proceeds to
kick off a long running ARS. Add a module option to disable this
behavior at initialization time, or at new region discovery time.
Otherwise, ARS can be started manually regardless of the state of this
setting.
Co-developed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Thu, 5 Apr 2018 23:18:55 +0000 (16:18 -0700)]
nfit, address-range-scrub: rework and simplify ARS state machine
ARS is an operation that can take 10s to 100s of seconds to find media
errors that should rarely be present. If the platform crashes due to
media errors in persistent memory, the expectation is that the BIOS will
report those known errors in a 'short' ARS request.
A 'short' ARS request asks platform firmware to return an ARS payload
with all known errors, but without issuing a 'long' scrub. At driver
init a short request is issued to all PMEM ranges before registering
regions. Then, in the background, a long ARS is scheduled for each
region.
The ARS implementation is simplified to centralize ARS completion work
in the ars_complete() helper. The timeout is removed since there is no
facility to cancel ARS, and this otherwise arranges for system init to
never be blocked waiting for a 'long' ARS. The ars_state flags are used
to coordinate ARS requests from driver init, ARS requests from
userspace, and ARS requests in response to media error notifications.
Given that there is no notification of ARS completion the implementation
still needs to poll. It backs off exponentially to a maximum poll period
of 30 minutes.
Suggested-by: Toshi Kani <toshi.kani@hpe.com> Co-developed-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Thu, 5 Apr 2018 08:25:02 +0000 (01:25 -0700)]
nfit, address-range-scrub: determine one platform max_ars value
acpi_nfit_query_poison() is awkward in that it requires an nfit_spa
argument in order to determine what max_ars value to use. Instead probe
for the minimum max_ars across all scrub-capable ranges in the system
and drop the nfit_spa argument.
This enables a larger rework / simplification of the ARS state machine
whereby the status can be retrieved once and then iterated over all
address ranges to reap completions.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Add device-tree binding documentation for the nvdimm region driver.
Cc: devicetree@vger.kernel.org Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This patch adds peliminary device-tree bindings for persistent memory
regions. The driver registers a libnvdimm bus for each pmem-region
node and each address range under the node is converted to a region
within that bus.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
libnvdimm: Add of_node to region and bus descriptors
We want to be able to cross reference the region and bus devices
with the device tree node that they were spawned from. libNVDIMM
handles creating the actual devices for these internally, so we
need to pass in a pointer to the relevant node in the descriptor.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams [Sat, 7 Apr 2018 14:47:10 +0000 (07:47 -0700)]
libnvdimm, region: quiet region probe
The message about constraining number of online cpus to be less than or
equal to ND_MAX_LANES (256) is only useful for block-aperture
configurations and BTT. Make it debug since it is only relevant when
debugging performance.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
ALSA: pcm: Fix endless loop for XRUN recovery in OSS emulation
The commit 02a5d6925cd3 ("ALSA: pcm: Avoid potential races between OSS
ioctls and read/write") split the PCM preparation code to a locked
version, and it added a sanity check of runtime->oss.prepare flag
along with the change. This leaded to an endless loop when the stream
gets XRUN: namely, snd_pcm_oss_write3() and co call
snd_pcm_oss_prepare() without setting runtime->oss.prepare flag and
the loop continues until the PCM state reaches to another one.
As the function is supposed to execute the preparation
unconditionally, drop the invalid state check there.
ALSA: usb-audio: Add sanity checks in UAC3 clock parsers
The UAC3 clock parser codes lack of the sanity checks for malformed
descriptors like UAC2 parser does. Without it, the driver may lead to
a potential crash.
Fixes: 9a2fe9b801f5 ("ALSA: usb: initial USB Audio Device Class 3.0 support") Tested-by: Ruslan Bilovol <ruslan.bilovol@gmail.com> Reviewed-by: Ruslan Bilovol <ruslan.bilovol@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
ALSA: usb-audio: More strict sanity checks for clock parsers
The sanity checks introduced for malformed descriptors loosely check
the given descriptor size, although the size greater than the defined
description is invalid. It was due to a concern of any funky firmware
in the actual products. But this doesn't look hitting, and any sane
products must have the defined descriptors.
So in this patch, we make the validators more strict, allowing only
with the defined descriptor sizes. The value in clock selector
validator is corrected from 5 to 7 to count the two unlisted fields
after baCSourceID[].
Suggested-by: Ruslan Bilovol <ruslan.bilovol@gmail.com> Reviewed-by: Ruslan Bilovol <ruslan.bilovol@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
There are lots of open-coded functions to find a clock source,
selector and multiplier. Now there are both v2 and v3, so six
variants.
This patch refactors the code to use a common helper for the main
loop, and define each validator function for each target.
There is no functional change.
Fixes: 9a2fe9b801f5 ("ALSA: usb: initial USB Audio Device Class 3.0 support") Reviewed-by: Ruslan Bilovol <ruslan.bilovol@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
/*
* Give up if we don't find an instance of a uuid at each
* position (from 0 to nd_region->ndr_mappings - 1), or if we
* find a dimm with two instances of the same uuid.
*/
dev_err(&nd_region->dev, "%s missing label for %pUb\n",
dev_name(ndd->dev), nd_label->uuid);
Dan Williams [Fri, 6 Apr 2018 18:25:38 +0000 (11:25 -0700)]
libnvdimm, dimm: fix dpa reservation vs uninitialized label area
At initialization time the 'dimm' driver caches a copy of the memory
device's label area and reserves address space for each of the
namespaces defined.
However, as can be seen below, the reservation occurs even when the
index blocks are invalid:
Gate dpa reservation on the presence of valid index blocks.
Cc: <stable@vger.kernel.org> Fixes: 4a826c83db4e ("libnvdimm: namespace indices: read and validate") Reported-by: Krzysztof Rusocki <krzysztof.rusocki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
- Tag new vfio-platform sub-maintainer (Alex Williamson)
* tag 'vfio-v4.17-rc1' of git://github.com/awilliam/linux-vfio:
MAINTAINERS: vfio/platform: Update sub-maintainer
vfio/pci: Add ioeventfd support
vfio/pci: Use endian neutral helpers
vfio/pci: Pull BAR mapping setup from read-write path
vfio/type1: Improve memory pinning process for raw PFN mapping
vfio-mdev/samples: change RDI interrupt condition
vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs