Jimmy Assarsson [Mon, 6 Aug 2018 13:14:50 +0000 (15:14 +0200)]
can: kvaser_usb: Fix potential uninitialized variable use
If alloc_can_err_skb() fails, cf is never initialized.
Move assignment of cf inside check.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jimmy Assarsson <jimmyassarsson@gmail.com> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Oliver Hartkopp [Wed, 24 Oct 2018 08:27:12 +0000 (10:27 +0200)]
can: raw: check for CAN FD capable netdev in raw_sendmsg()
When the socket is CAN FD enabled it can handle CAN FD frame
transmissions. Add an additional check in raw_sendmsg() as a CAN2.0 CAN
driver (non CAN FD) should never see a CAN FD frame. Due to the commonly
used can_dropped_invalid_skb() function the CAN 2.0 driver would drop
that CAN FD frame anyway - but with this patch the user gets a proper
-EINVAL return code.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Stefan Wahren [Thu, 8 Nov 2018 19:38:26 +0000 (20:38 +0100)]
net: smsc95xx: Fix MTU range
The commit f77f0aee4da4 ("net: use core MTU range checking in USB NIC
drivers") introduce a common MTU handling for usbnet. But it's missing
the necessary changes for smsc95xx. So set the MTU range accordingly.
This patch has been tested on a Raspberry Pi 3.
Fixes: f77f0aee4da4 ("net: use core MTU range checking in USB NIC drivers") Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tested on SoCFPGA Stratix10 with ping sweep from 100 to 8300 byte packets.
Fixes: 286a83721720 ("stmmac: add CHAINED descriptor mode support (V4)") Suggested-by: Jose Abreu <jose.abreu@synopsys.com> Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series fixes several bugs in the SPQ mechanism.
It deals with SPQ entries management, preventing resource leaks, memory
corruptions and handles error cases throughout the driver.
Please consider applying to net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sagiv Ozeri [Thu, 8 Nov 2018 14:46:11 +0000 (16:46 +0200)]
qed: Fix potential memory corruption
A stuck ramrod should be deleted from the completion_pending list,
otherwise it will be added again in the future and corrupt the list.
Return error value to inform that ramrod is stuck and should be deleted.
Signed-off-by: Sagiv Ozeri <sagiv.ozeri@cavium.com> Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Bolotin [Thu, 8 Nov 2018 14:46:10 +0000 (16:46 +0200)]
qed: Fix SPQ entries not returned to pool in error flows
qed_sp_destroy_request() API was added for SPQ users that need to
free/return the entry they acquired in their error flows.
Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Bolotin [Thu, 8 Nov 2018 14:46:09 +0000 (16:46 +0200)]
qed: Fix blocking/unlimited SPQ entries leak
When there are no SPQ entries left in the free_pool, new entries are
allocated and are added to the unlimited list. When an entry in the pool
is available, the content is copied from the original entry, and the new
entry is sent to the device. qed_spq_post() is not aware of that, so the
additional entry is stored in the original entry as p_post_ent, which can
later be returned to the pool.
Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Bolotin [Thu, 8 Nov 2018 14:46:08 +0000 (16:46 +0200)]
qed: Fix memory/entry leak in qed_init_sp_request()
Free the allocated SPQ entry or return the acquired SPQ entry to the free
list in error flows.
Signed-off-by: Denis Bolotin <denis.bolotin@cavium.com> Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 9 Nov 2018 01:34:27 +0000 (17:34 -0800)]
inet: frags: better deal with smp races
Multiple cpus might attempt to insert a new fragment in rhashtable,
if for example RPS is buggy, as reported by 배석진 in
https://patchwork.ozlabs.org/patch/994601/
We use rhashtable_lookup_get_insert_key() instead of
rhashtable_insert_fast() to let cpus losing the race
free their own inet_frag_queue and use the one that
was inserted by another cpu.
Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: 배석진 <soukjin.bae@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Thu, 8 Nov 2018 02:13:19 +0000 (10:13 +0800)]
net: hns3: bugfix for not checking return value
hns3_reset_notify_init_enet() only return error early if the return
value of hns3_restore_vlan() is not 0.
This patch adds checking for the return value of hns3_restore_vlan.
Fixes: 7fa6be4fd2f6 ("net: hns3: fix incorrect return value/type of some functions") Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
FDDI: defza: Fix a bunch of small issues
Here is a bunch of small fixes addressing issues that I missed in my
final round of testing. None of these affect run-time behaviour. One was
actually found by the kbuild bot, which turned out to be more pedantic
than my compiler. See individual change descriptions for details.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
FDDI: defza: Move SMT Tx data buffer declaration next to its skb
Move the temporary data buffer used when tapping into the SMT Tx queue
from the outer function level into the conditional block it's actually
used in and its containing skb is also declared, making the structure of
code better.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
The SPDX annotation for this driver does not match the license text,
which specifies GNU GPL 2 or later. Make the two match by correcting
the SPDX tag.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 8 Nov 2018 01:08:51 +0000 (17:08 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2018-11-07
This series contains fixes to igb, i40e and ice drivers.
Anirudh fixes an issue during rebuild of the ice driver, where we need
to set the carrier state, as well as start or stop the queues all based
on the link status. Removed functions that were duplicating current
functionality in the VSI rebuild/replay framework.
Dave fixes a potential resource collision during the remove path, so add
a check to see if we are in the middle of a reset. Fixed the remove
path to ensure we call netif_napi_del() to free vectors before we set
vsi->netdev to NULL.
Akeem fixes an issue when the receive or transmit pause parameter is
set, results in link loss on the interface. Fixed the spelling of
"Enabling" in error message.
Victor fixes potential memory leak by also freeing the related VSI
contexts in the unload path.
Md Fahad fixes a flag during port VLAN insertion, which was not being
set properly.
Brett fixes a transmit timeout during stress due to the hardware tail
and software tail were incorrectly out of sync.
Miroslav Lichvar fixes the igb PHC timecounter update interval to be
sure the timecounter is updated in time.
Chinh fixes the req_speeds variable to be u16 instead of u8 so that it
can handle all the link speeds.
Jake fixes i40e to add back the missing feature flags, which was causing
IP-in-IP offloads to be reported as not supported.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Mon, 29 Oct 2018 17:52:43 +0000 (10:52 -0700)]
i40e: enable NETIF_F_NTUPLE and NETIF_F_HW_TC at driver load
The assignment of the feature flag NETIF_F_NTUPLE and NETIF_F_HW_TC
occurs prior to the initial setup of the local hw_features variable.
This means the features are set as user-changeable, but are not set in
the currently active feature list. This results in the features being
disabled at the driver's initial load.
Move the assignment after the initial assignment of hw_features, and
assign to the local variable. This ensures that NETIF_F_NTUPLE and
NETIF_F_HW_TC are marked as user-changeable, and also enables them by
default when the driver loads.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Mon, 29 Oct 2018 17:52:42 +0000 (10:52 -0700)]
i40e: restore NETIF_F_GSO_IPXIP[46] to netdev features
Since commit bacd75cfac8a ("i40e/i40evf: Add capability exchange for
outer checksum", 2017-04-06) the i40e driver has not reported support
for IP-in-IP offloads. This likely occurred due to a bad rebase, as the
commit extracts hw_enc_features into its own variable. As part of this
change, it dropped the NETIF_F_FSO_IPXIP flags from the
netdev->hw_enc_features. This was unfortunately not caught during code
review.
Fix this by adding back the missing feature flags.
For reference, NETIF_F_GSO_IPXIP4 was added in commit 7e13318daa4a
("net: define gso types for IPx over IPv4 and IPv6", 2016-05-20),
replacing NETIF_F_GSO_IPIP and NETIF_F_GSO_SIT.
NETIF_F_GSO_IPXIP6 was added in commit bf2d1df39502 ("intel: Add support
for IPv6 IP-in-IP offload", 2016-05-20).
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Chinh T Cao [Mon, 5 Nov 2018 20:18:35 +0000 (12:18 -0800)]
ice: Change req_speeds to be u16
Since the req_speeds field in struct ice_link_status is a u8,
req_speeds & ICE_AQ_LINK_SPEED_40GB always returns 0. This was caught
by a coverity scan.
Fix this by changing req_speeds to be u16.
Reported-by: Bruce Allan <bruce.w.allan@intel.com> Signed-off-by: Chinh T Cao <chinh.t.cao@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Miroslav Lichvar [Fri, 26 Oct 2018 17:13:00 +0000 (19:13 +0200)]
igb: shorten maximum PHC timecounter update interval
The timecounter needs to be updated at least once per ~550 seconds in
order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
timestamp.
Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"),
scheduling of delayed work seems to be less accurate and a requested
delay of 540 seconds may actually be longer than 550 seconds. Also, the
PHC may be adjusted to run up to 6% faster than real time and the system
clock up to 10% slower. Shorten the delay to 360 seconds to be sure the
timecounter is updated in time.
This fixes an issue with HW timestamps on 82580/I350/I354 being off by
~1100 seconds for few seconds every ~9 minutes.
Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Acked-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Brett Creeley [Fri, 26 Oct 2018 17:40:59 +0000 (10:40 -0700)]
ice: Fix the bytecount sent to netdev_tx_sent_queue
Currently if the driver does a TSO offload the bytecount sent to
netdev_tx_sent_queue will be incorrect. This is because in ice_tso we
overwrite the initial value that we set in ice_tx_map. This creates a
mismatch between the Tx and Tx clean flow. In the Tx clean flow we
calculate the bytecount (called total_bytes) as we clean the
descriptors so the value used in the Tx clean path is correct. Fix this
by using += in ice_tso instead of =. This fixes the mismatch in
bytecount mentioned above.
Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Brett Creeley [Fri, 26 Oct 2018 17:40:58 +0000 (10:40 -0700)]
ice: Fix tx_timeout in PF driver
Prior to this commit the driver was running into tx_timeouts when a
queue was stressed enough. This was happening because the HW tail
and SW tail (NTU) were incorrectly out of sync. Consequently this was
causing the HW head to collide with the HW tail, which to the hardware
means that all descriptors posted for Tx have been processed.
Due to the Tx logic used in the driver SW tail and HW tail are allowed
to be out of sync. This is done as an optimization because it allows the
driver to write HW tail as infrequently as possible, while still
updating the SW tail index to keep track. However, there are situations
where this results in the tail never getting updated, resulting in Tx
timeouts.
An issue was found in the Tx logic that was causing the afore mentioned
condition for updating HW tail to never happen, causing tx_timeouts.
In ice_xmit_frame_ring we calculate how many descriptors we need for the
Tx transaction based on the skb the kernel hands us. This is then passed
into ice_maybe_stop_tx along with some extra padding to determine if we
have enough descriptors available for this transaction. If we don't then
we return -EBUSY to the stack, otherwise we move on and eventually
prepare the Tx descriptors accordingly in ice_tx_map and set
next_to_watch. In ice_tx_map we make another call to ice_maybe_stop_tx
with a value of MAX_SKB_FRAGS + 4. The key here is that this value is
possibly less than the value we sent in the first call to
ice_maybe_stop_tx in ice_xmit_frame_ring. Now, if the number of unused
descriptors is between MAX_SKB_FRAGS + 4 and the value used in the first
call to ice_maybe_stop_tx in ice_xmit_frame_ring then we do not update
the HW tail because of the "Tx HW tail write condition" above. This is
because in ice_maybe_stop_tx we return success from ice_maybe_stop_tx
instead of calling __ice_maybe_stop_tx and subsequently calling
netif_stop_subqueue, which sets the __QUEUE_STATE_DEV_XOFF bit. This
bit is then checked in the "Tx HW tail write condition" by calling
netif_xmit_stopped and subsequently updating HW tail if the
afore mentioned bit is set.
In ice_clean_tx_irq, if next_to_watch is not NULL, we end up cleaning
the descriptors that HW sets the DD bit on and we have the budget. The
HW head will eventually run into the HW tail in response to the
description in the paragraph above.
The next time through ice_xmit_frame_ring we make the initial call to
ice_maybe_stop_tx with another skb from the stack. This time we do not
have enough descriptors available and we return NETDEV_TX_BUSY to the
stack and end up setting next_to_watch to NULL.
This is where we are stuck. In ice_clean_tx_irq we never clean anything
because next_to_watch is always NULL and in ice_xmit_frame_ring we never
update HW tail because we already return NETDEV_TX_BUSY to the stack and
eventually we hit a tx_timeout.
This issue was fixed by making sure that the second call to
ice_maybe_stop_tx in ice_tx_map is passed a value that is >= the value
that was used on the initial call to ice_maybe_stop_tx in
ice_xmit_frame_ring. This was done by adding the following defines to
make the logic more clear and to reduce the chance of mucking this up
again:
The ICE_CACHE_LINE_BYTES being 64 is an assumption being made so we
don't have to figure this out on every pass through the Tx path. Instead
I added a sanity check in ice_probe to verify cache line size and print
a message if it's not 64 Bytes. This will make it easier to file issues
if they are seen when the cache line size is not 64 Bytes when reading
from the GLPCI_CNF2 register.
Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Dave Ertman [Fri, 26 Oct 2018 17:40:57 +0000 (10:40 -0700)]
ice: Fix napi delete calls for remove
In the remove path, the vsi->netdev is being set to NULL before the call
to free vectors. This is causing the netif_napi_del call to never be made.
Add a call to ice_napi_del to the same location as the calls to
unregister_netdev and just prior to them. This will use the reverse flow
as the register and netif_napi_add calls.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
ice: Remove duplicate addition of VLANs in replay path
ice_restore_vlan and active_vlans were originally put in place to
reprogram VLAN filters in the replay path. This is now done as part
of the much broader VSI rebuild/replay framework. So remove both
ice_restore_vlan and active_vlans
Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The remove path does not currently check to see if a
reset is in progress before proceeding. This can cause
a resource collision resulting in various types of errors.
Check for reset in progress and wait for a reasonable
amount of time before allowing the remove to progress.
Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This patch allows users to enable/disable internal TX and/or RX clock
delay for BCM54616S PHYs so as to satisfy RGMII timing specifications.
On a particular platform, whether TX and/or RX clock delay is required
depends on how PHY connected to the MAC IP. This requirement can be
specified through "phy-mode" property in the platform device tree.
The patch is inspired by commit 733336262b28 ("net: phy: Allow BCM5481x
PHYs to setup internal TX/RX clock delay").
Signed-off-by: Tao Ren <taoren@fb.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 6 Nov 2018 16:12:10 +0000 (08:12 -0800)]
Merge tag 'trace-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Masami found a slight bug in his code where he transposed the
arguments of a call to strpbrk.
The reason this wasn't detected in our tests is that the only way this
would transpire is when a kprobe event with a symbol offset is
attached to a function that belongs to a module that isn't loaded yet.
When the kprobe trace event is added, the offset would be truncated
after it was parsed, and when the module is loaded, it would use the
symbol without the offset (as the nul character added by the parsing
would not be replaced with the original character)"
* tag 'trace-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Fix strpbrk() argument order
Linus Torvalds [Tue, 6 Nov 2018 16:10:01 +0000 (08:10 -0800)]
Merge branch 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fix from Russell King:
"Ard spotted a typo in one of the assembly files which leads to a
kernel oops when that code path is executed. Fix this"
* 'spectre' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 8809/1: proc-v7: fix Thumb annotation of cpu_v7_hvc_switch_mm
1) Handle errors mid-stream of an all dump, from Alexey Kodanev.
2) Fix build of openvswitch with certain combinations of netfilter
options, from Arnd Bergmann.
3) Fix interactions between GSO and BQL, from Eric Dumazet.
4) Don't put a '/' in RTL8201F's sysfs file name, from Holger
Hoffstätte.
5) S390 qeth driver fixes from Julian Wiedmann.
6) Allow ipv6 link local addresses for netconsole when both source and
destination are link local, from Matwey V. Kornilov.
7) Fix the BPF program address seen in /proc/kallsyms, from Song Liu.
8) Initialize mutex before use in dsa microchip driver, from Tristram
Ha.
9) Out-of-bounds access in hns3, from Yunsheng Lin.
10) Various netfilter fixes from Stefano Brivio, Jozsef Kadlecsik, Jiri
Slaby, Florian Westphal, Eric Westbrook, Andrey Ryabinin, and Pablo
Neira Ayuso.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (50 commits)
net: alx: make alx_drv_name static
net: bpfilter: fix iptables failure if bpfilter_umh is disabled
sock_diag: fix autoloading of the raw_diag module
net: core: netpoll: Enable netconsole IPv6 link local address
ipv6: properly check return value in inet6_dump_all()
rtnetlink: restore handling of dumpit return value in rtnl_dump_all()
net/ipv6: Move anycast init/cleanup functions out of CONFIG_PROC_FS
bonding/802.3ad: fix link_failure_count tracking
net: phy: realtek: fix RTL8201F sysfs name
sctp: define SCTP_SS_DEFAULT for Stream schedulers
sctp: fix strchange_flags name for Stream Change Event
mlxsw: spectrum: Fix IP2ME CPU policer configuration
openvswitch: fix linking without CONFIG_NF_CONNTRACK_LABELS
qed: fix link config error handling
net: hns3: Fix for out-of-bounds access when setting pfc back pressure
net/mlx4_en: use __netdev_tx_sent_queue()
net: do not abort bulk send on BQL status
net: bql: add __netdev_tx_sent_queue()
s390/qeth: report 25Gbit link speed
s390/qeth: sanitize ARP requests
...
Ard Biesheuvel [Mon, 5 Nov 2018 13:54:56 +0000 (14:54 +0100)]
ARM: 8809/1: proc-v7: fix Thumb annotation of cpu_v7_hvc_switch_mm
Due to what appears to be a copy/paste error, the opening ENTRY()
of cpu_v7_hvc_switch_mm() lacks a matching ENDPROC(), and instead,
the one for cpu_v7_smc_switch_mm() is duplicated.
Given that it is ENDPROC() that emits the Thumb annotation, the
cpu_v7_hvc_switch_mm() routine will be called in ARM mode on a
Thumb2 kernel, resulting in the following splat:
Cc: <stable@vger.kernel.org> Fixes: 10115105cb3a ("ARM: spectre-v2: add firmware based hardening") Reviewed-by: Dave Martin <Dave.Martin@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
The following patchset contains the first batch of Netfilter fixes for
your net tree:
1) Fix splat with IPv6 defragmenting locally generated fragments,
from Florian Westphal.
2) Fix Incorrect check for missing attribute in nft_osf.
3) Missing INT_MIN & INT_MAX definition for netfilter bridge uapi
header, from Jiri Slaby.
4) Revert map lookup in nft_numgen, this is already possible with
the existing infrastructure without this extension.
5) Fix wrong listing of set reference counter, make counter
synchronous again, from Stefano Brivio.
6) Fix CIDR 0 in hash:net,port,net, from Eric Westbrook.
7) Fix allocation failure with large set, use kvcalloc().
From Andrey Ryabinin.
8) No need to disable BH when fetch ip set comment, patch from
Jozsef Kadlecsik.
9) Sanity check for valid sysfs entry in xt_IDLETIMER, from
Taehee Yoo.
10) Fix suspicious rcu usage via ip_set() macro at netlink dump,
from Jozsef Kadlecsik.
11) Fix setting default timeout via nfnetlink_cttimeout, this
comes with preparation patch to add nf_{tcp,udp,...}_pernet()
helper.
12) Allow ebtables table nat to be of filter type via nft_compat.
From Florian Westphal.
13) Incorrect calculation of next bucket in early_drop, do no bump
hash value, update bucket counter instead. From Vasily Khoruzhick.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Taehee Yoo [Mon, 5 Nov 2018 13:31:41 +0000 (22:31 +0900)]
net: bpfilter: fix iptables failure if bpfilter_umh is disabled
When iptables command is executed, ip_{set/get}sockopt() try to upload
bpfilter.ko if bpfilter is enabled. if it couldn't find bpfilter.ko,
command is failed.
bpfilter.ko is generated if CONFIG_BPFILTER_UMH is enabled.
ip_{set/get}sockopt() only checks CONFIG_BPFILTER.
So that if CONFIG_BPFILTER is enabled and CONFIG_BPFILTER_UMH is disabled,
iptables command is always failed.
test config:
CONFIG_BPFILTER=y
# CONFIG_BPFILTER_UMH is not set
test command:
%iptables -L
iptables: No chain/target/match by that name.
Fixes: d2ba09c17a06 ("net: add skeleton of bpfilter kernel module") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrei Vagin [Mon, 5 Nov 2018 06:37:15 +0000 (22:37 -0800)]
sock_diag: fix autoloading of the raw_diag module
IPPROTO_RAW isn't registred as an inet protocol, so
inet_protos[protocol] is always NULL for it.
Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Fixes: bf2ae2e4bf93 ("sock_diag: request _diag module only when the family or proto has been registered") Signed-off-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Kodanev [Fri, 2 Nov 2018 16:11:05 +0000 (19:11 +0300)]
ipv6: properly check return value in inet6_dump_all()
Make sure we call fib6_dump_end() if it happens that skb->len
is zero. rtnl_dump_all() can reset cb->args on the next loop
iteration there.
Fixes: 08e814c9e8eb ("net/ipv6: Bail early if user only wants cloned entries") Fixes: ae677bbb4441 ("net: Don't return invalid table id error when dumping all families") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Kodanev [Fri, 2 Nov 2018 16:11:04 +0000 (19:11 +0300)]
rtnetlink: restore handling of dumpit return value in rtnl_dump_all()
For non-zero return from dumpit() we should break the loop
in rtnl_dump_all() and return the result. Otherwise, e.g.,
we could get the memory leak in inet6_dump_fib() [1]. The
pointer to the allocated struct fib6_walker there (saved
in cb->args) can be lost, reset on the next iteration.
Fix it by partially restoring the previous behavior before
commit c63586dc9b3e ("net: rtnl_dump_all needs to propagate
error from dumpit function"). The returned error from
dumpit() is still passed further.
Fixes: c63586dc9b3e ("net: rtnl_dump_all needs to propagate error from dumpit function") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Barnhill [Mon, 5 Nov 2018 20:36:45 +0000 (20:36 +0000)]
net/ipv6: Move anycast init/cleanup functions out of CONFIG_PROC_FS
Move the anycast.c init and cleanup functions which were inadvertently
added inside the CONFIG_PROC_FS definition.
Fixes: 2384d02520ff ("net/ipv6: Add anycast addresses to a global hashtable") Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
compiler: remove __no_sanitize_address_or_inline again
The __no_sanitize_address_or_inline and __no_kasan_or_inline defines
are almost identical. The only difference is that __no_kasan_or_inline
does not have the 'notrace' attribute.
To be able to replace __no_sanitize_address_or_inline with the older
definition, add 'notrace' to __no_kasan_or_inline and change to two
users of __no_sanitize_address_or_inline in the s390 code.
The 'notrace' option is necessary for e.g. the __load_psw_mask function
in arch/s390/include/asm/processor.h. Without the option it is possible
to trace __load_psw_mask which leads to kernel stack overflow.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Pointed-out-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix strpbrk()'s argument order, it must pass acceptable string
in 2nd argument. Note that this can cause a kernel panic where
it recovers backup character to code->data.
Jarod Wilson [Sun, 4 Nov 2018 19:59:46 +0000 (14:59 -0500)]
bonding/802.3ad: fix link_failure_count tracking
Commit 4d2c0cda07448ea6980f00102dc3964eb25e241c set slave->link to
BOND_LINK_DOWN for 802.3ad bonds whenever invalid speed/duplex values
were read, to fix a problem with slaves getting into weird states, but
in the process, broke tracking of link failures, as going straight to
BOND_LINK_DOWN when a link is indeed down (cable pulled, switch rebooted)
means we broke out of bond_miimon_inspect()'s BOND_LINK_DOWN case because
!link_state was already true, we never incremented commit, and never got
a chance to call bond_miimon_commit(), where slave->link_failure_count
would be incremented. I believe the simple fix here is to mark the slave
as BOND_LINK_FAIL, and let bond_miimon_inspect() transition the link from
_FAIL to either _UP or _DOWN, and in the latter case, we now get proper
incrementing of link_failure_count again.
Fixes: 4d2c0cda0744 ("bonding: speed/duplex update at NETDEV_UP event") CC: Mahesh Bandewar <maheshb@google.com> CC: David S. Miller <davem@davemloft.net> CC: netdev@vger.kernel.org CC: stable@vger.kernel.org Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Since 4.19 the following error in sysfs has appeared when using the
r8169 NIC driver:
$cd /sys/module/realtek/drivers
$ls -l
ls: cannot access 'mdio_bus:RTL8201F 10/100Mbps Ethernet': No such file or directory
[..garbled dir entries follow..]
Apparently the forward slash in "10/100Mbps Ethernet" is interpreted
as directory separator that leads nowhere, and was introduced in commit 513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions").
Fix this by removing the offending slash in the driver name.
Other drivers in net/phy seem to have the same problem, but I cannot
test/verify them.
Fixes: 513588dd44b ("net: phy: realtek: add RTL8201F phy-id and functions") Signed-off-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 4 Nov 2018 22:46:04 +0000 (14:46 -0800)]
Merge tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs
Pull UBIFS updates from Richard Weinberger:
- Full filesystem authentication feature, UBIFS is now able to have the
whole filesystem structure authenticated plus user data encrypted and
authenticated.
- Minor cleanups
* tag 'tags/upstream-4.20-rc1' of git://git.infradead.org/linux-ubifs: (26 commits)
ubifs: Remove unneeded semicolon
Documentation: ubifs: Add authentication whitepaper
ubifs: Enable authentication support
ubifs: Do not update inode size in-place in authenticated mode
ubifs: Add hashes and HMACs to default filesystem
ubifs: authentication: Authenticate super block node
ubifs: Create hash for default LPT
ubfis: authentication: Authenticate master node
ubifs: authentication: Authenticate LPT
ubifs: Authenticate replayed journal
ubifs: Add auth nodes to garbage collector journal head
ubifs: Add authentication nodes to journal
ubifs: authentication: Add hashes to index nodes
ubifs: Add hashes to the tree node cache
ubifs: Create functions to embed a HMAC in a node
ubifs: Add helper functions for authentication support
ubifs: Add separate functions to init/crc a node
ubifs: Format changes for authentication support
ubifs: Store read superblock node
ubifs: Drop write_node
...
Linus Torvalds [Sun, 4 Nov 2018 16:20:09 +0000 (08:20 -0800)]
Merge tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Bugfix:
- Fix build issues on architectures that don't provide 64-bit cmpxchg
Cleanups:
- Fix a spelling mistake"
* tag 'nfs-for-4.20-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: fix spelling mistake, EACCESS -> EACCES
SUNRPC: Use atomic(64)_t for seq_send(64)
Linus Torvalds [Sun, 4 Nov 2018 16:12:44 +0000 (08:12 -0800)]
Merge tag 'ntb-4.20' of git://github.com/jonmason/ntb
Pull NTB updates from Jon Mason:
"Fairly minor changes and bug fixes:
NTB IDT thermal changes and hook into hwmon, ntb_netdev clean-up of
private struct, and a few bug fixes"
* tag 'ntb-4.20' of git://github.com/jonmason/ntb:
ntb: idt: Alter the driver info comments
ntb: idt: Discard temperature sensor IRQ handler
ntb: idt: Add basic hwmon sysfs interface
ntb: idt: Alter temperature read method
ntb_netdev: Simplify remove with client device drvdata
NTB: transport: Try harder to alloc an aligned MW buffer
ntb: ntb_transport: Mark expected switch fall-throughs
ntb: idt: Set PCIe bus address to BARLIMITx
NTB: ntb_hw_idt: replace IS_ERR_OR_NULL with regular NULL checks
ntb: intel: fix return value for ndev_vec_mask()
ntb_netdev: fix sleep time mismatch
Xin Long [Sat, 3 Nov 2018 06:01:31 +0000 (14:01 +0800)]
sctp: define SCTP_SS_DEFAULT for Stream schedulers
According to rfc8260#section-4.3.2, SCTP_SS_DEFAULT is required to
defined as SCTP_SS_FCFS or SCTP_SS_RR.
SCTP_SS_FCFS is used for SCTP_SS_DEFAULT's value in this patch.
Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations") Reported-by: Jianwen Ji <jiji@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 3 Nov 2018 05:59:45 +0000 (13:59 +0800)]
sctp: fix strchange_flags name for Stream Change Event
As defined in rfc6525#section-6.1.3, SCTP_STREAM_CHANGE_DENIED
and SCTP_STREAM_CHANGE_FAILED should be used instead of
SCTP_ASSOC_CHANGE_DENIED and SCTP_ASSOC_CHANGE_FAILED.
To keep the compatibility, fix it by adding two macros.
Fixes: b444153fb5a6 ("sctp: add support for generating add stream change event notification") Reported-by: Jianwen Ji <jiji@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix BPF prog kallsyms and bpf_prog_get_info_by_fd() jited address export
to not use page start but actual start instead to work correctly with
profiling e.g. finding hot instructions from stack traces. Also fix latter
with regards to dumping address and jited len for !multi prog, from Song.
2) Fix bpf_prog_get_info_by_fd() for !root to zero info.nr_jited_func_lens
instead of leaving the user provided length, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shalom Toledo [Fri, 2 Nov 2018 19:49:15 +0000 (19:49 +0000)]
mlxsw: spectrum: Fix IP2ME CPU policer configuration
The CPU policer used to police packets being trapped via a local route
(IP2ME) was incorrectly configured to police based on bytes per second
instead of packets per second.
Change the policer to police based on packets per second and avoid
packet loss under certain circumstances.
Fixes: 9148e7cf73ce ("mlxsw: spectrum: Add policers for trap groups") Signed-off-by: Shalom Toledo <shalomt@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Fri, 2 Nov 2018 15:36:24 +0000 (16:36 +0100)]
qed: fix link config error handling
gcc-8 notices that qed_mcp_get_transceiver_data() may fail to
return a result to the caller:
drivers/net/ethernet/qlogic/qed/qed_mcp.c: In function 'qed_mcp_trans_speed_mask':
drivers/net/ethernet/qlogic/qed/qed_mcp.c:1955:2: error: 'transceiver_type' may be used uninitialized in this function [-Werror=maybe-uninitialized]
When an error is returned by qed_mcp_get_transceiver_data(), we
should propagate that to the caller of qed_mcp_trans_speed_mask()
rather than continuing with uninitialized data.
Fixes: c56a8be7e7aa ("qed: Add supported link and advertise link to display in ethtool.") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 4 Nov 2018 01:37:09 +0000 (18:37 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"A memory (under-)allocation fix and a comment fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/topology: Fix off by one bug
sched/rt: Update comment in pick_next_task_rt()
Linus Torvalds [Sun, 4 Nov 2018 01:25:17 +0000 (18:25 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"A number of fixes and some late updates:
- make in_compat_syscall() behavior on x86-32 similar to other
platforms, this touches a number of generic files but is not
intended to impact non-x86 platforms.
- objtool fixes
- PAT preemption fix
- paravirt fixes/cleanups
- cpufeatures updates for new instructions
- earlyprintk quirk
- make microcode version in sysfs world-readable (it is already
world-readable in procfs)
- minor cleanups and fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
compat: Cleanup in_compat_syscall() callers
x86/compat: Adjust in_compat_syscall() to generic code under !COMPAT
objtool: Support GCC 9 cold subfunction naming scheme
x86/numa_emulation: Fix uniform-split numa emulation
x86/paravirt: Remove unused _paravirt_ident_32
x86/mm/pat: Disable preemption around __flush_tlb_all()
x86/paravirt: Remove GPL from pv_ops export
x86/traps: Use format string with panic() call
x86: Clean up 'sizeof x' => 'sizeof(x)'
x86/cpufeatures: Enumerate MOVDIR64B instruction
x86/cpufeatures: Enumerate MOVDIRI instruction
x86/earlyprintk: Add a force option for pciserial device
objtool: Support per-function rodata sections
x86/microcode: Make revision and processor flags world-readable
Linus Torvalds [Sun, 4 Nov 2018 01:13:43 +0000 (18:13 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates and fixes from Ingo Molnar:
"These are almost all tooling updates: 'perf top', 'perf trace' and
'perf script' fixes and updates, an UAPI header sync with the merge
window versions, license marker updates, much improved Sparc support
from David Miller, and a number of fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits)
perf intel-pt/bts: Calculate cpumode for synthesized samples
perf intel-pt: Insert callchain context into synthesized callchains
perf tools: Don't clone maps from parent when synthesizing forks
perf top: Start display thread earlier
tools headers uapi: Update linux/if_link.h header copy
tools headers uapi: Update linux/netlink.h header copy
tools headers: Sync the various kvm.h header copies
tools include uapi: Update linux/mmap.h copy
perf trace beauty: Use the mmap flags table generated from headers
perf beauty: Wire up the mmap flags table generator to the Makefile
perf beauty: Add a generator for MAP_ mmap's flag constants
tools include uapi: Update asound.h copy
tools arch uapi: Update asm-generic/unistd.h and arm64 unistd.h copies
tools include uapi: Update linux/fs.h copy
perf callchain: Honour the ordering of PERF_CONTEXT_{USER,KERNEL,etc}
perf cs-etm: Correct CPU mode for samples
perf unwind: Take pgoff into account when reporting elf to libdwfl
perf top: Do not use overwrite mode by default
perf top: Allow disabling the overwrite mode
perf trace: Beautify mount's first pathname arg
...
Linus Torvalds [Sun, 4 Nov 2018 01:12:09 +0000 (18:12 -0700)]
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
"An irqchip driver fix and a memory (over-)allocation fix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-mvebu-sei: Fix a NULL vs IS_ERR() bug in probe function
irq/matrix: Fix memory overallocation
Yunsheng Lin [Fri, 2 Nov 2018 09:47:48 +0000 (17:47 +0800)]
net: hns3: Fix for out-of-bounds access when setting pfc back pressure
The vport should be initialized to hdev->vport for each bp group,
otherwise it will cause out-of-bounds access and bp setting not
correct problem.
[ 35.254124] BUG: KASAN: slab-out-of-bounds in hclge_pause_setup_hw+0x2a0/0x3f8 [hclge]
[ 35.254126] Read of size 2 at addr ffff803b6651581a by task kworker/0:1/14
[ 35.282014] Memory state around the buggy address:
[ 35.282017] ffff803b66515700: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 35.282019] ffff803b66515780: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 35.282021] >ffff803b66515800: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 35.282022] ^
[ 35.282024] ffff803b66515880: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 35.282026] ffff803b66515900: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 35.282028] ==================================================================
[ 35.282029] Disabling lock debugging due to kernel taint
[ 35.282747] hclge driver initialization finished.
Fixes: 67bf2541f4b9 ("net: hns3: Fixes the back pressure setting when sriov is enabled") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 3 Nov 2018 22:40:01 +0000 (15:40 -0700)]
Merge branch 'net-bql-better-deal-with-GSO'
Eric Dumazet says:
====================
net: bql: better deal with GSO
While BQL bulk dequeue works well for TSO packets, it is
not very efficient as soon as GSO is involved.
On a GSO only workload (UDP or TCP), this patch series
can save about 8 % of cpu cycles on a 40Gbit mlx4 NIC,
by keeping optimal batching, and avoiding expensive
doorbells, qdisc requeues and reschedules.
This patch series :
- Add __netdev_tx_sent_queue() so that drivers
can implement efficient BQL and xmit_more support.
- Implement a work around in dev_hard_start_xmit()
for drivers not using __netdev_tx_sent_queue()
- changes mlx4 to use __netdev_tx_sent_queue()
v2: Tariq and Willem feedback addressed.
added __netdev_tx_sent_queue() (Willem suggestion)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 31 Oct 2018 15:39:14 +0000 (08:39 -0700)]
net/mlx4_en: use __netdev_tx_sent_queue()
doorbell only depends on xmit_more and netif_tx_queue_stopped()
Using __netdev_tx_sent_queue() avoids messing with BQL stop flag,
and is more generic.
This patch increases performance on GSO workload by keeping
doorbells to the minimum required.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 31 Oct 2018 15:39:13 +0000 (08:39 -0700)]
net: do not abort bulk send on BQL status
Before calling dev_hard_start_xmit(), upper layers tried
to cook optimal skb list based on BQL budget.
Problem is that GSO packets can end up comsuming more than
the BQL budget.
Breaking the loop is not useful, since requeued packets
are ahead of any packets still in the qdisc.
It is also more expensive, since next TX completion will
push these packets later, while skbs are not in cpu caches.
It is also a behavior difference with TSO packets, that can
break the BQL limit by a large amount.
Note that drivers should use __netdev_tx_sent_queue()
in order to have optimal xmit_more support, and avoid
useless atomic operations as shown in the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 31 Oct 2018 15:39:12 +0000 (08:39 -0700)]
net: bql: add __netdev_tx_sent_queue()
When qdisc_run() tries to use BQL budget to bulk-dequeue a batch
of packets, GSO can later transform this list in another list
of skbs, and each skb is sent to device ndo_start_xmit(),
one at a time, with skb->xmit_more being set to one but
for last skb.
Problem is that very often, BQL limit is hit in the middle of
the packet train, forcing dev_hard_start_xmit() to stop the
bulk send and requeue the end of the list.
BQL role is to avoid head of line blocking, making sure
a qdisc can deliver high priority packets before low priority ones.
But there is no way requeued packets can be bypassed by fresh
packets in the qdisc.
Aborting the bulk send increases TX softirqs, and hot cache
lines (after skb_segment()) are wasted.
Note that for TSO packets, we never split a packet in the middle
because of BQL limit being hit.
Drivers should be able to update BQL counters without
flipping/caring about BQL status, if the current skb
has xmit_more set.
Upper layers are ultimately responsible to stop sending another
packet train when BQL limit is hit.
Code template in a driver might look like the following :
Linus Torvalds [Sat, 3 Nov 2018 17:55:23 +0000 (10:55 -0700)]
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull more arm64 updates from Catalin Marinas:
- fix W+X page (mark RO) allocated by the arm64 kprobes code
- Makefile fix for .i files in out of tree modules
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: kprobe: make page to RO mode when allocate it
arm64: kdump: fix small typo
arm64: makefile fix build of .i file in external module case
Linus Torvalds [Sat, 3 Nov 2018 17:45:55 +0000 (10:45 -0700)]
Merge tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes and updates from Steve French:
"Three small fixes (one Kerberos related, one for stable, and another
fixes an oops in xfstest 377), two helpful debugging improvements,
three patches for cifs directio and some minor cleanup"
* tag '4.20-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix signed/unsigned mismatch on aio_read patch
cifs: don't dereference smb_file_target before null check
CIFS: Add direct I/O functions to file_operations
CIFS: Add support for direct I/O write
CIFS: Add support for direct I/O read
smb3: missing defines and structs for reparse point handling
smb3: allow more detailed protocol info on open files for debugging
smb3: on kerberos mount if server doesn't specify auth type use krb5
smb3: add trace point for tree connection
cifs: fix spelling mistake, EACCESS -> EACCES
cifs: fix return value for cifs_listxattr
David S. Miller [Sat, 3 Nov 2018 17:44:06 +0000 (10:44 -0700)]
Merge branch 's390-qeth-fixes'
Julian Wiedmann says:
====================
s390/qeth: fixes 2018-11-02
please apply one round of qeth fixes for -net.
Patch 1 is rather large and removes a use-after-free hazard from many of our
debug trace entries.
Patch 2 is yet another fix-up for the L3 subdriver's new IP address management
code.
Patch 3 and 4 resolve some fallout from the recent changes wrt how/when qeth
allocates its net_device.
Patch 5 makes sure we don't set reserved bits when building HW commands from
user-provided data.
And finally, patch 6 allows ethtool to play nice with new HW.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Fri, 2 Nov 2018 18:04:12 +0000 (19:04 +0100)]
s390/qeth: sanitize ARP requests
The ARP_{ADD,REMOVE}_ENTRY cmd structs contain reserved fields.
Introduce a common helper that doesn't raw-copy the user-provided data
into the cmd, but only sets those fields that are strictly needed for
the command.
This also sets the correct command length for ARP_REMOVE_ENTRY.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Fri, 2 Nov 2018 18:04:11 +0000 (19:04 +0100)]
s390/qeth: fix initial operstate
Setting the carrier 'on' for an unregistered netdevice doesn't update
its operstate. Fix this by delaying the update until the netdevice has
been registered.
Fixes: 91cc98f51e3d ("s390/qeth: remove duplicated carrier state tracking") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Fri, 2 Nov 2018 18:04:10 +0000 (19:04 +0100)]
s390/qeth: unregister netdevice only when registered
qeth only registers its netdevice when the qeth device is first set
online. Thus a device that has never been set online will trigger
a WARN ("network todo 'hsi%d' but state 0") in unregister_netdev() when
removed.
Fix this by protecting the unregister step, just like we already protect
against repeated registering of the netdevice.
Fixes: d3d1b205e89f ("s390/qeth: allocate netdevice early") Reported-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Fri, 2 Nov 2018 18:04:09 +0000 (19:04 +0100)]
s390/qeth: fix HiperSockets sniffer
Sniffing mode for L3 HiperSockets requires that no IP addresses are
registered with the HW. The preferred way to achieve this is for
userspace to delete all the IPs on the interface. But qeth is expected
to also tolerate a configuration where that is not the case, by skipping
the IP registration when in sniffer mode.
Since commit 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback")
reworked the IP registration logic in the L3 subdriver, this no longer
works. When the qeth device is set online, qeth_l3_recover_ip() now
unconditionally registers all unicast addresses from our internal
IP table.
While we could fix this particular problem by skipping
qeth_l3_recover_ip() on a sniffer device, the more future-proof change
is to skip the IP address registration at the lowest level. This way we
a) catch any future code path that attempts to register an IP address
without considering the sniffer scenario, and
b) continue to build up our internal IP table, so that if sniffer mode
is switched off later we can operate just like normal.
Fixes: 5f78e29ceebf ("qeth: optimize IP handling in rx_mode callback") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Fri, 2 Nov 2018 18:04:08 +0000 (19:04 +0100)]
s390/qeth: sanitize strings in debug messages
As Documentation/s390/s390dbf.txt states quite clearly, using any
pointer in sprinf-formatted s390dbf debug entries is dangerous.
The pointers are dereferenced whenever the trace file is read from.
So if the referenced data has a shorter life-time than the trace file,
any read operation can result in a use-after-free.
So rip out all hazardous use of indirect data, and replace any usage of
dev_name() and such by the Bus ID number.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 3 Nov 2018 17:35:52 +0000 (10:35 -0700)]
Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 9p fix from Al Viro:
"Regression fix for net/9p handling of iov_iter; broken by braino when
switching to iov_iter_is_kvec() et.al., spotted and fixed by Marc"
* 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
iov_iter: Fix 9p virtio breakage
Linus Torvalds [Sat, 3 Nov 2018 17:34:03 +0000 (10:34 -0700)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"This is a set of minor small (and safe changes) that didn't make the
initial pull request plus some bug fixes"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: mvsas: Remove set but not used variable 'id'
scsi: qla2xxx: Remove two arguments from qlafx00_error_entry()
scsi: qla2xxx: Make sure that qlafx00_ioctl_iosb_entry() initializes 'res'
scsi: qla2xxx: Remove a set-but-not-used variable
scsi: qla2xxx: Make qla2x00_sysfs_write_nvram() easier to analyze
scsi: qla2xxx: Declare local functions 'static'
scsi: qla2xxx: Improve several kernel-doc headers
scsi: qla2xxx: Modify fall-through annotations
scsi: 3w-sas: 3w-9xxx: Use unsigned char for cdb
scsi: mvsas: Use dma_pool_zalloc
scsi: target: Don't request modules that aren't even built
scsi: target: Set response length for REPORT TARGET PORT GROUPS
Linus Torvalds [Sat, 3 Nov 2018 17:21:43 +0000 (10:21 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
- more ocfs2 work
- various leftovers
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
memory_hotplug: cond_resched in __remove_pages
bfs: add sanity check at bfs_fill_super()
kernel/sysctl.c: remove duplicated include
kernel/kexec_file.c: remove some duplicated includes
mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
ocfs2: fix clusters leak in ocfs2_defrag_extent()
ocfs2: dlmglue: clean up timestamp handling
ocfs2: don't put and assigning null to bh allocated outside
ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry
ocfs2: don't use iocb when EIOCBQUEUED returns
ocfs2: without quota support, avoid calling quota recovery
ocfs2: remove ocfs2_is_o2cb_active()
mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
include/linux/notifier.h: SRCU: fix ctags
mm: handle no memcg case in memcg_kmem_charge() properly
It has been reported on an older (4.12) kernel but the current upstream
code doesn't cond_resched in the hot remove code at all and the given
range to remove might be really large. Fix the issue by calling
cond_resched once per memory section.
Link: http://lkml.kernel.org/r/20181031125840.23982-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 2 Nov 2018 22:48:42 +0000 (15:48 -0700)]
bfs: add sanity check at bfs_fill_super()
syzbot is reporting too large memory allocation at bfs_fill_super() [1].
Since file system image is corrupted such that bfs_sb->s_start == 0,
bfs_fill_super() is trying to allocate 8MB of continuous memory. Fix
this by adding a sanity check on bfs_sb->s_start, __GFP_NOWARN and
printf().
Michal Hocko [Fri, 2 Nov 2018 22:48:31 +0000 (15:48 -0700)]
mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask
THP allocation mode is quite complex and it depends on the defrag mode.
This complexity is hidden in alloc_hugepage_direct_gfpmask from a large
part currently. The NUMA special casing (namely __GFP_THISNODE) is
however independent and placed in alloc_pages_vma currently. This both
adds an unnecessary branch to all vma based page allocation requests and
it makes the code more complex unnecessarily as well. Not to mention
that e.g. shmem THP used to do the node reclaiming unconditionally
regardless of the defrag mode until recently. This was not only
unexpected behavior but it was also hardly a good default behavior and I
strongly suspect it was just a side effect of the code sharing more than
a deliberate decision which suggests that such a layering is wrong.
Get rid of the thp special casing from alloc_pages_vma and move the
logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
resulting gfp mask only when the direct reclaim is not requested and
when there is no explicit numa binding to preserve the current logic.
Please note that there's also a slight difference wrt MPOL_BIND now. The
previous code would avoid using __GFP_THISNODE if the local node was
outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
for all MPOL_BIND policies. So there's a difference that if local node
is actually allowed by the bind policy's nodemask, previously
__GFP_THISNODE would be added, but now it won't be. From the behavior
POV this is still correct because the policy nodemask is used.
Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Larry Chen [Fri, 2 Nov 2018 22:48:27 +0000 (15:48 -0700)]
ocfs2: fix clusters leak in ocfs2_defrag_extent()
ocfs2_defrag_extent() might leak allocated clusters. When the file
system has insufficient space, the number of claimed clusters might be
less than the caller wants. If that happens, the original code might
directly commit the transaction without returning clusters.
This patch is based on code in ocfs2_add_clusters_in_btree().
[akpm@linux-foundation.org: include localalloc.h, reduce scope of data_ac] Link: http://lkml.kernel.org/r/20180904041621.16874-3-lchen@suse.com Signed-off-by: Larry Chen <lchen@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Fri, 2 Nov 2018 22:48:23 +0000 (15:48 -0700)]
ocfs2: dlmglue: clean up timestamp handling
The handling of timestamps outside of the 1970..2038 range in the dlm
glue is rather inconsistent: on 32-bit architectures, this has always
wrapped around to negative timestamps in the 1902..1969 range, while on
64-bit kernels all timestamps are interpreted as positive 34 bit numbers
in the 1970..2514 year range.
Now that the VFS code handles 64-bit timestamps on all architectures, we
can make the behavior more consistent here, and return the same result
that we had on 64-bit already, making the file system y2038 safe in the
process. Outside of dlmglue, it already uses 64-bit on-disk timestamps
anway, so that part is fine.
For consistency, I'm changing ocfs2_pack_timespec() to clamp anything
outside of the supported range to the minimum and maximum values. This
avoids a possible ambiguity of values before 1970 in particular, which
used to be interpreted as times at the end of the 2514 range previously.
Link: http://lkml.kernel.org/r/20180619155826.4106487-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Fri, 2 Nov 2018 22:48:19 +0000 (15:48 -0700)]
ocfs2: don't put and assigning null to bh allocated outside
ocfs2_read_blocks() and ocfs2_read_blocks_sync() are both used to read
several blocks from disk. Currently, the input argument *bhs* can be
NULL or NOT. It depends on the caller's behavior. If the function
fails in reading blocks from disk, the corresponding bh will be assigned
to NULL and put.
Obviously, above process for non-NULL input bh is not appropriate.
Because the caller doesn't even know its bhs are put and re-assigned.
If buffer head is managed by caller, ocfs2_read_blocks and
ocfs2_read_blocks_sync() should not evaluate it to NULL. It will cause
caller accessing illegal memory, thus crash.
Link: http://lkml.kernel.org/r/HK2PR06MB045285E0F4FBB561F9F2F9B3D5680@HK2PR06MB0452.apcprd06.prod.outlook.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Guozhonghua <guozhonghua@h3c.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Fri, 2 Nov 2018 22:48:15 +0000 (15:48 -0700)]
ocfs2: fix a misuse a of brelse after failing ocfs2_check_dir_entry
Somehow, file system metadata was corrupted, which causes
ocfs2_check_dir_entry() to fail in function ocfs2_dir_foreach_blk_el().
According to the original design intention, if above happens we should
skip the problematic block and continue to retrieve dir entry. But
there is obviouse misuse of brelse around related code.
After failure of ocfs2_check_dir_entry(), current code just moves to
next position and uses the problematic buffer head again and again
during which the problematic buffer head is released for multiple times.
I suppose, this a serious issue which is long-lived in ocfs2. This may
cause other file systems which is also used in a the same host insane.
So we should also consider about bakcporting this patch into linux
-stable.
Link: http://lkml.kernel.org/r/HK2PR06MB045211675B43EED794E597B6D56E0@HK2PR06MB0452.apcprd06.prod.outlook.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Suggested-by: Changkuo Shi <shi.changkuo@h3c.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Fri, 2 Nov 2018 22:48:11 +0000 (15:48 -0700)]
ocfs2: don't use iocb when EIOCBQUEUED returns
When -EIOCBQUEUED returns, it means that aio_complete() will be called
from dio_complete(), which is an asynchronous progress against
write_iter. Generally, IO is a very slow progress than executing
instruction, but we still can't take the risk to access a freed iocb.
And we do face a BUG crash issue. Using the crash tool, iocb is
obviously freed already.
And the backtrace shows:
ocfs2_file_write_iter+0xcaa/0xd00 [ocfs2]
aio_run_iocb+0x229/0x2f0
do_io_submit+0x291/0x540
SyS_io_submit+0x10/0x20
system_call_fastpath+0x16/0x75
Link: http://lkml.kernel.org/r/1523361653-14439-1-git-send-email-ge.changwei@h3c.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guozhonghua [Fri, 2 Nov 2018 22:48:07 +0000 (15:48 -0700)]
ocfs2: without quota support, avoid calling quota recovery
During one dead node's recovery by other node, quota recovery work will
be queued. We should avoid calling quota when it is not supported, so
check the quota flags.
Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA401071AC9FB@H3CMLB12-EX.srv.huawei-3com.com Signed-off-by: guozhonghua <guozhonghua@h3c.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gang He [Fri, 2 Nov 2018 22:48:03 +0000 (15:48 -0700)]
ocfs2: remove ocfs2_is_o2cb_active()
Remove ocfs2_is_o2cb_active(). We have similar functions to identify
which cluster stack is being used via osb->osb_cluster_stack.
Secondly, the current implementation of ocfs2_is_o2cb_active() is not
totally safe. Based on the design of stackglue, we need to get
ocfs2_stack_lock before using ocfs2_stack related data structures, and
that active_stack pointer can be NULL in the case of mount failure.
Link: http://lkml.kernel.org/r/1495441079-11708-1-git-send-email-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Reviewed-by: Eric Ren <zren@suse.com> Acked-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
THP allocation might be really disruptive when allocated on NUMA system
with the local node full or hard to reclaim. Stefan has posted an
allocation stall report on 4.12 based SLES kernel which suggests the
same issue:
The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.
Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:
: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).
Fix this by removing __GFP_THISNODE for THP requests which are
requesting the direct reclaim. This effectivelly reverts 5265047ac301
on the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free. While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases. The existing behaviour
is similar, if not as widespare as it applies to a corner case but
crucially, it cannot be tuned around like zone_reclaim_mode can. The
default behaviour should always be to cause the least harm for the
common case.
If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top. Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a
: Both patches look correct to me but I'm responding to this one because
: it's the fix. The change makes sense and moves further away from the
: severe stalling behaviour we used to see with both THP and zone reclaim
: mode.
:
: I put together a basic experiment with usemem configured to reference a
: buffer multiple times that is 80% the size of main memory on a 2-socket
: box with symmetric node sizes and defrag set to "always". The defrag
: setting is not the default but it would be functionally similar to
: accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
: reference the buffer multiple times and while it's not an interesting
: workload, it would be expected to complete reasonably quickly as it fits
: within memory. The results were;
:
: usemem
: vanilla noreclaim-v1
: Amean Elapsd-1 42.78 ( 0.00%) 26.87 ( 37.18%)
: Amean Elapsd-3 27.55 ( 0.00%) 7.44 ( 73.00%)
: Amean Elapsd-4 5.72 ( 0.00%) 5.69 ( 0.45%)
:
: This shows the elapsed time in seconds for 1 thread, 3 threads and 4
: threads referencing buffers 80% the size of memory. With the patches
: applied, it's 37.18% faster for the single thread and 73% faster with two
: threads. Note that 4 threads showing little difference does not indicate
: the problem is related to thread counts. It's simply the case that 4
: threads gets spread so their workload mostly fits in one node.
:
: The overall view from /proc/vmstats is more startling
:
: 4.19.0-rc1 4.19.0-rc1
: vanillanoreclaim-v1r1
: Minor Faults 35593425 708164
: Major Faults 484088 36
: Swap Ins 3772837 0
: Swap Outs 3932295 0
:
: Massive amounts of swap in/out without the patch
:
: Direct pages scanned 6013214 0
: Kswapd pages scanned 0 0
: Kswapd pages reclaimed 0 0
: Direct pages reclaimed 4033009 0
:
: Lots of reclaim activity without the patch
:
: Kswapd efficiency 100% 100%
: Kswapd velocity 0.000 0.000
: Direct efficiency 67% 100%
: Direct velocity 11191.956 0.000
:
: Mostly from direct reclaim context as you'd expect without the patch.
:
: Page writes by reclaim 3932314.000 0.000
: Page writes file 19 0
: Page writes anon 3932295 0
: Page reclaim immediate 42336 0
:
: Writes from reclaim context is never good but the patch eliminates it.
:
: We should never have default behaviour to thrash the system for such a
: basic workload. If zone reclaim mode behaviour is ever desired but on a
: single task instead of a global basis then the sensible option is to build
: a mempolicy that enforces that behaviour.
This was a severe regression compared to previous kernels that made
important workloads unusable and it starts when __GFP_THISNODE was
added to THP allocations under MADV_HUGEPAGE. It is not a significant
risk to go to the previous behavior before __GFP_THISNODE was added, it
worked like that for years.
This was simply an optimization to some lucky workloads that can fit in
a single node, but it ended up breaking the VM for others that can't
possibly fit in a single node, so going back is safe.
[mhocko@suse.com: rewrote the changelog based on the one from Andrea] Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Debugged-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>