- Add new vxlan_fdb_alloc helper
- rename existing vxlan_fdb_create into vxlan_fdb_update:
because it really creates or updates an existing
fdb entry
- move new fdb creation into a separate vxlan_fdb_create
Main motivation for this change is to introduce the ability
to decouple vxlan fdb creation and notify, used in a later patch.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
rtnetlink: add rtnl_link_state check in rtnl_configure_link
rtnl_configure_link sets dev->rtnl_link_state to
RTNL_LINK_INITIALIZED and unconditionally calls
__dev_notify_flags to notify user-space of dev flags.
current call sequence for rtnl_configure_link
rtnetlink_newlink
rtnl_link_ops->newlink
rtnl_configure_link (unconditionally notifies userspace of
default and new dev flags)
If a newlink handler wants to call rtnl_configure_link
early, we will end up with duplicate notifications to
user-space.
This patch fixes rtnl_configure_link to check rtnl_link_state
and call __dev_notify_flags with gchanges = 0 if already
RTNL_LINK_INITIALIZED.
Later in the series, this patch will help the following sequence
where a driver implementing newlink can call rtnl_configure_link
to initialize the link early.
makes the following call sequence work:
rtnetlink_newlink
rtnl_link_ops->newlink (vxlan) -> rtnl_configure_link (initializes
link and notifies
user-space of default
dev flags)
rtnl_configure_link (updates dev flags if requested by user ifm
and notifies user-space of new dev flags)
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The bad access is in neigh_hh_output(), at skb->data - 16 (HH_DATA_MOD).
atl1c driver provided skb with no headroom, so 14 bytes (ethernet
header) got pulled, but then 16 are copied.
Reserve NET_SKB_PAD bytes headroom, like netdev_alloc_skb().
Compile tested only; I lack hardware.
Fixes: 7b7017642199 ("atl1c: Fix misuse of netdev_alloc_skb in refilling rx ring") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Fri, 20 Jul 2018 06:04:27 +0000 (14:04 +0800)]
multicast: do not restore deleted record source filter mode to new one
There are two scenarios that we will restore deleted records. The first is
when device down and up(or unmap/remap). In this scenario the new filter
mode is same with previous one. Because we get it from in_dev->mc_list and
we do not touch it during device down and up.
The other scenario is when a new socket join a group which was just delete
and not finish sending status reports. In this scenario, we should use the
current filter mode instead of restore old one. Here are 4 cases in total.
Fixes: 24803f38a5c0b (igmp: do not remove igmp souce list info when set link down) Fixes: 1666d49e1d416 (mld: do not remove mld souce list info when set link down) Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: mv88e6xxx: fix races between lock and irq freeing
free_irq() waits until all handlers for this IRQ have completed. As the
relevant handler (mv88e6xxx_g1_irq_thread_fn()) takes the chip's reg_lock
it might never return if the thread calling free_irq() holds this lock.
For the same reason kthread_cancel_delayed_work_sync() in the polling case
must not hold this lock.
Also first free the irq (or stop the worker respectively) such that
mv88e6xxx_g1_irq_thread_work() isn't called any more before the irq
mappings are dropped in mv88e6xxx_g1_irq_free_common() to prevent the
worker thread to call handle_nested_irq(0) which results in a NULL-pointer
exception.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: ddff00d42043 ("net: Move skb_has_shared_frag check out of GRE code and into segmentation") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 19 Jul 2018 19:41:18 +0000 (12:41 -0700)]
net/ipv6: Fix linklocal to global address with VRF
Example setup:
host: ip -6 addr add dev eth1 2001:db8:104::4
where eth1 is enslaved to a VRF
switch: ip -6 ro add 2001:db8:104::4/128 dev br1
where br1 only has an LLA
ping6 2001:db8:104::4
ssh 2001:db8:104::4
(NOTE: UDP works fine if the PKTINFO has the address set to the global
address and ifindex is set to the index of eth1 with a destination an
LLA).
For ICMP, icmp6_iif needs to be updated to check if skb->dev is an
L3 master. If it is then return the ifindex from rt6i_idev similar
to what is done for loopback.
For TCP, restore the original tcp_v6_iif definition which is needed in
most places and add a new tcp_v6_iif_l3_slave that considers the
l3_slave variability. This latter check is only needed for socket
lookups.
Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fix following warning:
net/ipv4/bpfilter/sockopt.c:28:5: error: symbol 'bpfilter_ip_set_sockopt' redeclared with different type
net/ipv4/bpfilter/sockopt.c:34:5: error: symbol 'bpfilter_ip_get_sockopt' redeclared with different type
Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: consider PHY_IGNORE_INTERRUPT in phy_start_aneg_priv
The situation described in the comment can occur also with
PHY_IGNORE_INTERRUPT, therefore change the condition to include it.
Fixes: f555f34fdc58 ("net: phy: fix auto-negotiation stall due to unavailable interrupt") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
There's a possible race where driver can read link status in mid-transition
and see that virtual-link is up yet speed is 0. Since in this
mid-transition we're guaranteed to see a mailbox from MFW soon, we can
afford to treat this as link down.
Fixes: cc875c2e ("qed: Add link support") Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
qed: Fix link flap issue due to mismatching EEE capabilities.
Apparently, MFW publishes EEE capabilities even for Fiber-boards that don't
support them, and later since qed internally sets adv_caps it would cause
link-flap avoidance (LFA) to fail when driver would initiate the link.
This in turn delays the link, causing traffic to fail.
Driver has been modified to not to ask MFW for any EEE config if EEE isn't
to be enabled.
Fixes: 645874e5 ("qed: Add support for Energy efficient ethernet.") Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: caif: Add a missing rcu_read_unlock() in caif_flow_cb
Add a missing rcu_read_unlock in the error path
Fixes: c95567c80352 ("caif: added check for potential null return") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jarod Wilson [Wed, 18 Jul 2018 18:49:36 +0000 (14:49 -0400)]
bonding: set default miimon value for non-arp modes if not set
For some time now, if you load the bonding driver and configure bond
parameters via sysfs using minimal config options, such as specifying
nothing but the mode, relying on defaults for everything else, modes
that cannot use arp monitoring (802.3ad, balance-tlb, balance-alb) all
wind up with both arp_interval=0 (as it should be) and miimon=0, which
means the miimon monitor thread never actually runs. This is particularly
problematic for 802.3ad.
For example, from an LNST recipe I've set up:
$ modprobe bonding max_bonds=0"
$ echo "+t_bond0" > /sys/class/net/bonding_masters"
$ ip link set t_bond0 down"
$ echo "802.3ad" > /sys/class/net/t_bond0/bonding/mode"
$ ip link set ens1f1 down"
$ echo "+ens1f1" > /sys/class/net/t_bond0/bonding/slaves"
$ ip link set ens1f0 down"
$ echo "+ens1f0" > /sys/class/net/t_bond0/bonding/slaves"
$ ethtool -i t_bond0"
$ ip link set ens1f1 up"
$ ip link set ens1f0 up"
$ ip link set t_bond0 up"
$ ip addr add 192.168.9.1/24 dev t_bond0"
$ ip addr add 2002::1/64 dev t_bond0"
This bond comes up okay, but things look slightly suspect in
/proc/net/bonding/t_bond0 output:
$ grep -i mii /proc/net/bonding/t_bond0
MII Status: up
MII Polling Interval (ms): 0
MII Status: up
MII Status: up
Now, pull a cable on one of the ports in the bond, then reconnect it, and
you'll see:
Slave Interface: ens1f0
MII Status: down
Speed: 1000 Mbps
Duplex: full
I believe this became a major issue as of commit 4d2c0cda0744, which for
802.3ad bonds, sets slave->link = BOND_LINK_DOWN, with a comment about
relying on link monitoring via miimon to set it correctly, but since the
miimon work queue never runs, the link just stays marked down.
If we simply tweak bond_option_mode_set() slightly, we can check for the
non-arp modes having no miimon value set, and insert BOND_DEFAULT_MIIMON,
which gets things back in full working order. This problem exists as far
back as 4.14, and might be worth fixing in all stable trees since, though
the work-around is to simply specify an miimon value yourself.
Reported-by: Bob Ball <ball@umich.edu> Signed-off-by: Jarod Wilson <jarod@redhat.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following series provides fixes to mlx5 core and net device driver.
Please pull and let me know if there's any problem.
For -stable v4.7
net/mlx5e: Don't allow aRFS for encapsulated packets
net/mlx5e: Fix quota counting in aRFS expire flow
For -stable v4.15
net/mlx5e: Only allow offloading decap egress (egdev) flows
net/mlx5e: Refine ets validation function
net/mlx5: Adjust clock overflow work period
For -stable v4.17
net/mlx5: E-Switch, UBSAN fix undefined behavior in mlx5_eswitch_mode
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix in BPF Makefile to detect llvm-objcopy in a more robust way which is
needed for pahole's BTF converter and minor UAPI tweaks in BTF_INT_BITS()
to shrink the mask before eventual UAPI freeze, from Martin.
2) Fix a segfault in bpftool when prog pin id has no further arguments such
as id value or file specified, from Taeung.
3) Fix powerpc JIT handling of XADD which has jumps to exit path that would
potentially bypass verifier expectations e.g. with subprog calls. Also add
a test case to make sure XADD is not mangling src/dst register, from Daniel.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code does not check sk->sk_shutdown & RCV_SHUTDOWN.
tls_sw_recvmsg may return a positive value in the case where bytes have
already been copied when the socket is shutdown. sk->sk_err has been
cleared, causing the tls_wait_data to hang forever on a subsequent
invocation. Checking sk->sk_shutdown & RCV_SHUTDOWN, as in tcp_recvmsg,
fixes this problem.
Fixes: c46234ebb4d1 ("tls: RX path for ktls") Acked-by: Dave Watson <davejwatson@fb.com> Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 20 Jul 2018 21:32:24 +0000 (14:32 -0700)]
Merge branch 'tcp-fix-DCTCP-ECE-Ack-series'
Yuchung Cheng says:
====================
fix DCTCP ECE Ack series
This patch set address that the existing DCTCP implementation does not
fully implement the ACK policy specified in the RFC. This improves
the responsiveness of CE status change particularly on flows with
small inflight.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
+0.000 > [ect01] . 2:2(0) ack 2001
// Previously the ACK below would be delayed by 40ms
+0.000 > [ect01] E. 2:2(0) ack 3001
+0.500 < F. 9501:9501(0) ack 4 win 257
Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently when a DCTCP receiver delays an ACK and receive a
data packet with a different CE mark from the previous one's, it
sends two immediate ACKs acking previous and latest sequences
respectly (for ECN accounting).
Previously sending the first ACK may mark off the delayed ACK timer
(tcp_event_ack_sent). This may subsequently prevent sending the
second ACK to acknowledge the latest sequence (tcp_ack_snd_check).
The culprit is that tcp_send_ack() assumes it always acknowleges
the latest sequence, which is not true for the first special ACK.
The fix is to not make the assumption in tcp_send_ack and check the
actual ack sequence before cancelling the delayed ACK. Further it's
safer to pass the ack sequence number as a local variable into
tcp_send_ack routine, instead of intercepting tp->rcv_nxt to avoid
future bugs like this.
Reported-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor and create helpers to send the special ACK in DCTCP.
Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 20 Jul 2018 05:34:10 +0000 (22:34 -0700)]
bpf: Use option "help" in the llvm-objcopy test
I noticed the "--version" option of the llvm-objcopy command has recently
disappeared from the master llvm branch. It is currently used as a BTF
support test in tools/testing/selftests/bpf/Makefile.
This patch replaces it with "--help" which should be
less error prone in the future.
Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Martin KaFai Lau [Fri, 20 Jul 2018 05:14:31 +0000 (22:14 -0700)]
bpf: btf: Clean up BTF_INT_BITS() in uapi btf.h
This patch shrinks the BTF_INT_BITS() mask. The current
btf_int_check_meta() ensures the nr_bits of an integer
cannot exceed 64. Hence, it is mostly an uapi cleanup.
The actual btf usage (i.e. seq_show()) is also modified
to use u8 instead of u16. The verification (e.g. btf_int_check_meta())
path stays as is to deal with invalid BTF situation.
Fixes: 69b693f0aefa ("bpf: btf: Introduce BPF Type Format (BTF)") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Taeung Song [Thu, 19 Jul 2018 12:10:05 +0000 (21:10 +0900)]
tools/bpftool: Fix segfault case regarding 'pin' arguments
Arguments of 'pin' subcommand should be checked
at the very beginning of do_pin_any().
Otherwise segfault errors can occur when using
'map pin' or 'prog pin' commands, so fix it.
# bpftool prog pin id
Segmentation fault
Fixes: 71bb428fe2c1 ("tools: bpf: add bpftool") Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reported-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Taeung Song <treeze.taeung@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
net-next/hinic: fix a problem in hinic_xmit_frame()
The calculation of "wqe_size" is not correct when the tx queue is busy in
hinic_xmit_frame().
When there are no free WQEs, the tx flow will unmap the skb buffer, then
ring the doobell for the pending packets. But the "wqe_size" which used
to calculate the doorbell address is not correct. The wqe size should be
cleared to 0, otherwise, it will cause a doorbell error.
This patch fixes the problem.
Reported-by: Zhou Wang <wangzhou1@hisilicon.com> Signed-off-by: Zhao Chen <zhaochen6@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
This set adds a ppc64 JIT fix for xadd as well as a missing test
case for verifying whether xadd messes with src/dst reg. Thanks!
====================
Daniel Borkmann [Thu, 19 Jul 2018 16:18:36 +0000 (18:18 +0200)]
bpf: test case to check whether src/dst regs got mangled by xadd
We currently do not have such a test case in test_verifier selftests
but it's important to test under bpf_jit_enable=1 to make sure JIT
implementations do not mistakenly mess with src/dst reg for xadd/{w,dw}.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
None of the JITs is allowed to implement exit paths from the BPF
insn mappings other than BPF_JMP | BPF_EXIT. In the BPF core code
we have a couple of rewrites in eBPF (e.g. LD_ABS / LD_IND) and
in eBPF to cBPF translation to retain old existing behavior where
exceptions may occur; they are also tightly controlled by the
verifier where it disallows some of the features such as BPF to
BPF calls when legacy LD_ABS / LD_IND ops are present in the BPF
program. During recent review of all BPF_XADD JIT implementations
I noticed that the ppc64 one is buggy in that it contains two
jumps to exit paths. This is problematic as this can bypass verifier
expectations e.g. pointed out in commit f6b1b3bf0d5f ("bpf: fix
subprog verifier bypass by div/mod by 0 exception"). The first
exit path is obsoleted by the fix in ca36960211eb ("bpf: allow xadd
only on aligned memory") anyway, and for the second one we need to
do a fetch, add and store loop if the reservation from lwarx/ldarx
was lost in the meantime.
Fixes: 156d0e290e96 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF") Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Reviewed-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Tested-by: Sandipan Das <sandipan@linux.vnet.ibm.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Merge tag 'sound-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A rawmidi race fix and three trivial HD-audio quirks"
* tag 'sound-4.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - Yet another Clevo P950 quirk entry
ALSA: rawmidi: Change resized buffers atomically
ALSA: hda/realtek - Add Panasonic CF-SZ6 headset jack quirk
ALSA: hda: add mute led support for HP ProBook 455 G5
Roi Dayan [Thu, 12 Jul 2018 15:25:59 +0000 (18:25 +0300)]
net/mlx5e: Only allow offloading decap egress (egdev) flows
We get egress rules through the egdev mechanism when the ingress device
is not supporting offload, with the expected use-case of tunnel decap
ingress rule set on shared tunnel device.
Make sure to offload egress/egdev rules only if decap action (tunnel key
unset) exists there and err otherwise.
Fixes: 717503b9cf57 ("net: sched: convert cls_flower->egress_dev users to tc_setup_cb_egdev infra") Signed-off-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Fix bad alignment of SQ buffer in fragmented QP allocation.
It should start directly after RQ buffer ends.
Take special care of the end case where the RQ buffer does not occupy
a whole page. RQ size is a power of two, so would be the case only for
small RQ sizes (RQ size < PAGE_SIZE).
Fix wrong assignments for sqb->size (mistakenly assigned RQ size),
and for npages value of RQ and SQ.
Fixes: 3a2f70331226 ("net/mlx5: Use order-0 allocations for all WQ types") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Raed Salem [Tue, 3 Jul 2018 07:10:38 +0000 (10:10 +0300)]
net/mlx5: Fix 'DON'T_TRAP' functionality
The flow counters binding support commit introduced a code change where
none NULL 'rule_dest' is always passed to mlx5_add_flow_rules, this breaks
'DON'T_TRAP' rules insertion.
The fix uses the equivalent 'dest_num' value instead of dest pointer
at the failed check.
net/mlx5: E-Switch, UBSAN fix undefined behavior in mlx5_eswitch_mode
With debug kernel UBSAN detects the following issue, which might happen
when eswitch instance is not created, fix this by testing the eswitch
pointer before returning the eswitch mode, if not set return mode =
SRIOV_NONE.
Fixes: 57cbd893c4c5 ("net/mlx5: E-Switch, Move representors definition to a global scope") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Ariel Levkovich [Mon, 25 Jun 2018 16:12:02 +0000 (19:12 +0300)]
net/mlx5: Adjust clock overflow work period
When driver converts HW timestamp to wall clock time it subtracts
the last saved cycle counter from the HW timestamp and converts the
difference to nanoseconds.
The conversion is done by multiplying the cycles difference with the
clock multiplier value as a first step and therefore the cycles
difference should be small enough so that the multiplication product
doesn't exceed 64bit.
The overflow handling routine is in charge of updating the last saved
cycle counter in driver and it is called periodically using kernel
delayed workqueue.
The delay period for this work is calculated using the max HW cycle
counter value (a 41 bit mask) as a base which doesn't take the 64bit
limit into account so the delay period may be incorrect and too
long to prevent a large difference between the HW counter and the last
saved counter in SW.
This change adjusts the work period for the HW clock overflow work by
taking the minimum between the previous value and the quotient of max
u64 value and the clock multiplier value.
Shay Agroskin [Wed, 27 Jun 2018 12:43:07 +0000 (15:43 +0300)]
net/mlx5e: Refine ets validation function
Removed an error message received when configuring ETS total
bandwidth to be zero.
Our hardware doesn't support such configuration, so we shall
reject it in the driver. Nevertheless, we removed the error message
in order to eliminate error messages caused by old userspace tools
who try to pass such configuration.
Randy Dunlap [Wed, 18 Jul 2018 01:27:45 +0000 (18:27 -0700)]
tcp: identify cryptic messages as TCP seq # bugs
Attempt to make cryptic TCP seq number error messages clearer by
(1) identifying the source of the message as "TCP", (2) identifying the
errors as "seq # bug", and (3) grouping the field identifiers and values
by separating them with commas.
Suggested-by: 積丹尼 Dan Jacobson <jidanni@jidanni.org> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
It seems that a *break* is missing in order to avoid falling through
to the default case. Otherwise, checking *chan* makes no sense.
Fixes: 72df7a7244c0 ("ptp: Allow reassigning calibration pin function") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
hv_netvsc: Fix napi reschedule while receive completion is busy
If out ring is full temporarily and receive completion cannot go out,
we may still need to reschedule napi if certain conditions are met.
Otherwise the napi poll might be stopped forever, and cause network
disconnect.
Fixes: 7426b1a51803 ("netvsc: optimize receive completions") Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The Vitaly Bordug's email bounces ("ru.mvista.com: Name or service not
known") and there was no activity (ack, review, sign) since 2009.
Cc: Vitaly Bordug <vitb@kernel.crashing.org> Cc: Pantelis Antoniou <pantelis.antoniou@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: cavium: Add fine-granular dependencies on PCI
Add dependencies on PCI where necessary.
Fixes: 7e2bc7fb65 ("net: cavium: Drop dependency of NET_VENDOR_CAVIUM on PCI") Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Wahren [Wed, 18 Jul 2018 06:31:44 +0000 (08:31 +0200)]
net: qca_spi: Make sure the QCA7000 reset is triggered
In case the SPI thread is not running, a simple reset of sync
state won't fix the transmit timeout. We also need to wake up the kernel
thread.
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com> Fixes: ed7d42e24eff ("net: qca_spi: fix transmit queue timeout handling") Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Wahren [Wed, 18 Jul 2018 06:31:43 +0000 (08:31 +0200)]
net: qca_spi: Avoid packet drop during initial sync
As long as the synchronization with the QCA7000 isn't finished, we
cannot accept packets from the upper layers. So let the SPI thread
enable the TX queue after sync and avoid unwanted packet drop.
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com> Fixes: 291ab06ecf67 ("net: qualcomm: new Ethernet over SPI driver for QCA7000") Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 17 Jul 2018 16:12:39 +0000 (17:12 +0100)]
ipv6: fix useless rol32 call on hash
The rol32 call is currently rotating hash but the rol'd value is
being discarded. I believe the current code is incorrect and hash
should be assigned the rotated value returned from rol32.
Thanks to David Lebrun for spotting this.
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 17 Jul 2018 15:52:54 +0000 (16:52 +0100)]
ipv6: sr: fix useless rol32 call on hash
The rol32 call is currently rotating hash but the rol'd value is
being discarded. I believe the current code is incorrect and hash
should be assigned the rotated value returned from rol32.
Detected by CoverityScan, CID#1468411 ("Useless call")
Fixes: b5facfdba14c ("ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: dlebrun@google.com Signed-off-by: David S. Miller <davem@davemloft.net>
net: usb: asix: replace mii_nway_restart in resume path
mii_nway_restart is not pm aware which results in a rtnl deadlock.
Implement mii_nway_restart manual by setting BMCR_ANRESTART if
BMCR_ANENABLE is set.
To reproduce:
* plug an asix based usb network interface
* wait until the device enters PM (~5 sec)
* `ip link set eth1 up` will never return
Fixes: d9fe64e51114 ("net: asix: Add in_pm parameter") Signed-off-by: Alexander Couzens <lynxis@fe80.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
Fix this by sanitizing t.qset_idx before using it to index
adapter->msix_info
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
lib/rhashtable: consider param->min_size when setting initial table size
rhashtable_init() currently does not take into account the user-passed
min_size parameter unless param->nelem_hint is set as well. As such,
the default size (number of buckets) will always be HASH_DEFAULT_SIZE
even if the smallest allowed size is larger than that. Remediate this
by unconditionally calling into rounded_hashtable_size() and handling
things accordingly.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Three regression fixes. They're few-liners and fixing some corner
cases missed in the origial patches"
* tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
btrfs: fix use-after-free of cmp workspace pages
btrfs: restore uuid_mutex in btrfs_open_devices
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Miscellaneous bugfixes, plus a small patchlet related to Spectre v2"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvmclock: fix TSC calibration for nested guests
KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled
KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.
x86/kvmclock: set pvti_cpu0_va after enabling kvmclock
x86/kvm/Kconfig: Ensure CRYPTO_DEV_CCP_DD state at minimum matches KVM_AMD
kvm: nVMX: Restore exit qual for VM-entry failure due to MSR loading
x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks
KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR
David S. Miller [Wed, 18 Jul 2018 17:58:27 +0000 (10:58 -0700)]
Merge branch 'smc-fixes'
Ursula Braun says:
====================
net/smc: fixes 2018-07-18
here are small fixes for SMC: The first patch speeds up unidirectional
traffic, the second patch increases security, and the third patch
fixes a problem for fallback cases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
During clc handshake the receive timeout is set to CLC_WAIT_TIME.
Remember and reset the original timeout value after the receive calls,
and remove a duplicate assignment of CLC_WAIT_TIME.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
For security reasons the return code of get_user() should always be
checked.
Fixes: 01d2f7e2cdd31 ("net/smc: sockopts TCP_NODELAY and TCP_CORK") Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The SMC protocol requires to send a separate consumer cursor update,
if it cannot be piggybacked to updates of the producer cursor.
Currently the decision to send a separate consumer cursor update
just considers the amount of data already received by the socket
program. It does not consider the amount of data already arrived, but
not yet consumed by the receiver. Basing the decision on the
difference between already confirmed and already arrived data
(instead of difference between already confirmed and already consumed
data), may lead to a somewhat earlier consumer cursor update send in
fast unidirectional traffic scenarios, and thus to better throughput.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Suggested-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/nfc: Avoid stalls when nfc_alloc_send_skb() returned NULL.
syzbot is reporting stalls at nfc_llcp_send_ui_frame() [1]. This is
because nfc_llcp_send_ui_frame() is retrying the loop without any delay
when nonblocking nfc_alloc_send_skb() returned NULL.
Since there is no need to use MSG_DONTWAIT if we retry until
sock_alloc_send_pskb() succeeds, let's use blocking call.
Also, in case an unexpected error occurred, let's break the loop
if blocking nfc_alloc_send_skb() failed.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+d29d18215e477cfbfbdd@syzkaller.appspotmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We almost never run into this by accident because randconfig builds
end up selecting DST_CACHE from some other tunnel protocol, and this
one appears to be the only one missing the explicit 'select'.
>From all I can tell, this problem first appeared in linux-4.9
when dst_cache support got added to ILA.
Fixes: 79ff2fc31e0f ("ila: Cache a route to translated address") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Inside a nested guest, access to hardware can be slow enough that
tsc_read_refs always return ULLONG_MAX, causing tsc_refine_calibration_work
to be called periodically and the nested guest to spend a lot of time
reading the ACPI timer.
However, if the TSC frequency is available from the pvclock page,
we can just set X86_FEATURE_TSC_KNOWN_FREQ and avoid the recalibration.
'refine' operation.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
[Commit message rewritten. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liran Alon [Fri, 29 Jun 2018 19:59:04 +0000 (22:59 +0300)]
KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled
When eVMCS is enabled, all VMCS allocated to be used by KVM are marked
with revision_id of KVM_EVMCS_VERSION instead of revision_id reported
by MSR_IA32_VMX_BASIC.
However, even though not explictly documented by TLFS, VMXArea passed
as VMXON argument should still be marked with revision_id reported by
physical CPU.
This issue was found by the following setup:
* L0 = KVM which expose eVMCS to it's L1 guest.
* L1 = KVM which consume eVMCS reported by L0.
This setup caused the following to occur:
1) L1 execute hardware_enable().
2) hardware_enable() calls kvm_cpu_vmxon() to execute VMXON.
3) L0 intercept L1 VMXON and execute handle_vmon() which notes
vmxarea->revision_id != VMCS12_REVISION and therefore fails with
nested_vmx_failInvalid() which sets RFLAGS.CF.
4) L1 kvm_cpu_vmxon() don't check RFLAGS.CF for failure and therefore
hardware_enable() continues as usual.
5) L1 hardware_enable() then calls ept_sync_global() which executes
INVEPT.
6) L0 intercept INVEPT and execute handle_invept() which notes
!vmx->nested.vmxon and thus raise a #UD to L1.
7) Raised #UD caused L1 to panic.
Paolo Bonzini [Mon, 28 May 2018 11:31:13 +0000 (13:31 +0200)]
KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
A comment warning against this bug is there, but the code is not doing what
the comment says. Therefore it is possible that an EPOLLHUP races against
irq_bypass_register_consumer. The EPOLLHUP handler schedules irqfd_shutdown,
and if that runs soon enough, you get a use-after-free.
Reported-by: syzbot <syzkaller@googlegroups.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com>
Lan Tianyu [Fri, 22 Dec 2017 02:10:36 +0000 (21:10 -0500)]
KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.
Syzbot reports crashes in kvm_irqfd_assign(), caused by use-after-free
when kvm_irqfd_assign() and kvm_irqfd_deassign() run in parallel
for one specific eventfd. When the assign path hasn't finished but irqfd
has been added to kvm->irqfds.items list, another thead may deassign the
eventfd and free struct kvm_kernel_irqfd(). The assign path then uses
the struct kvm_kernel_irqfd that has been freed by deassign path. To avoid
such issue, keep irqfd under kvm->irq_srcu protection after the irqfd
has been added to kvm->irqfds.items list, and call synchronize_srcu()
in irq_shutdown() to make sure that irqfd has been fully initialized in
the assign path.
Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Tianyu Lan <tianyu.lan@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
octeon_mgmt: Fix MIX registers configuration on MTU setup
octeon_mgmt driver doesn't drop RX frames that are 1-4 bytes bigger than
MTU set for the corresponding interface. The problem is in the
AGL_GMX_RX0/1_FRM_MAX register setting, which should not account for VLAN
tagging.
According to Octeon HW manual:
"For tagged frames, MAX increases by four bytes for each VLAN found up to a
maximum of two VLANs, or MAX + 8 bytes."
OCTEON_FRAME_HEADER_LEN "define" is fine for ring buffer management, but
should not be used for AGL_GMX_RX0/1_FRM_MAX.
The problem could be easily reproduced using "ping" command. If affected
system has default MTU 1500, other host (having MTU >= 1504) can
successfully "ping" the affected system with payload size 1473-1476,
resulting in IP packets of size 1501-1504 accepted by the mgmt driver.
Fixed system still accepts IP packets of 1500 bytes even with VLAN tagging,
because the limits are lifted in HW as expected, for every VLAN tag.
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 8 Jan 2018 19:51:04 +0000 (11:51 -0800)]
Mark HI and TASKLET softirq synchronous
Way back in 4.9, we committed 4cd13c21b207 ("softirq: Let ksoftirqd do
its job"), and ever since we've had small nagging issues with it. For
example, we've had:
1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred") 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do not get to run") 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()")
all of which worked around some of the effects of that commit.
The DVB people have also complained that the commit causes excessive USB
URB latencies, which seems to be due to the USB code using tasklets to
schedule USB traffic. This seems to be an issue mainly when already
living on the edge, but waiting for ksoftirqd to handle it really does
seem to cause excessive latencies.
Now Hanna Hawa reports that this issue isn't just limited to USB URB and
DVB, but also causes timeout problems for the Marvell SoC team:
"I'm facing kernel panic issue while running raid 5 on sata disks
connected to Macchiatobin (Marvell community board with Armada-8040
SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
uses a tasklet to clean the done descriptors from the queue"
The latency problem causes a panic:
mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
We've discussed simply just reverting the original commit entirely, and
also much more involved solutions (with per-softirq threads etc). This
patch is intentionally stupid and fairly limited, because the issue
still remains, and the other solutions either got sidetracked or had
other issues.
We should probably also consider the timer softirqs to be synchronous
and not be delayed to ksoftirqd (since they were the issue with the
earlier watchdog problems), but that should be done as a separate patch.
This does only the tasklet cases.
Reported-and-tested-by: Hanna Hawa <hannah@marvell.com> Reported-and-tested-by: Josef Griebichler <griebichler.josef@gmx.at> Reported-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The SNDRV_RAWMIDI_IOCTL_PARAMS ioctl may resize the buffers and the
current code is racy. For example, the sequencer client may write to
buffer while it being resized.
As a simple workaround, let's switch to the resized buffer inside the
stream runtime lock.
btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block()
In commit ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device
replace") we removed the branch of copy_nocow_pages() to avoid
corruption for compressed nodatasum extents.
However above commit only solves the problem in scrub_extent(), if
during scrub_pages() we failed to read some pages,
sctx->no_io_error_seen will be non-zero and we go to fixup function
scrub_handle_errored_block().
In scrub_handle_errored_block(), for sctx without csum (no matter if
we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine,
which does the similar thing with copy_nocow_pages(), but does it
without the extra check in copy_nocow_pages() routine.
So for test cases like btrfs/100, where we emulate read errors during
replace/scrub, we could corrupt compressed extent data again.
This patch will fix it just by avoiding any "optimization" for
nodatasum, just falls back to the normal fixup routine by try read from
any good copy.
This also solves WARN_ON() or dead lock caused by lame backref iteration
in scrub_fixup_nodatasum() routine.
The deadlock or WARN_ON() won't be triggered before commit ac0b4145d662
("btrfs: scrub: Don't use inode pages for device replace") since
copy_nocow_pages() have better locking and extra check for data extent,
and it's already doing the fixup work by try to read data from any good
copy, so it won't go scrub_fixup_nodatasum() anyway.
This patch disables the faulty code and will be removed completely in a
followup patch.
Fixes: ac0b4145d662 ("btrfs: scrub: Don't use inode pages for device replace") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
SMC ioctl processing requires the sock lock to work properly in
all thinkable scenarios.
Problem has been found with RaceFuzzer and fixes:
KASAN: null-ptr-deref Read in smc_ioctl
Reported-by: Byoungyoung Lee <lifeasageek@gmail.com> Reported-by: syzbot+35b2c5aa76fd398b9fd4@syzkaller.appspotmail.com Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch has fix for TX timeout while running bi-directional
traffic with 100 Mbps using 5762.
Signed-off-by: Sanjeev Bansal <sanjeevb.bansal@broadcom.com> Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
John Allen [Mon, 16 Jul 2018 15:29:30 +0000 (10:29 -0500)]
ibmvnic: Fix error recovery on login failure
Testing has uncovered a failure case that is not handled properly. In the
event that a login fails and we are not able to recover on the spot, we
return 0 from do_reset, preventing any error recovery code from being
triggered. Additionally, the state is set to "probed" meaning that when we
are able to trigger the error recovery, the driver always comes up in the
probed state. To handle the case properly, we need to return a failure code
here and set the adapter state to the state that we entered the reset in
indicating the state that we would like to come out of the recovery reset
in.
Signed-off-by: John Allen <jallen@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Wahren [Sun, 15 Jul 2018 19:53:20 +0000 (21:53 +0200)]
net: lan78xx: Fix race in tx pending skb size calculation
The skb size calculation in lan78xx_tx_bh is in race with the start_xmit,
which could lead to rare kernel oopses. So protect the whole skb walk with
a spin lock. As a benefit we can unlink the skb directly.
This patch was tested on Raspberry Pi 3B+
Link: https://github.com/raspberrypi/linux/issues/2608 Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet") Cc: stable <stable@vger.kernel.org> Signed-off-by: Floris Bos <bos@je-eigen-domein.nl> Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Sun, 15 Jul 2018 16:35:19 +0000 (09:35 -0700)]
net/ipv6: Do not allow device only routes via the multipath API
Eric reported that reverting the patch that fixed and simplified IPv6
multipath routes means reverting back to invalid userspace notifications.
eg.,
$ ip -6 route add 2001:db8:1::/64 nexthop dev eth0 nexthop dev eth1
only generates a single notification:
2001:db8:1::/64 dev eth0 metric 1024 pref medium
While working on a fix for this problem I found another case that is just
broken completely - a multipath route with a gateway followed by device
followed by gateway:
$ ip -6 ro add 2001:db8:103::/64
nexthop via 2001:db8:1::64
nexthop dev dummy2
nexthop via 2001:db8:3::64
In this case the device only route is dropped completely - no notification
to userpsace but no addition to the FIB either:
$ ip -6 ro ls
2001:db8:1::/64 dev dummy1 proto kernel metric 256 pref medium
2001:db8:2::/64 dev dummy2 proto kernel metric 256 pref medium
2001:db8:3::/64 dev dummy3 proto kernel metric 256 pref medium
2001:db8:103::/64 metric 1024
nexthop via 2001:db8:1::64 dev dummy1 weight 1
nexthop via 2001:db8:3::64 dev dummy3 weight 1 pref medium
fe80::/64 dev dummy1 proto kernel metric 256 pref medium
fe80::/64 dev dummy2 proto kernel metric 256 pref medium
fe80::/64 dev dummy3 proto kernel metric 256 pref medium
Really, IPv6 multipath is just FUBAR'ed beyond repair when it comes to
device only routes, so do not allow it all.
This change will break any scripts relying on the mpath api for insert,
but I don't see any other way to handle the permutations. Besides, since
the routes are added to the FIB as standalone (non-multipath) routes the
kernel is not doing what the user requested, so it might as well tell the
user that.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Baranoff [Sun, 15 Jul 2018 15:36:37 +0000 (11:36 -0400)]
tcp: Fix broken repair socket window probe patch
Correct previous bad attempt at allowing sockets to come out of TCP
repair without sending window probes. To avoid changing size of
the repair variable in struct tcp_sock, this lets the decision for
sending probes or not to be made when coming out of repair by
introducing two ways to turn it off.
v2:
* Remove erroneous comment; defines now make behavior clear
Fixes: 70b7ff130224 ("tcp: allow user to create repair socket without window probes") Signed-off-by: Stefan Baranoff <sbaranoff@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When a new rx packet arrives, the rx path will decide whether to reuse
the remainder of the page or not according to one of the below conditions:
1. frag_info->frag_stride == PAGE_SIZE / 2
2. frags->page_offset + frag_info->frag_size > PAGE_SIZE;
The first condition is no met for when XDP is set.
For XDP, page_offset is always set to priv->rx_headroom which is
XDP_PACKET_HEADROOM and frag_info->frag_size is around mtu size + some
padding, still the 2nd release condition will hold since
XDP_PACKET_HEADROOM + 1536 < PAGE_SIZE, as a result the page will not
be released and will be _wrongly_ reused for next free rx descriptor.
In XDP there is an assumption to have a page per packet and reuse can
break such assumption and might cause packet data corruptions.
Fix this by adding an extra condition (!priv->rx_headroom) to the 2nd
case to avoid page reuse when XDP is set, since rx_headroom is set to 0
for non XDP setup and set to XDP_PACKET_HEADROOM for XDP setup.
No additional cache line is required for the new condition.
Fixes: 34db548bfb95 ("mlx4: add page recycling in receive path") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Suggested-by: Martin KaFai Lau <kafai@fb.com> CC: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
CC [M] drivers/net/ethernet/freescale/fman/fman.o
In file included from ../drivers/net/ethernet/freescale/fman/fman.c:35:
../include/linux/fsl/guts.h: In function 'guts_set_dmacr':
../include/linux/fsl/guts.h:165:2: error: implicit declaration of function 'clrsetbits_be32' [-Werror=implicit-function-declaration]
clrsetbits_be32(&guts->dmacr, 3 << shift, device << shift);
^~~~~~~~~~~~~~~
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Madalin Bucur <madalin.bucur@nxp.com> Cc: netdev@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: David S. Miller <davem@davemloft.net>
hv/netvsc: fix handling of fallback to single queue mode
The netvsc device may need to fallback to running in single queue
mode if host side only wants to support single queue.
Recent change for handling mtu broke this in setup logic.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: 3ffe64f1a641 ("hv_netvsc: split sub-channel setup into async and sync") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Falcon [Fri, 13 Jul 2018 17:03:32 +0000 (12:03 -0500)]
ibmvnic: Revise RX/TX queue error messages
During a device failover, there may be latency between the loss
of the current backing device and a notification from firmware that
a failover has occurred. This latency can result in a large amount of
error printouts as firmware returns outgoing traffic with a generic
error code. These are not necessarily errors in this case as the
firmware is busy swapping in a new backing adapter and is not ready
to send packets yet. This patch reclassifies those error codes as
warnings with an explanation that a failover may be pending. All
other return codes will be considered errors.
Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: make DAD fail with enhanced DAD when nonce length differs
Commit adc176c54722 ("ipv6 addrconf: Implemented enhanced DAD (RFC7527)")
added enhanced DAD with a nonce length of 6 bytes. However, RFC7527
doesn't specify the length of the nonce, other than being 6 + 8*k bytes,
with integer k >= 0 (RFC3971 5.3.2). The current implementation simply
assumes that the nonce will always be 6 bytes, but others systems are
free to choose different sizes.
If another system sends a nonce of different length but with the same 6
bytes prefix, it shouldn't be considered as the same nonce. Thus, check
that the length of the received nonce is the same as the length we sent.
Ugly scapy test script running on veth0:
def loop():
pkt=sniff(iface="veth0", filter="icmp6", count=1)
pkt = pkt[0]
b = bytearray(pkt[Raw].load)
b[1] += 1
b += b'\xde\xad\xbe\xef\xde\xad\xbe\xef'
pkt[Raw].load = bytes(b)
pkt[IPv6].plen += 8
# fixup checksum after modifying the payload
pkt[IPv6].payload.cksum -= 0x3b44
if pkt[IPv6].payload.cksum < 0:
pkt[IPv6].payload.cksum += 0xffff
sendp(pkt, iface="veth0")
This should result in DAD failure for any address added to veth0's peer,
but is currently ignored.
Fixes: adc176c54722 ("ipv6 addrconf: Implemented enhanced DAD (RFC7527)") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch remove the following documentation warning
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c:103: warning: Excess function parameter 'priv' description in 'stmmac_axi_setup'
It was introduced in commit afea03656add7 ("stmmac: rework DMA bus setting and introduce new platform AXI structure")
Signed-off-by: Corentin Labbe <clabbe@baylibre.com> Signed-off-by: David S. Miller <davem@davemloft.net>
A KASAN:use-after-free bug was found related to ip6-erspan
while running selftests/net/ip6_gre_headroom.sh
It happens because of following sequence:
- ipv6hdr pointer is obtained from skb
- skb_cow_head() is called, skb->head memory is reallocated
- old data is accessed using ipv6hdr pointer
skb_cow_head() call was added in e41c7c68ea77 ("ip6erspan: make sure
enough headroom at xmit."), but looking at the history there was a
chance of similar bug because gre_handle_offloads() and pskb_trim()
can also reallocate skb->head memory. Fixes tag points to commit
which introduced possibility of this bug.
This patch moves ipv6hdr pointer assignment after skb_cow_head() call.
Fixes: 5a963eb61b7c ("ip6_gre: Add ERSPAN native tunnel support") Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
On XDP_TX we need to free up the frame only when tun_xdp_tx() returns a
negative value. A positive value indicates that the packet is
successfully enqueued to the ptr_ring, so freeing the page causes
use-after-free.
Fixes: 735fc4054b3a ("xdp: change ndo_xdp_xmit API to support bulking") Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Watson [Thu, 12 Jul 2018 15:03:43 +0000 (08:03 -0700)]
tls: Stricter error checking in zerocopy sendmsg path
In the zerocopy sendmsg() path, there are error checks to revert
the zerocopy if we get any error code. syzkaller has discovered
that tls_push_record can return -ECONNRESET, which is fatal, and
happens after the point at which it is safe to revert the iter,
as we've already passed the memory to do_tcp_sendpages.
Previously this code could return -ENOMEM and we would want to
revert the iter, but AFAIK this no longer returns ENOMEM after a447da7d004 ("tls: fix waitall behavior in tls_sw_recvmsg"),
so we fail for all error codes.
Reported-by: syzbot+c226690f7b3126c5ee04@syzkaller.appspotmail.com Reported-by: syzbot+709f2810a6a05f11d4d3@syzkaller.appspotmail.com Signed-off-by: Dave Watson <davejwatson@fb.com> Fixes: 3c4d7559159b ("tls: kernel TLS support") Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Constantine Shulyupin <const@MakeLinux.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Biggers [Wed, 11 Jul 2018 17:46:29 +0000 (10:46 -0700)]
KEYS: DNS: fix parsing multiple options
My recent fix for dns_resolver_preparse() printing very long strings was
incomplete, as shown by syzbot which still managed to hit the
WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:
precision 50001 too large
WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0
The bug this time isn't just a printing bug, but also a logical error
when multiple options ("#"-separated strings) are given in the key
payload. Specifically, when separating an option string into name and
value, if there is no value then the name is incorrectly considered to
end at the end of the key payload, rather than the end of the current
option. This bypasses validation of the option length, and also means
that specifying multiple options is broken -- which presumably has gone
unnoticed as there is currently only one valid option anyway.
A similar problem also applied to option values, as the kstrtoul() when
parsing the "dnserror" option will read past the end of the current
option and into the next option.
Fix these bugs by correctly computing the length of the option name and
by copying the option value, null-terminated, into a temporary buffer.
Reproducer for "dnserror" option being parsed incorrectly (expected
behavior is to fail when seeing the unknown option "foo", actual
behavior was to read the dnserror value as "1#foo" and fail there):
Reported-by: syzbot <syzkaller@googlegroups.com> Fixes: 4a2d789267e0 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
multicast: init as INCLUDE when join SSM INCLUDE group
Based on RFC3376 5.1 and RFC3810 6.1, we should init as INCLUDE when join SSM
INCLUDE group. In my first version I only clear the group change record. But
this is not enough as when a new group join, it will init as EXCLUDE and
trigger an filter mode change in ip/ip6_mc_add_src(), which will clear all
source addresses' sf_crcount. This will prevent early joined address sending
state change records if multi source addresses joined at the same time.
In this v2 patchset, I fixed it by directly initializing the mode to INCLUDE
for SSM JOIN_SOURCE_GROUP. I also split the original patch into two separated
patches for IPv4 and IPv6.
Test: test by myself and customer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Tue, 10 Jul 2018 14:41:27 +0000 (22:41 +0800)]
ipv6/mcast: init as INCLUDE when join SSM INCLUDE group
This an IPv6 version patch of "ipv4/igmp: init group mode as INCLUDE when
join source group". From RFC3810, part 6.1:
If no per-interface state existed for that
multicast address before the change (i.e., the change consisted of
creating a new per-interface record), or if no state exists after the
change (i.e., the change consisted of deleting a per-interface
record), then the "non-existent" state is considered to have an
INCLUDE filter mode and an empty source list.
Which means a new multicast group should start with state IN(). Currently,
for MLDv2 SSM JOIN_SOURCE_GROUP mode, we first call ipv6_sock_mc_join(),
then ip6_mc_source(), which will trigger a TO_IN() message instead of
ALLOW().
The issue was exposed by commit a052517a8ff65 ("net/multicast: should not
send source list records when have filter mode change"). Before this change,
we sent both ALLOW(A) and TO_IN(A). Now, we only send TO_IN(A).
Fix it by adding a new parameter to init group mode. Also add some wrapper
functions to avoid changing too much code.
v1 -> v2:
In the first version I only cleared the group change record. But this is not
enough. Because when a new group join, it will init as EXCLUDE and trigger
a filter mode change in ip/ip6_mc_add_src(), which will clear all source
addresses sf_crcount. This will prevent early joined address sending state
change records if multi source addressed joined at the same time.
In v2 patch, I fixed it by directly initializing the mode to INCLUDE for SSM
JOIN_SOURCE_GROUP. I also split the original patch into two separated patches
for IPv4 and IPv6.
There is also a difference between v4 and v6 version. For IPv6, when the
interface goes down and up, we will send correct state change record with
unspecified IPv6 address (::) with function ipv6_mc_up(). But after DAD is
completed, we resend the change record TO_IN() in mld_send_initial_cr().
Fix it by sending ALLOW() for INCLUDE mode in mld_send_initial_cr().
Fixes: a052517a8ff65 ("net/multicast: should not send source list records when have filter mode change") Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>