Or Gerlitz [Thu, 21 May 2015 12:14:09 +0000 (15:14 +0300)]
net/mlx4_core: Adjust the schedule queue port in reset-to-init too
It's legal for drivers to provide the QP port through the
QPC schedule-queue field on the reset-to-init QP state change.
Add adjusting of the schedule queue port in the SRIOV wrapper
for that operation too.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Thu, 21 May 2015 12:14:08 +0000 (15:14 +0300)]
net/mlx4_core: Adjust the schedule queue port for single ported IB VFs
Some VF drivers flow set the schedule queue in the QP context but
without setting none of OPTPAR_SCHED_QUEUE or OPTPAR_PRIMARY_ADDR_PATH.
To allow for such non-modified drivers to function as single ported
IB VFs, we must adjust the schedule queue port whenever being set,
e.g as currently done for single ported Eth VFs.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Thu, 21 May 2015 12:14:07 +0000 (15:14 +0300)]
net/mlx4_core: Modify port values when generting EQEs for VFs
As part of enabling single ported VFs over IB ports we need to handle
some of the flows for generting EQ events for VFs which don't come
into play under Eth ports.
This mainly includes port management events derived from changes of the
phyiscal port (lid change, client re-register, down/up, etc), VF pkey table
changes and VF guid changes initiated by the IB driver.
(1) make sure that events are generated only for VFs sitting on
the relevant physical port (under the ALL_SLAVES flow).
(2) before generating the event, convert from physical (one or two)
to VF port (always equals one).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Thu, 21 May 2015 12:14:06 +0000 (15:14 +0300)]
IB/mlx4: Convert slave port before building address-handle
When multiplexling a MAD sent from VF, we should convert the port used
by the guest to send the packet to the actual physical port which will be
used to transmit the packet, before building the relevant address-handle (AH).
This is needed under VPI for single ported VFs, since the code that builds
the AH (mlx4_ib_query_ah()) makes decisions based on the input port. If we
use the port number provided by the guest, it might have different protocol
vs. the one this packat has to go from, and hence the result could be wrong.
So far, the conversion was done after the AH was built and it worked for
single ported Eth VFs which were not enabled under VPI. When adding support
for single ported IB VFs and VPI, we hit that.
Fixes: 449fc48866f7 ('net/mlx4: Adapt code for N-Port VF') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Or Gerlitz [Thu, 21 May 2015 12:14:05 +0000 (15:14 +0300)]
net/mlx4_core: Enhance the MAD_IFC wrapper to convert VF port to physical
Single port VFs always provide port = 1 (even if the actual physical
port used is port 2). As such, we need to convert the port provided
by the VF to the physical port before calling into the firmware.
It turns out that the Linux mlx4 VF RoCE driver maintains a copy of
the GID table and hence this change became critical only for single
ported IB VFs, but it could be needed for other RoCE VF drivers too.
Fixes: 449fc48866f7 ('net/mlx4: Adapt code for N-Port VF') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 23 May 2015 03:59:23 +0000 (23:59 -0400)]
Merge branch 'pktgen-new-scripts'
Jesper Dangaard Brouer says:
====================
pktgen: cleanups and introducing new samples/pktgen scripts
v3:
- Aborted v2 send due it was not generating diff stat
(this is a bug in stg-mail, if not in the root directory)
v2: address nitpicks from Cong Wang
- Remove useless cat's, but keep them for old pgset()
- Comment on: Due to pgctrl, cannot use exit code $? from grep
- Use arithmetic compare in pktgen_sample03_burst_single_flow.sh
This patchset is focused on making pktgen easier to use and better
documented. It contains a number of documentation updates and minor
changes to pktgen. The major contribution is introduction of common
helper function for sample scripts.
Instead of the old pgset() function, three new shell functions for
configuring the different components of pktgen are introduced:
pg_ctrl(), pg_thread() and pg_set().
The new functions correspond to pktgens different components.
* pg_ctrl() control "pgctrl" (/proc/net/pktgen/pgctrl)
* pg_thread() control the kernel threads and binding to devices
* pg_set() control setup of individual devices
Helpers also provide consistent parameter parsing across the sample
scripts.
This script pktgen_bench_xmit_mode_netif_receive.sh is a benchmark
script, which can be used for benchmarking part of the network stack.
This can be used for performance improving or catching regression in
that area.
The script is developed for benchmarking ingress qdisc path, original
idea by Alexei Starovoitov. This script don't really need any
hardware. This is achieved via the recently introduced stack inject
feature "xmit_mode netif_receive". See commit 62f64aed622b6 ("pktgen:
introduce xmit_mode '<start_xmit|netif_receive>'").
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add the pktgen samples script pktgen_sample03_burst_single_flow.sh
that demonstrates how to acheive maximum performance.
If correctly tuned[1] single CPU 10Gbit/s wirespeed small pkts is
possible[2] which is 14.88Mpps. The trick is to take advantage of the
"burst" feature introduced in commit 38b2cf2982dc73 ("net: pktgen:
packet bursting via skb->xmit_more").
Add the pktgen samples script pktgen_sample02_multiqueue.sh that
demonstrates generating packets on multiqueue NICs.
Specifically notice the options "-t" that specifies how many
kernel threads to activate. Also notice the flag QUEUE_MAP_CPU,
which cause the SKB TX queue to be mapped to the CPU running the
kernel thread. For best scalability people are also encourage to
map NIC IRQ /proc/irq/*/smp_affinity to CPU number.
Usage example with "-t" 4 threads and help:
./pktgen_sample02_multiqueue.sh -i eth4 -m 00:1B:21:3C:9D:F8 -t 4
Add the first basic pktgen samples script pktgen_sample01_simple.sh,
which demonstrates the a simple use of the helper functions.
Removing pktgen.conf-1-1 as that example should be covered now.
The naming scheme pktgen_sampleNN, where NN is a number, should encourage
reading the samples in a specific order.
Script cause pktgen sending with a single thread and single interface,
and introduce flow variation via random UDP source port.
Usage example and help:
./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2
pktgen: new pktgen helper functions for samples scripts
Preparing for removing existing samples/pktgen/ scripts, and
replacing these with easier to use samples.
This commit provides two helper shell files, that can
be "included" by shell source'ing. Namely "functions.sh"
and "parameters.sh".
The parameters.sh file support easy and consistant parameter
parsing across the sample scripts. Usage example is printed on
errors.
The functions.sh file provides, three new shell functions for
configuring the different components of pktgen: pg_ctrl(),
pg_thread() and pg_set(). A slightly improved version of the old
pgset() function is also provided for backwards compat.
The new functions correspond to pktgens different components.
* pg_ctrl() control "pgctrl" (/proc/net/pktgen/pgctrl)
* pg_thread() control the kernel threads and binding to devices
* pg_set() control setup of individual devices
These changes are borrowed from:
https://github.com/netoptimizer/network-testing/tree/master/pktgen
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
pktgen: make /proc/net/pktgen/pgctrl report fail on invalid input
Giving /proc/net/pktgen/pgctrl an invalid command just returns shell
success and prints a warning in dmesg. This is not very useful for
shell scripting, as it can only detect the error by parsing dmesg.
Instead return -EINVAL when the command is unknown, as this provides
userspace shell scripting a way of detecting this.
Also bump version tag to 2.75, because (1) reading /proc/net/pktgen/pgctrl
output this version number which would allow to detect this small
semantic change, and (2) because the pktgen version tag have not been
updated since 2010.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
pktgen: document ability to add same device to several threads
The pktgen.txt documentation still claimed that adding same device to
multiple threads were not supported, but it have been since 2008 via
commit e6fce5b916cd7 ("pktgen: multiqueue etc.").
Document this and describe the naming scheme dev@X, as the procfile name
still need to be unique.
Fixes: e6fce5b916cd7 ("pktgen: multiqueue etc.") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 23 May 2015 00:34:24 +0000 (17:34 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Radeon has two displayport fixes, one for a regression.
i915 regression flicker fix needed so 4.0 can get fixed.
A bunch of msm fixes and a bunch of exynos fixes, these two are
probably a bit larger than I'd like, but most of them seems pretty
good"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (29 commits)
drm/radeon: fix error flag checking in native aux path
drm/radeon: retry dcpd fetch
drm/msm/mdp5: fix incorrect parameter for msm_framebuffer_iova()
drm/exynos: dp: Lower level of EDID read success message
drm/exynos: cleanup exynos_drm_plane
drm/exynos: 'win' is always unsigned
drm/exynos: mixer: don't dump registers under spinlock
drm/exynos: Consolidate return statements in fimd_bind()
drm/exynos: Constify exynos_drm_crtc_ops
drm/exynos: Fix build breakage on !DRM_EXYNOS_FIMD
drm/exynos: mixer: Constify platform_device_id
drm/exynos: mixer: cleanup pixelformat handling
drm/exynos: mixer: also allow NV21 for the video processor
drm/exynos: mixer: remove buffer count handling in vp_video_buffer()
drm/exynos: plane: honor buffer offset for dma_addr
drm/exynos: fb: use drm_format_num_planes to get buffer count
drm/i915: fix screen flickering
drm/msm: fix locking inconsistencies in gpu->destroy()
drm/msm/dsi: Simplify the code to get the number of read byte
drm/msm: Attach assigned encoder to eDP and DSI connectors
...
1) Don't leak ipvs->sysctl_tbl, from Tommi Rentala.
2) Fix neighbour table entry leak in rocker driver, from Ying Xue.
3) Do not emit bonding notifications for unregistered interfaces, from
Nicolas Dichtel.
4) Set ipv6 flow label properly when in TIME_WAIT state, from Florent
Fourcot.
5) Fix regression in ipv6 multicast filter test, from Henning Rogge.
6) do_replace() in various footables netfilter modules is missing a
check for 0 counters in the datastructure provided by the user. Fix
from Dave Jones, and found with trinity.
7) Fix RCU bug in packet scheduler classifier module unloads, from
Daniel Borkmann.
8) Avoid deadlock in tcp_get_info() by using u64_sync. From Eric
Dumzaet.
9) Input packet processing can race with inetdev_destroy() teardown,
fix potential OOPS in ip_error() by explicitly testing whether the
inetdev is still attached. From Eric W Biederman.
10) MLDv2 parser in bridge multicast code breaks too early while
parsing. Fix from Thadeu Lima de Souza Cascardo.
11) Asking for settings on non-zero PHYID doesn't work because we do not
import the command structure from the user and use the PHYID
provided there. Fix from Arun Parameswaran.
12) Fix UDP checksums with IPV6 RAW sockets, from Vlad Yasevich.
13) Missing NF_TABLES depends for TPROXY etc can cause build failures,
fix from Florian Westphal.
14) Fix netfilter conntrack to handle RFC5961 challenge ACKs properly,
from Jesper Dangaard Brouer.
15) If netlink autobind retry fails, we have to reset the sockets portid
back to zero. From Herbert Xu.
16) VXLAN netns exit code unregisters using wrong device, from John W
Linville.
17) Add some USB device IDs to ath3k and btusb bluetooth drivers, from
Dmitry Tunin and Wen-chien Jesse Sung.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
bridge: fix lockdep splat
net: core: 'ethtool' issue with querying phy settings
bridge: fix parsing of MLDv2 reports
ARM: zynq: DT: Use the zynq binding with macb
net: macb: Disable half duplex gigabit on Zynq
net: macb: Document zynq gem dt binding
ipv4: fill in table id when replacing a route
cdc_ncm: Fix tx_bytes statistics
ipv4: Avoid crashing in ip_error
tcp: fix a potential deadlock in tcp_get_info()
net: sched: fix call_rcu() race on classifier module unloads
net: phy: Make sure phy_start() always re-enables the phy interrupts
ipv6: fix ECMP route replacement
ipv6: do not delete previously existing ECMP routes if add fails
Revert "netfilter: bridge: query conntrack about skb dnat"
netfilter: ensure number of counters is >0 in do_replace()
netfilter: nfnetlink_{log,queue}: Register pernet in first place
tcp: don't over-send F-RTO probes
tcp: only undo on partial ACKs in CA_Loss
net/ipv6/udp: Fix ipv6 multicast socket filter regression
...
Linus Torvalds [Fri, 22 May 2015 22:15:30 +0000 (15:15 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Three small fixes that have been picked up the last few weeks.
Specifically:
- Fix a memory corruption issue in NVMe with malignant user
constructed request. From Christoph.
- Kill (now) unused blk_queue_bio(), dm was changed to not need this
anymore. From Mike Snitzer.
- Always use blk_schedule_flush_plug() from the io_schedule() path
when flushing a plug, fixing a !TASK_RUNNING warning with md. From
Shaohua"
* 'for-linus' of git://git.kernel.dk/linux-block:
sched: always use blk_schedule_flush_plug in io_schedule_out
nvme: fix kernel memory corruption with short INQUIRY buffers
block: remove export for blk_queue_bio
Linus Torvalds [Fri, 22 May 2015 21:49:55 +0000 (14:49 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
"Updates for the input subsystem.
The main change is that we tell joydev not to touch "absolute mice",
such as VMware virtual mouse, as that produced bad result (cursor
stuck in upper right corner) with games"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: smtpe-ts - wait 50mS until polling for pen-up
Input: smtpe-ts - use msecs_to_jiffies() instead of HZ
Input: joydev - don't classify the vmmouse as a joystick
Input: vmmouse - do not reference non-existing version of X driver
Input: alps - fix finger jumps on lifting 2 fingers on v7 touchpad
Input: elantech - fix semi-mt protocol for v3 HW
Input: sx8654 - fix memory allocation check
Problem here is that br_forward_delay_timer_expired() is a timer
handler, calling br_ifinfo_notify() which assumes either rcu_read_lock()
or RTNL are held.
Simplest fix seems to add rcu read lock section.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Josh Boyer <jwboyer@fedoraproject.org> Reported-by: Dominick Grift <dac.override@gmail.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: core: 'ethtool' issue with querying phy settings
When trying to configure the settings for PHY1, using commands
like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
modify other settings apart from the speed of the PHY1, in the
above case.
The ethtool seems to query the settings for PHY0, and use this
as the base to apply the new settings to the PHY1. This is
causing the other settings of the PHY 1 to be wrongly
configured.
The issue is caused by the '_ethtool_get_settings()' API, which
gets called because of the 'ETHTOOL_GSET' command, is clearing
the 'cmd' pointer (of type 'struct ethtool_cmd') by calling
memset. This clears all the parameters (if any) passed for the
'ETHTOOL_GSET' cmd. So the driver's callback is always invoked
with 'cmd->phy_address' as '0'.
The '_ethtool_get_settings()' is called from other files in the
'net/core'. So the fix is applied to the 'ethtool_get_settings()'
which is only called in the context of the 'ethtool'.
Signed-off-by: Arun Parameswaran <aparames@broadcom.com> Reviewed-by: Ray Jui <rjui@broadcom.com> Reviewed-by: Scott Branden <sbranden@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Holzheu [Fri, 22 May 2015 15:36:40 +0000 (08:36 -0700)]
test_bpf: Add backward jump test case
Currently the testsuite does not have a test case with a backward jump.
The s390x JIT (kernel 4.0) had a bug in that area.
So add one new test case for this now.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
When more than a multicast address is present in a MLDv2 report, all but
the first address is ignored, because the code breaks out of the loop if
there has not been an error adding that address.
This has caused failures when two guests connected through the bridge
tried to communicate using IPv6. Neighbor discoveries would not be
transmitted to the other guest when both used a link-local address and a
static address.
This only happens when there is a MLDv2 querier in the network.
The fix will only break out of the loop when there is a failure adding a
multicast address.
The mdb before the patch:
dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
After the patch:
dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
dev ovirtmgmt port bond0.86 grp ff02::fb temp
dev ovirtmgmt port bond0.86 grp ff02::2 temp
dev ovirtmgmt port bond0.86 grp ff02::d temp
dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
dev ovirtmgmt port bond0.86 grp ff02::16 temp
dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp
Fixes: 08b202b67264 ("bridge br_multicast: IPv6 MLD support.") Reported-by: Rik Theys <Rik.Theys@esat.kuleuven.be> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Tested-by: Rik Theys <Rik.Theys@esat.kuleuven.be> Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Sullivan [Fri, 22 May 2015 14:22:10 +0000 (09:22 -0500)]
net: macb: Disable half duplex gigabit on Zynq
According to the Zynq TRM, gigabit half duplex is not supported. Add a
new cap and compatible string so Zynq can avoid advertising that mode.
Signed-off-by: Nathan Sullivan <nathan.sullivan@ni.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nathan Sullivan [Fri, 22 May 2015 14:22:09 +0000 (09:22 -0500)]
net: macb: Document zynq gem dt binding
Signed-off-by: Nathan Sullivan <nathan.sullivan@ni.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubeček [Fri, 22 May 2015 11:40:09 +0000 (13:40 +0200)]
ipv4: fill in table id when replacing a route
When replacing an IPv4 route, tb_id member of the new fib_alias
structure is not set in the replace code path so that the new route is
ignored.
Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse") Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bjørn Mork [Fri, 22 May 2015 11:15:22 +0000 (13:15 +0200)]
cdc_ncm: Fix tx_bytes statistics
The tx_curr_frame_payload field is u32. When we try to calculate a
small negative delta based on it, we end up with a positive integer
close to 2^32 instead. So the tx_bytes pointer increases by about
2^32 for every transmitted frame.
Fix by calculating the delta as a signed long.
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Reported-by: Florian Bruhin <me@the-compiler.org> Fixes: 7a1e890e2168 ("usbnet: Fix tx_bytes statistic running backward in cdc_ncm") Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
The following patchset contain Netfilter fixes for your net tree, they are:
1) Fix a race in nfnetlink_log and nfnetlink_queue that can lead to a crash.
This problem is due to wrong order in the per-net registration and netlink
socket events. Patch from Francesco Ruggeri.
2) Make sure that counters that userspace pass us are higher than 0 in all the
x_tables frontends. Discovered via Trinity, patch from Dave Jones.
3) Revert a patch for br_netfilter to rely on the conntrack status bits. This
breaks stateless IPv6 NAT transformations. Patch from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_error does not check if in_dev is NULL before dereferencing it.
IThe following sequence of calls is possible:
CPU A CPU B
ip_rcv_finish
ip_route_input_noref()
ip_route_input_slow()
inetdev_destroy()
dst_input()
With the result that a network device can be destroyed while processing
an input packet.
A crash was triggered with only unicast packets in flight, and
forwarding enabled on the only network device. The error condition
was created by the removal of the network device.
As such it is likely the that error code was -EHOSTUNREACH, and the
action taken by ip_error (if in_dev had been accessible) would have
been to not increment any counters and to have tried and likely failed
to send an icmp error as the network device is going away.
Therefore handle this weird case by just dropping the packet if
!in_dev. It will result in dropping the packet sooner, and will not
result in an actual change of behavior.
Fixes: 251da4130115b ("ipv4: Cache ip_error() routes even when not forwarding.") Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Tested-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: Vittorio Gambaletta <linuxbugs@vittgam.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 22 May 2015 09:05:58 +0000 (11:05 +0200)]
flow_dissector: do not break if ports are not needed in flowlabel
This restored previous behaviour. If caller does not want ports to be
filled, we should not break.
Fixes: 06635a35d13d ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 22 May 2015 04:51:19 +0000 (21:51 -0700)]
tcp: fix a potential deadlock in tcp_get_info()
Taking socket spinlock in tcp_get_info() can deadlock, as
inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
while packet processing can use the reverse locking order.
We could avoid this locking for TCP_LISTEN states, but lockdep would
certainly get confused as all TCP sockets share same lockdep classes.
[ 523.722504] ======================================================
[ 523.728706] [ INFO: possible circular locking dependency detected ]
[ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted
[ 523.739202] -------------------------------------------------------
[ 523.745474] ss/18032 is trying to acquire lock:
[ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
[ 523.758129]
[ 523.758129] but task is already holding lock:
[ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
[ 523.774661]
[ 523.774661] which lock already depends on the new lock.
[ 523.774661]
[ 523.782850]
[ 523.782850] the existing dependency chain (in reverse order) is:
[ 523.790326]
-> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
[ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270
[ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
[ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
[ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
[ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
[ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
[ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
[ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
[ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
[ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
[ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420
Lets use u64_sync infrastructure instead. As a bonus, 64bit
arches get optimized, as these are nop for them.
Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Fri, 22 May 2015 03:31:03 +0000 (13:31 +1000)]
Merge tag 'drm-intel-fixes-2015-05-21' of git://anongit.freedesktop.org/drm-intel into drm-fixes
There's a stable backport from Ander [1] that combines this and a few
other commits to fix the flickering on v4.0, reported in [2] among
others. Having this upstream is obviously a requirement for stable.
* tag 'drm-intel-fixes-2015-05-21' of git://anongit.freedesktop.org/drm-intel:
drm/i915: fix screen flickering
Florian Westphal [Thu, 21 May 2015 00:26:24 +0000 (02:26 +0200)]
net: sched: pkt_cls: remove unused macros from uapi
Jamal points out that this header also contains kernel internal magic that
cannot be used from userspace for anything meaningful.
Lets remove what the kernel doesn't use anymore and wrap remainder with
__KERNEL__.
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info
This patch tracks the total number of inbound and outbound segments on a
TCP socket. One may use this number to have an idea on connection
quality when compared against the retransmissions.
RFC4898 named these : tcpEStatsPerfSegsIn and tcpEStatsPerfSegsOut
These are a 32bit field each and can be fetched both from TCP_INFO
getsockopt() if one has a handle on a TCP socket, or from inet_diag
netlink facility (iproute2/ss patch will follow)
Note that tp->segs_out was placed near tp->snd_nxt for good data
locality and minimal performance impact, while tp->segs_in was placed
near tp->bytes_received for the same reason.
Join work with Eric Dumazet.
Note that received SYN are accounted on the listener, but sent SYNACK
are not accounted.
Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 20 May 2015 22:25:41 +0000 (00:25 +0200)]
ipv6: reject locally assigned nexthop addresses
ip -6 addr add dead::1/128 dev eth0
sleep 5
ip -6 route add default via dead::1/128
-> fails
ip -6 addr add dead::1/128 dev eth0
ip -6 route add default via dead::1/128
-> succeeds
reason is that if (nonsensensical) route above is added,
dead::1 is still subject to DAD, so the route lookup will
pick eth0 as outdev due to the prefix route that is added before
DAD work is started.
Add explicit test that checks if nexthop gateway is a local address.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1167969 Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
crypto: s390/ghash - Fix incorrect ghash icv buffer handling.
Multitheaded tests showed that the icv buffer in the current ghash
implementation is not handled correctly. A move of this working ghash
buffer value to the descriptor context fixed this. Code is tested and
verified with an multithreaded application via af_alg interface.
Cc: stable@vger.kernel.org Signed-off-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Signed-off-by: Gerald Schaefer <geraldsc@linux.vnet.ibm.com> Reported-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Linus Torvalds [Fri, 22 May 2015 02:54:50 +0000 (19:54 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"Bug fixes.
Three for our crypto code, two for eBPF, and one memory management fix
to get machines with memory > 8TB working"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/mm: correct return value of pmd_pfn
s390/crypto: fix stckf loop
s390/zcrypt: Fix invalid domain handling during ap module unload
s390/bpf: Fix gcov stack space problem
s390/zcrypt: fixed ap poll timer behavior
s390/bpf: Adjust ALU64_DIV/MOD to match interpreter change
Linus Torvalds [Fri, 22 May 2015 00:54:00 +0000 (17:54 -0700)]
Merge tag 'sound-4.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This batch became slightly large, just because I've been on vacation
for the last two weeks. Nothing to scare much here, all
device-specific fixes, mostly small patches.
Majority of patches are for HD-audio, especially Dell machines. The
rest are small ASoC fixes for various codecs, and a USB-audio quirk.
One PCM fix is included to ease the faulty condition checks in the
case of two periods PCM buffers"
* tag 'sound-4.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Disable widget power-saving for ALC292 & co
ALSA: hda - Reduce verbs by node power-saves
ALSA: sound/atmel/ac97c.c: remove unused variable
ALSA: usb-audio: Add quirk for MS LifeCam Studio
ALSA: pcm: Modify double acknowledged interrupts check condition
ALSA: hda/realtek - ALC292 dock fix for Thinkpad L450
ALSA: hda - Add Conexant codecs CX20721, CX20722, CX20723 and CX20724
ALSA: hda - Fix headset mic and mic-in for a Dell desktop
ASoC: wm8994: correct BCLK DIV 348 to 384
ASoC: wm8960: fix "RINPUT3" audio route error
ASoC: dapm: Modify widget stream name according to prefix
ALSA: hda - Add headset mic quirk for Dell Inspiron 5548
ASoC: rt5645: Fix mask for setting RT5645_DMIC_2_DP_GPIO12 bit
ASoC: rt5645: Add ACPI match ID
ALSA: hda/realtek - Add ALC298 alias name for Dell
ASoC: uda1380: Avoid accessing i2c bus when codec is disabled
ALSA: hda/realtek - Fix typo for ALC286/ALC288
ASoC: mc13783: Fix wrong mask value used in mc13xxx_reg_rmw() calls
ALSA: hda - Add headphone quirk for Lifebook E752
ASoC: davinci-mcasp: Correct pm status check in suspend callback
Linus Torvalds [Fri, 22 May 2015 00:36:23 +0000 (17:36 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull infiniband/rdma fixes from Doug Ledford:
"This should hopefully be the last request for 4.1-rc for the RDMA
stack. It contains some late ocrdma fixes that I'm including because
they are small and self contained. It also contains two bug fixes
that are simple and easily verified.
Summary:
- a number of small, well contained bug fixes for ocrdma driver
- a simple fix for the connection negotiation sequence on IB
- fix for broken AF_IB address on UD queue pair support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
IB/cma: Fix broken AF_IB UD support
ib/cm: Change reject message type when destroying cm_id
RDMA/ocrdma: Update ocrdma version number
RDMA/ocrdma: Fail connection for MTU lesser than 512
RDMA/ocrdma: Fix dmac resolution for link local address
RDMA/ocrdma: Prevent allocation of DPP PDs if FW doesnt support it
RDMA/ocrdma: Fix the request length for RDMA_QUERY_QP mailbox command to FW.
RDMA/ocrdma: Use VID 0 if PFC is enabled and vlan is not configured
RDMA/ocrdma: Fix QP state transition in destroy_qp
RDMA/ocrdma: Report EQ full fatal error
RDMA/ocrdma: Fix EQ destroy failure during driver unload
Linus Torvalds [Fri, 22 May 2015 00:23:11 +0000 (17:23 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID fixes from Jiri Kosina:
"Bugfixes for HID subsystem that should go in 4.1. Important
highlights:
- the patch that extended support for HID++ protocol for TK820
touchpad turns out to be causing regressions due to firmware
issues; patch reverting back to basic support from Benjamin
Tissoires
- Wacom driver can oops for devices that report non-touch data on
touch interfaces. Fix from Ping Cheng
- gpiolib is not mandatory for i2c-hid, so the driver shouldn't fail
if gpiolib is not enabled. Fix from Mika Westerberg"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: wacom: fix an Oops caused by wacom_wac_finger_count_touches
HID: usbhid: Add HID_QUIRK_NOGET for Aten DVI KVM switch
HID: hid-sensor-hub: Fix debug lock warning
Revert "HID: logitech-hidpp: support combo keyboard touchpad TK820"
HID: i2c-hid: Do not fail probing if gpiolib is not enabled
Marek Vasut [Fri, 8 May 2015 05:25:49 +0000 (22:25 -0700)]
Input: smtpe-ts - wait 50mS until polling for pen-up
Wait a little bit longer, 50mS instead of 20mS, until the driver starts
polling for pen-up. The problematic behavior before this patch is applied
is as follows. The behavior was observed on the STMPE610QTR controller.
Upon a physical pen-down event, the touchscreen reports one set of x-y-p
coordinates and a pen-down event. After that, the pen-up polling is
triggered and since the controller is not ready yet, the polling mistakenly
detects a pen-up event while the physical state is still such that the pen
is down on the touch surface.
The pen-up handling flushes the controller FIFO, so after that, all the
samples in the controller are discarded. The controller becomes ready
shortly after this bogus pen-up handling and does generate again a pen-down
interrupt. This time, the controller contains x-y-p samples which all read
as zero. Since pressure value is zero, this set of samples is effectively
ignored by userland.
In the end, the driver just bounces between pen-down and bogus pen-up
handling, generating no useful results. Fix this by giving the controller a
bit more time before polling it for pen-up.
Marek Vasut [Fri, 8 May 2015 05:24:58 +0000 (22:24 -0700)]
Input: smtpe-ts - use msecs_to_jiffies() instead of HZ
Use msecs_to_jiffies(20) instead of plain (HZ / 50), as the former is much
more explicit about it's behavior. We want to schedule the task 20 mS from
now, so make it explicit in the code.
Linus Torvalds [Thu, 21 May 2015 23:57:50 +0000 (16:57 -0700)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Michael Turquette:
"The first set of clk fixes for 4.1 are all driver bugs, with the
exception of a single locking fix in the core code.
All driver fixes are for code that was merged recently. The Samsung
stuff is mostly fixes around suspend/resume, the Qualcomm fixes are
for invalid hardware configuration data and the Silicon Labs patches
are fixes following their move away from platform_data to Device Tree"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: si5351: Do not pass struct clk in platform_data
clk: si5351: Mention clock-names in the binding documentation
clk: add missing lock when call clk_core_enable in clk_set_parent
clk: exynos5420: Restore GATE_BUS_TOP on suspend
clk: qcom: Fix MSM8916 gfx3d_clk_src configuration
clk: qcom: Fix MSM8916 venus divider value
clk: exynos5433: Fix wrong PMS value of exynos5433_pll_rates
clk: exynos5433: Fix wrong parent clock of sclk_apollo clock
clk: exynos5433: Fix CLK_PCLK_MONOTONIC_CNT clk register assignment
clk: exynos5433: Fix wrong offset of PCLK_MSCL_SECURE_SMMU_JPEG
clk: Use CONFIG_ARCH_EXYNOS instead of CONFIG_ARCH_EXYNOS5433
Thomas Hellstrom [Wed, 20 May 2015 21:50:30 +0000 (14:50 -0700)]
Input: joydev - don't classify the vmmouse as a joystick
Joydev is currently thinking some absolute mice are joystick, and that
messes up games in VMware guests, as the cursor typically gets stuck in
the top left corner.
Try to detect the event signature of a VMmouse input device and back off
for such devices. We're still incorrectly detecting, for example, the
VMware absolute USB mouse as a joystick, but adding an event signature
matching also that device would be considerably more risky, so defer that
to a later merge window.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
David S. Miller [Thu, 21 May 2015 22:57:26 +0000 (18:57 -0400)]
Merge branch 'stmmac-probe-refactoring'
Joachim Eastwood says:
====================
stmmac: probe code refactoring and clean up part 1
This patch set refactor the code in stmmac_pci_probe and stmmac_pltfr_probe
and moves the common bits into stmmac_dvr_probe. Along the way some clean-
ups are applied to stmmac_pltfr_probe.
The code has been tested on the LPC18xx platform.
I am still working on more refactoring of the platform probe code, hence
part 1, but I need some more time on this.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Wed, 20 May 2015 18:03:09 +0000 (20:03 +0200)]
stmmac: drop unnecessary dt checks in stmmac_probe_config_dt
Since the caller already check the presence of a of_node there
is no need to repeat the check in stmmac_probe_config_dt.
There is also no point in checking the return value of the
of_match_device function since if there wasn't match in the
first place we would never be in this function.
Signed-off-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Wed, 20 May 2015 18:03:08 +0000 (20:03 +0200)]
stmmac: change the stmmac_dvr_probe return type to int
Since stmmac_dvr_probe takes care of setting driver data and
assign resources to the priv structure there is no need to
access the priv structure from the other probe functions.
This mean that this function can be changed into just return
an int and thus simplifying the callers.
Signed-off-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Wed, 20 May 2015 18:03:07 +0000 (20:03 +0200)]
stmmac: let stmmac_dvr_probe take a struct of resources
Creat a struct that contain all the resources that needs to be
assigned to the priv struct in stmmac_dvr_probe. This makes it
possible to factor out more common code from the other probe
functions and also use this struct to hold the resources as
they are fetched.
Signed-off-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Wed, 20 May 2015 18:03:06 +0000 (20:03 +0200)]
stmmac: move driver data setting into stmmac_dvr_probe
Move setting of driver data into stmmac_dvr_probe so the
other probe functions don't have to. This will help to
simplify the other probe functions later.
Signed-off-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 20 May 2015 17:59:01 +0000 (10:59 -0700)]
inet_hashinfo: remove bsocket counter
We no longer need bsocket atomic counter, as inet_csk_get_port()
calls bind_conflict() regardless of its value, after commit 2b05ad33e1e624e ("tcp: bind() fix autoselection to share ports")
This patch removes overhead of maintaining this counter and
double inet_csk_get_port() calls under pressure.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Marcelo Ricardo Leitner <mleitner@redhat.com> Cc: Flavio Leitner <fbl@redhat.com> Acked-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Baron [Wed, 20 May 2015 15:52:53 +0000 (15:52 +0000)]
tcp: ensure epoll edge trigger wakeup when write queue is empty
We currently rely on the setting of SOCK_NOSPACE in the write()
path to ensure that we wake up any epoll edge trigger waiters when
acks return to free space in the write queue. However, if we fail
to allocate even a single skb in the write queue, we could end up
waiting indefinitely.
Fix this by explicitly issuing a wakeup when we detect the condition
of an empty write queue and a return value of -EAGAIN. This allows
userspace to re-try as we expect this to be a temporary failure.
I've tested this approach by artificially making
sk_stream_alloc_skb() return NULL periodically. In that case,
epoll edge trigger waiters will hang indefinitely in epoll_wait()
without this patch.
Signed-off-by: Jason Baron <jbaron@akamai.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 20 May 2015 15:13:33 +0000 (17:13 +0200)]
net: sched: fix call_rcu() race on classifier module unloads
Vijay reported that a loop as simple as ...
while true; do
tc qdisc add dev foo root handle 1: prio
tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1
tc qdisc del dev foo root
rmmod cls_u32
done
... will panic the kernel. Moreover, he bisected the change
apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
locking to netlink_lookup() and seq walker").
The removal of synchronize_net() from the netlink socket
triggering the qdisc to be removed, seems to have uncovered
an RCU resp. module reference count race from the tc API.
Given that RCU conversion was done after e341694e3eb5 ("netlink:
Convert netlink_lookup() to use RCU protected hash table")
which added the synchronize_net() originally, occasion of
hitting the bug was less likely (not impossible though):
When qdiscs that i) support attaching classifiers and,
ii) have at least one of them attached, get deleted, they
invoke tcf_destroy_chain(), and thus call into ->destroy()
handler from a classifier module.
After RCU conversion, all classifier that have an internal
prio list, unlink them and initiate freeing via call_rcu()
deferral.
Meanhile, tcf_destroy() releases already reference to the
tp->ops->owner module before the queued RCU callback handler
has been invoked.
Subsequent rmmod on the classifier module is then not prevented
since all module references are already dropped.
By the time, the kernel invokes the RCU callback handler from
the module, that function address is then invalid.
One way to fix it would be to add an rcu_barrier() to
unregister_tcf_proto_ops() to wait for all pending call_rcu()s
to complete.
synchronize_rcu() is not appropriate as under heavy RCU
callback load, registered call_rcu()s could be deferred
longer than a grace period. In case we don't have any pending
call_rcu()s, the barrier is allowed to return immediately.
Since we came here via unregister_tcf_proto_ops(), there
are no users of a given classifier anymore. Further nested
call_rcu()s pointing into the module space are not being
done anywhere.
Only cls_bpf_delete_prog() may schedule a work item, to
unlock pages eventually, but that is not in the range/context
of cls_bpf anymore.
Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto") Fixes: 9888faefe132 ("net: sched: cls_basic use RCU") Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.r.fastabend@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Alexei Starovoitov <ast@plumgrid.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 21 May 2015 22:46:36 +0000 (18:46 -0400)]
Merge branch 'cxgb4-next'
Hariprasad Shenai says:
====================
cxgb4: Cleanup and update T4/T4 register ranges
This series cleans and optimizes setup_memwin function and also updates
T4/T5 adapter register ranges by removing incorrect register addresses
This patch series has been created against net-next tree and includes
patches on cxgb4 driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 21 May 2015 22:43:55 +0000 (18:43 -0400)]
Merge branch 'sfc-next'
Shradha Shah says:
====================
sfc: Get/Set MAC address and ndo_[set/get]_vf_* entrypoint functions
This is the second installment of patches towards supporting EF10 SRIOV.
This patch series implements the ndo_get_vf_config, ndo_set_vf_mac,
ndo_set_vf_vlan and ndo_set_vf_spoofcheck function callbacks for EF10.
This patch series also introduces privileges for the MCDI commands
based on which functions are allowed to call them, i.e. Link control
or primary function.
The patch series has been tested with and without CONFIG_SFC_SRIOV.
The ndo function callbacks are tested using ip link.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Wed, 20 May 2015 10:12:48 +0000 (11:12 +0100)]
sfc: set the MAC address using MC_CMD_VADAPTOR_SET_MAC
Add a set_mac_address() NIC-type function for EF10 only, and
use this to set the MAC address on the vadaptor. For Siena and
earlier, the MAC address continues to be set by MC_CMD_SET_MAC;
this is still called on EF10, and including a MAC address in
this command has no effect.
The sriov_mac_address_changed() NIC-type function is no longer
needed on EF10, but it is needed for Siena where it is used to
update the peer address of the PF for VFDI. Change this to use
the new set_mac_address function pointer.
efx_ef10_sriov_mac_address_changed() is no longer called, as VFs
will try to change the MAC address on their vadaptor rather than
trying to change to the context of the PF to alter the vport.
When a VF is running in direct passthrough mode with MAC spoofing
enabled, it will be able to change the MAC address on its vadaptor.
In this case, there is a link to the PF, so find the correct VF in
its ef10_vf array and update the MAC address.
ndo_set_mac_address() can be called during driver unload while
bonding, and in this case the device has already been stopped, so
don't call efx_net_open() to restart it after reconfiguration.
efx->port_enabled is set to false in efx_stop_port(), so it is
indicator of whether the device needs to be restarted.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 20 May 2015 10:12:13 +0000 (11:12 +0100)]
sfc: add ndo_set_vf_link_state() function for EF10
Exercised with
"ip link set <PF intf> vf <vf_i> state {auto|enable|disable}"
Sets the reporting policy for VF link state to either
- mirror physical link state
- always up
- always down
get VF link state mode in efx_ef10_sriov_get_vf_config
Exercised by
"ip link show <PF intf>";
output will include a line like
vf 0 MAC 12:34:56:78:9a:bc, link-state auto
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Wed, 20 May 2015 10:11:54 +0000 (11:11 +0100)]
sfc: add ndo_set_vf_vlan() function for EF10
The max vlan tags that can be offloaded is 2, including any upstream VLAN
aggregator. Currently there is no way for the net driver to know whether
the upstream vswitch (if any) is using vlan tags, so there is no way to
know how many tags we can request.
Along with the implementation for the ndo_set_vf_vlan callback, this patch
also adds 2 VLAN tags for the driver created VEB switch if possible, that
way it is possible to offload as many tags as are allowed.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Cooper [Wed, 20 May 2015 10:11:35 +0000 (11:11 +0100)]
sfc: Change entity reset on MC reboot to a new datapath-only reset.
Currently we do an entity reset when we detect an MC reboot.
This messes up SRIOV because it leaves VFs orphaned. The extra
reset is rather redundant anyway, since the MC reboot will have
basically reset everything.
This change replaces the entity reset after MC reboot with a
simpler datapath reset that reallocates resources but doesn't
perform the entity reset.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Wed, 20 May 2015 10:11:18 +0000 (11:11 +0100)]
sfc: Add ndo_get_vf_config() function for EF10
rtnetlink calls ndo_get_vf_config when compiling information
about a network interface, so that the VFs associated with a PF
can be listed (eg: ip link show).
Implement a response to this entry point and return PF-set MAC
address for VF in ndo_get_vf_config
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Cooper [Wed, 20 May 2015 10:10:41 +0000 (11:10 +0100)]
sfc: Initialise MCDI buffers to 0 on declaration.
In order to avoid MC bugs the flags field needs to be set to 0.
Instead of explicitly clearing out the flags individually, a
better way to do this is to memset the MCDI_BUF to 0.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Pieczko [Wed, 20 May 2015 10:10:20 +0000 (11:10 +0100)]
sfc: Enable a VF to get its own MAC address
A VF's MAC address is set by its parent PF and added to its vport.
To get this MAC address, the VF must use MC_CMD_ VPORT_GET_MAC_ADDRESSES.
In the current scheme, a VF's vport should only have one MAC address,
so warn if this is not the case.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 20 May 2015 10:10:03 +0000 (11:10 +0100)]
sfc: protect filter table against use-after-free
If MCDI timeouts are encountered during efx_ef10_filter_table_remove(),
an FLR will be queued, but efx->filter_state will still be kfree()d.
The queued FLR will then call efx_ef10_filter_table_restore(), which
will try to use efx->filter_state. This previously caused a panic.
This patch adds an rwsem to protect the existence of efx->filter_state,
separately from the spinlock protecting its contents. Users which can
race against efx_ef10_filter_table_remove() should down_read this rwsem.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Wed, 20 May 2015 10:09:46 +0000 (11:09 +0100)]
sfc: Store the efx_nic struct of the current VF in the VF data struct
Initialised in efx_probe_vf and removal is dealt with in
efx_ef10_remove.
vf->efx is needed in future patches to change the MAC address
of the VF via the parent PF, while the driver is bound to the
VF.
Example: ip link set dev vf NUM mac LLADDR
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 21 May 2015 21:20:55 +0000 (17:20 -0400)]
Merge branch 'rocker-transaction-fixes'
Simon Horman says:
====================
rocker: transaction fixes
this series addresses what appear to be errors in the handling of
prepare and then commit transactions in the rocker driver.
In all cases the problem is that data structures visible outside of
the transaction are modified during the prepare phase.
In the case of the first two patches this results in the kernel reporting a
BUG. I have noted test-cases in the change logs.
The third patch is also a bug fix, as noted by Toshiaki Makita,
however I have not been able to reliably reproduce the problem and
thus have not provided a test case.
The last patch is a correctness fix that does not fix a bug
that manifests as far as I can tell.
Changes: v3->v4
* All patches
- Add Jiri Pirko's ack
* "rocker: do not make neighbour entry changes when preparing transactions"
- Setting of entry values in all transaction phases
as suggested by Toshiaki Makita
* "rocker: make rocker_port_internal_vlan_id_{get,put}() non-transactional"
- Remove Fixes tag as I believe this is a correctness rather than a bug fix
Changes: v2->v3
* "rocker: do not make neighbour entry changes when preparing transactions"
- Correct inverted logic
- Added ack from Scott Feldman
Changes: v1->v2
* "rocker: do not make neighbour entry changes when preparing transactions"
- Revised changelog to reflect information from Toshiaki Makita
that there is a bug that can manifest
- Update address and ttl regardless of the value of the transaction state
* All other patches
- Added acks from Scott Feldman
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Thu, 21 May 2015 03:40:17 +0000 (12:40 +0900)]
rocker: make rocker_port_internal_vlan_id_{get, put}() non-transactional
The motivation for this is that rocker_port_internal_vlan_id_{get,put} appear
to only partially implement the transaction model: memory allocation
and freeing is transactional, but hash and bitmap manipulation is not.
The latter could be fixed, however, as it is not currently exercised
due to trans always being SWITCHDEV_TRANS_NONE it seems cleaner
to make rocker_port_internal_vlan_id_get non-transactional.
This problem was introduced by c4f20321d968 ("rocker: support
prepare-commit transaction model").
Found by inspection.
I do not believe that this change should have any run-time effect.
Acked-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Thu, 21 May 2015 03:40:16 +0000 (12:40 +0900)]
rocker: do not make neighbour entry changes when preparing transactions
rocker_port_ipv4_nh() and in turn rocker_port_ipv4_neigh() may be
be called with trans == SWITCHDEV_TRANS_PREPARE and then
trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_obj_set() via
fib_table_insert().
The first time that rocker_port_ipv4_nh() is called, with
trans == SWITCHDEV_TRANS_PREPARE, _rocker_neigh_add() adds a new entry to
the neigh table.
And the second time rocker_port_ipv4_nh() is called, with
trans == SWITCHDEV_TRANS_COMMIT, that entry is found. This causes
rocker_port_ipv4_nh() to believe it is not adding an entry and thus it
frees "entry", which is still present in rocker driver's neigh table.
This problem does not appear to affect deletion as my analysis is that
deletion is always performed with trans == SWITCHDEV_TRANS_NONE.
For completeness _rocker_neigh_{add,del,prepare} are updated not to
manipulate fib table entries if trans == SWITCHDEV_TRANS_PREPARE.
Fixes: c4f20321d968 ("rocker: support prepare-commit transaction model") Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Thu, 21 May 2015 03:40:15 +0000 (12:40 +0900)]
rocker: do not modify fdb table in rocker_port_fdb() when preparing transactions
rocker_port_fdb_flush() may be called be called with
trans == SWITCHDEV_TRANS_PREPARE and then trans == SWITCHDEV_TRANS_COMMIT from
switchdev_port_attr_set() via switchdev_port_obj_add().
Adding the new entry to the FDB table when trans == SWITCHDEV_TRANS_PREPARE
may result in a memory leak because when trans == SWITCHDEV_TRANS_PREPARE
rocker_flow_tbl_bridge() will allocate memory when called via
rocker_port_fdb_learn(). However, when trans == SWITCHDEV_TRANS_COMMIT
the presence of the FDB entry in the FDB table causes
rocker_port_fdb() to set the ROCKER_OP_FLAG_REFRESH flag which results
in rocker_port_fdb_learn() skipping the call to rocker_flow_tbl_bridge()
which would free the memory allocated by it when
trans == SWITCHDEV_TRANS_PREPARE.
The above is resolved by not adding the new FDB entry to the FDB table
if trans == SWITCHDEV_TRANS_PREPARE.
For symmetry this patch also skips deleting FDB entries from the FDB
table trans == SWITCHDEV_TRANS_PREPARE. However, my analysis is that
this never occurs as trans is always SWITCHDEV_TRANS_NONE when removing
FDB entries.
Fixes: c4f20321d968 ("rocker: support prepare-commit transaction model") Acked-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Thu, 21 May 2015 03:40:14 +0000 (12:40 +0900)]
rocker: do not delete fdb entries in rocker_port_fdb_flush() when preparing transactions
rocker_port_fdb_flush() is called by rocker_port_stp_update() which in
turn may be called with trans == SWITCHDEV_TRANS_PREPARE and then
trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_attr_set() via
br_set_state().
When rocker_port_fdb_flush() is called with trans == SWITCHDEV_TRANS_PREPARE
it calls rocker_port_fdb_learn() for each entry in the FDB table which in
turn calls rocker_flow_tbl_bridge() which will allocate memory using
rocker_port_kzalloc(). rocker_port_fdb_learn() will then remove the entry
from the FDB table.
Then when rocker_port_fdb_learn() is called with
trans == SWITCHDEV_TRANS_PREPARE no calls are made to rocker_port_fdb_learn()
because there are no longer any entries present in the FDB table. Thus the
memory previously allocated by rocker_port_fdb_learn() is leaked resulting
in the kernel BUG() below.
Furthermore, it looks like the driver ends up with an incorrect view of the
fdb table as the FDB entries are purged from the driver's table but not the
hardware's table.
Fixes: c4f20321d968 ("rocker: support prepare-commit transaction model") Acked-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
Use cases:
- simplify complex programs
- dispatch into other programs
(for example: index in jump table can be syscall number or network protocol)
- build dynamic chains of programs
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
patch 1 - support bpf_tail_call() in interpreter
patch 2 - support in x64 JIT
We've discussed what's neccessary to support it in arm64/s390 JITs
and it looks fine.
patch 3 - sample example for tracing
patch 4 - sample example for networking
More details in every patch.
This set went through several iterations of reviews/fixes and older
attempts can be seen:
https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/log/?h=tail_call_v[123456]
- tail_call_v1 does it without touching JITs but introduces overhead
for all programs that don't use this helper function.
- tail_call_v2 still has some overhead and x64 JIT does full stack
unwind (prologue skipping optimization wasn't there)
- tail_call_v3 reuses 'call' instruction encoding and has interpreter
overhead for every normal call
- tail_call_v4 fixes above architectural shortcomings and v5,v6 fix few
more bugs
This last tail_call_v6 approach seems to be the best.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
this example is similar to sockex2 in a way that it accumulates per-flow
statistics, but it does packet parsing differently.
sockex2 inlines full packet parser routine into single bpf program.
This sockex3 example have 4 independent programs that parse vlan, mpls, ip, ipv6
and one main program that starts the process.
bpf_tail_call() mechanism allows each program to be small and be called
on demand potentially multiple times, so that many vlan, mpls, ip in ip,
gre encapsulations can be parsed. These and other protocol parsers can
be added or removed at runtime. TLVs can be parsed in similar manner.
Note, tail_call_cnt dynamic check limits the number of tail calls to 32.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
kprobe example that demonstrates how future seccomp programs may look like.
It attaches to seccomp_phase1() function and tail-calls other BPF programs
depending on syscall number.
Existing optimized classic BPF seccomp programs generated by Chrome look like:
if (sd.nr < 121) {
if (sd.nr < 57) {
if (sd.nr < 22) {
if (sd.nr < 7) {
if (sd.nr < 4) {
if (sd.nr < 1) {
check sys_read
} else {
if (sd.nr < 3) {
check sys_write and sys_open
} else {
check sys_close
}
}
} else {
} else {
} else {
} else {
} else {
}
the future seccomp using native eBPF may look like:
bpf_tail_call(&sd, &syscall_jmp_table, sd.nr);
which is simpler, faster and leaves more room for per-syscall checks.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue, so the callee program reuses the same stack.
The logic can be roughly expressed in C like:
u32 tail_call_cnt;
void *jumptable[2] = { &&label1, &&label2 };
int bpf_prog1(void *ctx)
{
label1:
...
}
int bpf_prog2(void *ctx)
{
label2:
...
}
int bpf_prog1(void *ctx)
{
...
if (tail_call_cnt++ < MAX_TAIL_CALL_CNT)
goto *jumptable[index]; ... and pass my 'ctx' to callee ...
... fall through if no entry in jumptable ...
}
Note that 'skip current program epilogue and next program prologue' is
an optimization. Other JITs don't have to do it the same way.
>From safety point of view it's valid as well, since programs always
initialize the stack before use, so any residue in the stack left by
the current program is not going be read. The same verifier checks are
done for the calls from the kernel into all bpf programs.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: allow bpf programs to tail-call other bpf programs
introduce bpf_tail_call(ctx, &jmp_table, index) helper function
which can be used from BPF programs like:
int bpf_prog(struct pt_regs *ctx)
{
...
bpf_tail_call(ctx, &jmp_table, index);
...
}
that is roughly equivalent to:
int bpf_prog(struct pt_regs *ctx)
{
...
if (jmp_table[index])
return (*jmp_table[index])(ctx);
...
}
The important detail that it's not a normal call, but a tail call.
The kernel stack is precious, so this helper reuses the current
stack frame and jumps into another BPF program without adding
extra call frame.
It's trivially done in interpreter and a bit trickier in JITs.
In case of x64 JIT the bigger part of generated assembler prologue
is common for all programs, so it is simply skipped while jumping.
Other JITs can do similar prologue-skipping optimization or
do stack unwind before jumping into the next program.
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
Since all BPF programs are idenitified by file descriptor, user space
need to populate the jmp_table with FDs of other BPF programs.
If jmp_table[index] is empty the bpf_tail_call() doesn't jump anywhere
and program execution continues as normal.
New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
populate this jmp_table array with FDs of other bpf programs.
Programs can share the same jmp_table array or use multiple jmp_tables.
The chain of tail calls can form unpredictable dynamic loops therefore
tail_call_cnt is used to limit the number of calls and currently is set to 32.
Use cases: Acked-by: Daniel Borkmann <daniel@iogearbox.net>
==========
- simplify complex programs by splitting them into a sequence of small programs
- dispatch routine
For tracing and future seccomp the program may be triggered on all system
calls, but processing of syscall arguments will be different. It's more
efficient to implement them as:
int syscall_entry(struct seccomp_data *ctx)
{
bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
... default: process unknown syscall ...
}
int sys_write_event(struct seccomp_data *ctx) {...}
int sys_read_event(struct seccomp_data *ctx) {...}
syscall_jmp_table[__NR_write] = sys_write_event;
syscall_jmp_table[__NR_read] = sys_read_event;
For networking the program may call into different parsers depending on
packet format, like:
int packet_parser(struct __sk_buff *skb)
{
... parse L2, L3 here ...
__u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
... default: process unknown protocol ...
}
int parse_tcp(struct __sk_buff *skb) {...}
int parse_udp(struct __sk_buff *skb) {...}
ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
- for TC use case, bpf_tail_call() allows to implement reclassify-like logic
- bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
are atomic, so user space can build chains of BPF programs on the fly
Implementation details:
=======================
- high performance of bpf_tail_call() is the goal.
It could have been implemented without JIT changes as a wrapper on top of
BPF_PROG_RUN() macro, but with two downsides:
. all programs would have to pay performance penalty for this feature and
tail call itself would be slower, since mandatory stack unwind, return,
stack allocate would be done for every tailcall.
. tailcall would be limited to programs running preempt_disabled, since
generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
need to be either global per_cpu variable accessed by helper and by wrapper
or global variable protected by locks.
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue.
- bpf_prog_array_compatible() ensures that prog_type of callee and caller
are the same and JITed/non-JITed flag is the same, since calling JITed
program from non-JITed is invalid, since stack frames are different.
Similarly calling kprobe type program from socket type program is invalid.
- jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
abstraction, its user space API and all of verifier logic.
It's in the existing arraymap.c file, since several functions are
shared with regular array map.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>