1. mitigate a case of lock contention
2. avoid exporting resource exhaustion to other sockets,
by migrating only to a victim socket that has ample room
3. avoid reordering of most flows on the socket,
by migrating first the flow responsible for load imbalance
4. help processes detect load imbalance,
by exporting rollover counters
Context: rollover implements flow migration in packet socket fanout
groups in case of extreme load imbalance. It is a specific
implementation of migration that minimizes reordering by selecting
the same victim socket when possible (and by selecting subsequent
victims in a round robin fashion, from which its name derives).
Changes:
v2 -> v3:
- statistics: replace unsigned long with __aligned_u64
v1 -> v2:
- huge flow detection: run lockless
- huge flow detection: replace stored index with random
- contention avoidance: test in packet_poll while lock held
- contention avoidance: clear pressure sooner
packet_poll and packet_recvmsg would clear only if the sock
is empty to avoid taking the necessary lock. But,
* packet_poll already holds this lock, so a lockless variant
__packet_rcv_has_room is cheap.
* packet_recvmsg is usually called only for non-ring sockets,
which also runs lockless.
- preparation: drop "single return" patch
packet_rcv_has_room is now a locked wrapper around
__packet_rcv_has_room, achieving the same (single footer).
The benchmark mentioned in the patches is at
https://github.com/wdebruij/kerneltools/blob/master/tests/bench_rollover.c
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 12 May 2015 15:56:49 +0000 (11:56 -0400)]
packet: rollover huge flows before small flows
Migrate flows from a socket to another socket in the fanout group not
only when the socket is full. Start migrating huge flows early, to
divert possible 4-tuple attacks without affecting normal traffic.
Introduce fanout_flow_is_huge(). This detects huge flows, which are
defined as taking up more than half the load. It does so cheaply, by
storing the rxhashes of the N most recent packets. If over half of
these are the same rxhash as the current packet, then drop it. This
only protects against 4-tuple attacks. N is chosen to fit all data in
a single cache line.
Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
Willem de Bruijn [Tue, 12 May 2015 15:56:48 +0000 (11:56 -0400)]
packet: rollover lock contention avoidance
Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive
especially if the packet sockets use rings, when a lock has to be
acquired.
Avoid pounding on the lock by all sockets by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While set, only the socket owner may call packet_rcv_has_room on the
socket. Once it detects normal conditions, it clears the flag. The
socket is not used as a victim by any other socket in the meantime.
Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.
Tested:
Ran bench_rollover: a process with 8 sockets in a single fanout
group, each pinned to a single cpu that receives one nic recv
interrupt. RPS and RFS are disabled. The benchmark uses packet
rx_ring, which has to take a lock when determining whether a
socket has room.
Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
uniformly across the packet sockets (and inserted an iptables
rule to drop in PREROUTING to avoid protocol stack processing).
Without this patch, all sockets try to migrate traffic to
neighbors, causing lock contention when searching for a non-
empty neighbor. The lock is the top 9 entries.
Willem de Bruijn [Tue, 12 May 2015 15:56:47 +0000 (11:56 -0400)]
packet: rollover only to socket with headroom
Only migrate flows to sockets that have sufficient headroom, where
sufficient is defined as having at least 25% empty space.
The kernel has three different buffer types: a regular socket, a ring
with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
latter two do not expose a read pointer to the kernel, so headroom is
not computed easily. All three needs a different implementation to
estimate free space.
Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
bench_rollover has as many sockets as there are NIC receive queues
in the system. Each socket is owned by a process that is pinned to
one of the receive cpus. RFS is disabled. RPS is enabled with an
identity mapping (cpu x -> cpu x), to count drops with softnettop.
lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
Press [Enter] to exit
Willem de Bruijn [Tue, 12 May 2015 15:56:45 +0000 (11:56 -0400)]
packet: rollover prepare: move code out of callsites
packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
logic into the callee to simplify these callsites, especially with
upcoming changes.
The main differences between the two callsites is that the FLAG
variant tests whether the socket previously selected by another
mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
rollover mode has no original socket to test.
Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 12 May 2015 13:31:48 +0000 (06:31 -0700)]
ipv4: __ip_local_out_sk() is static
__ip_local_out_sk() is only used from net/ipv4/ip_output.c
net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not
declared. Should it be static?
Fixes: 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn().") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 12 May 2015 13:22:56 +0000 (06:22 -0700)]
tcp/dccp: tw_timer_handler() is static
tw_timer_handler() is only used from net/ipv4/inet_timewait_sock.c
Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 May 2015 19:19:48 +0000 (15:19 -0400)]
Merge branch 'cls_flower'
Jiri Pirko says:
====================
introduce programable flow dissector and cls_flower
Per Davem's request, I prepared this patchset which introduces programmable
flow dissector. For current users of flow_keys, there is a wrapper
skb_flow_dissect_flow_keys which maintains the previous behaviour.
For purposes of cls_flower, couple of new dissection keys were introduced.
Note that this dissector can be also eventually used by openvswitch code.
Also, as a next step, I plan to get rid of *skb_flow_get_ports(export)
and *__skb_get_poff as their functionality can be now implemented by
skb_flow_dissect as well.
v2->v3:
- remove TCA_FLOWER_POLICE attr suggested by Jamal
v1->v2:
- move __skb_tx_hash rather to dev.c as suggested by Alex
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Tue, 12 May 2015 12:56:21 +0000 (14:56 +0200)]
tc: introduce Flower classifier
This patch introduces a flow-based filter. So far, the very essential
packet fields are supported.
This patch is only the first step. There is a lot of potential performance
improvements possible to implement. Also a lot of features are missing
now. They will be addressed in follow-up patches.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Tue, 12 May 2015 12:56:08 +0000 (14:56 +0200)]
flow_dissector: remove unused function flow_get_hlen declaration
commit 56193d1bce ("net: Add function for parsing the header length out
of linear ethernet frames") added this function declaration but it is
defined nowhere.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
When the NIC doesn't support PTP, probe-time MCDI commands fail in
predictable ways. Instead of logging cryptic MCDI errors, just log that
PTP isn't supported.
v2: Hopefully stop Thunderbird mangling the patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Tue, 12 May 2015 12:05:09 +0000 (13:05 +0100)]
sfc: suppress some MCDI error messages in PTP
Also, remove a needless netif_err() from efx_ptp_update_stats() - if the
MCDI fails it'll print its own error message, we don't need another that
adds no information.
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Mon, 11 May 2015 17:50:41 +0000 (19:50 +0200)]
net: sched: use counter to break reclassify loops
Seems all we want here is to avoid endless 'goto reclassify' loop.
tc_classify_compat even resets this counter when something other
than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't
break hypothetical loops induced by something other than perpetual
TC_ACT_RECLASSIFY return values.
skb_act_clone is now identical to skb_clone, so just use that.
Tested with following (bogus) filter:
tc filter add dev eth0 parent ffff: \
protocol ip u32 match u32 0 0 police rate 10Kbit burst \
64000 mtu 1500 action reclassify
Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
1) qca_spi.c renamed the local variable used for the SPI device
from spi_device to spi, meanwhile the spi_set_drvdata() call
got moved further up in the probe function.
2) Two changes were both adding new members to codel params
structure, and thus we had overlapping changes to the
initializer function.
3) 'net' was making a fix to sk_release_kernel() which is
completely removed in 'net-next'.
4) In net_namespace.c, the rtnl_net_fill() call for GET operations
had the command value fixed, meanwhile 'net-next' adjusted the
argument signature a bit.
This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Wed, 13 May 2015 18:16:50 +0000 (11:16 -0700)]
switchdev: don't use anonymous union on switchdev attr/obj structs
Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions
which was causing build issues on the newly added switchdev attr/obj
structs. Fix this by using named union on structs.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 May 2015 16:26:28 +0000 (12:26 -0400)]
Merge branch 'switchdev-cleanups'
Scott Feldman says:
====================
switchdev: more (minor) cleanups
Fix some sparse warnings and include some documentation review comments that
didn't get picked up in the switchdev Spring Cleanup series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from
Avri Altman.
2) Use the correct FW API for scan completions in iwlwifi, from Avraham
Stern.
3) FW monitor in iwlwifi accidently uses unmapped memory, fix from Liad
Kaufman.
4) rhashtable conversion of mac80211 station table was buggy, the
virtual interface was not taken into account. Fix from Johannes
Berg.
5) Fix deadlock in rtlwifi by not using a zero timeout for
usb_control_msg(), from Larry Finger.
6) Update reordering state before calculating loss detection, from
Yuchung Cheng.
7) Fix off by one in bluetooth firmward parsing, from Dan Carpenter.
8) Fix extended frame handling in xiling_can driver, from Jeppe
Ledet-Pedersen.
9) Fix CODEL packet scheduler behavior in the presence of TSO packets,
from Eric Dumazet.
10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck.
11) macvlan needs to propagate promisc settings down the the lower
device, from Vlad Yasevich.
12) igb driver can oops when changing number of rings, from Toshiaki
Makita.
13) Source specific default routes not handled properly in ipv6, from
Markus Stenberg.
14) Use after free in tc_ctl_tfilter(), from WANG Cong.
15) Use softirq spinlocking in netxen driver, from Tony Camuso.
16) Two ARM bpf JIT fixes from Nicolas Schichan.
17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from
Mathias Kretschmer.
18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei
Starovoitov.
19) ll_temac driver DMA maps TX packet header with incorrect length, fix
from Michal Simek.
20) We removed pm_qos bits from netdevice.h, but some indirect
references remained. Kill them. From David Ahern.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits)
net: Remove remaining remnants of pm_qos from netdevice.h
e1000e: Add pm_qos header
net: phy: micrel: Fix regression in kszphy_probe
net: ll_temac: Fix DMA map size bug
x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions
netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
Update be2net maintainers' email addresses
net_sched: gred: use correct backlog value in WRED mode
pppoe: drop pppoe device in pppoe_unbind_sock_work
net: qca_spi: Fix possible race during probe
net: mdio-gpio: Allow for unspecified bus id
af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT).
bnx2x: limit fw delay in kdump to 5s after boot
ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits.
ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction.
mpls: Change reserved label names to be consistent with netbsd
usbnet: avoid integer overflow in start_xmit
netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2)
net: xgene_enet: Set hardware dependency
net: amd-xgbe: Add hardware dependency
...
David Ahern [Tue, 12 May 2015 15:36:59 +0000 (09:36 -0600)]
e1000e: Add pm_qos header
Commit e2c6544829f moved pm_qos_req to e1000_adapter. Add the header file
that defines the struct.
Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Holzheu [Tue, 12 May 2015 05:22:44 +0000 (22:22 -0700)]
test_bpf: add 173 new testcases for eBPF
add an exhaustive set of eBPF tests bringing total to:
test_bpf: Summary: 233 PASSED, 0 FAILED, [0/226 JIT'ed]
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Brenden Blanco [Tue, 12 May 2015 04:25:51 +0000 (21:25 -0700)]
samples/bpf: fix in-source build of samples with clang
in-source build of 'make samples/bpf/' was incorrectly
using default compiler instead of invoking clang/llvm.
out-of-source build was ok.
Fixes: a80857822b0c ("samples: bpf: trivial eBPF program in C") Signed-off-by: Brenden Blanco <bblanco@plumgrid.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions
FROM_BE16:
'ror %reg, 8' doesn't clear upper bits of the register,
so use additional 'movzwl' insn to zero extend 16 bits into 64
FROM_LE16:
should zero extend lower 16 bits into 64 bit
FROM_LE32:
should zero extend lower 32 bits into 64 bit
Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4/cxgb4vf: Cleanup macros, add comments and add new MACROS
Cleanup few MACROS left out in t4_hw.h to be consistent with the
existing ones. Also replace few hardcoded values with MACROS. Also
update comments for some code
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
KY Srinivasan [Mon, 11 May 2015 22:39:46 +0000 (15:39 -0700)]
hv_netvsc: Use the xmit_more skb flag to optimize signaling the host
Based on the information given to this driver (via the xmit_more skb flag),
we can defer signaling the host if more packets are on the way. This will help
make the host more efficient since it can potentially process a larger batch of
packets. Implement this optimization.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
pkt_gen->last_ok was not set properly, so after the first burst
pktgen instead of allocating new packet, will reuse old one, advance
eth_type_trans further, which would mean the stack will be seeing very
short bogus packets.
Fixes: 62f64aed622b ("pktgen: introduce xmit_mode '<start_xmit|netif_receive>'") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 11 May 2015 22:12:41 +0000 (15:12 -0700)]
net: systemport: Implement TX coalescing control knobs
Add the ability to configure both 'tx-frames' which controls how many frames
are doing to trigger a single interrupt and 'tx-usecs' which dictates how long
to wait before an interrupt should be services.
Since our timer resolution is close to 8.192 us, we round up to the nearest
value the 'tx-usecs' timeout value.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Denys Vlasenko [Mon, 11 May 2015 19:17:53 +0000 (21:17 +0200)]
net: deinline netif_tx_stop_all_queues(), remove WARN_ON in netif_tx_stop_queue()
These functions compile to 60 bytes of machine code each.
With this .config: http://busybox.net/~vda/kernel_config
there are 617 calls of netif_tx_stop_queue()
and 49 calls of netif_tx_stop_all_queues() in vmlinux.
To fix this, remove WARN_ON in netif_tx_stop_queue()
as suggested by davem, and deinline netif_tx_stop_all_queues().
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: Alexei Starovoitov <alexei.starovoitov@gmail.com> CC: Alexander Duyck <alexander.duyck@gmail.com> CC: Joe Perches <joe@perches.com> CC: David S. Miller <davem@davemloft.net> CC: Jiri Pirko <jpirko@redhat.com> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Justin Cormack [Mon, 11 May 2015 19:00:10 +0000 (20:00 +0100)]
macvtap add missing ioctls - fix wrapping
The macvtap driver tries to emulate all the ioctls supported by a normal
tun/tap driver, however it was missing the generic SIOCGIFHWADDR and
SIOCSIFHWADDR ioctls to get and set the mac address that are supported
by tun/tap. This patch adds these.
Signed-off-by: Justin Cormack <justin@netbsd.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 12 May 2015 23:02:06 +0000 (16:02 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"One build fix for build breakage of all MIPS SMP kernels caused by
Rusty's fix of obsolete use of cpu mask helpers, another to fix the FP
ABI selection when loading an ELF binary"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: fix FP mode selection in lieu of .MIPS.abiflags data
MIPS: SMP: Fix build error.
Linus Torvalds [Tue, 12 May 2015 22:54:54 +0000 (15:54 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
- update MAINTAINERS git repo pointer
- printk garbage fix
- fix for qib and iw_cxgb4 bugs introduced in 4.1 window
- fix for an older iWARP netlink bug
- fix a memcpy issue in ehca driver
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
infiniband: Remove duplicated KERN_<LEVEL> from pr_<level> uses
IB/qib: fix test of unsigned variable
RDMA/core: Fix for parsing netlink string attribute
MAINTAINERS: update the official rdma git repo
iw_cxgb4: use wildcard mapping for getting remote addr info
IB/ehca: use correct destination for memcpy
Nicolas Dichtel [Mon, 11 May 2015 13:57:31 +0000 (15:57 +0200)]
netns: return RTM_NEWNSID instead of RTM_GETNSID on a get
Usually, RTM_NEWxxx is returned on a get (same as a dump).
Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 12 May 2015 22:49:29 +0000 (15:49 -0700)]
Merge tag 'for-v4.1-rc' of git://git.infradead.org/battery-2.6
Pull power supply and reset fixes from Sebastian Reichel:
"misc fixes"
* tag 'for-v4.1-rc' of git://git.infradead.org/battery-2.6:
power: bq27x00_battery: Add missing MODULE_ALIAS
power: reset: Add MFD_SYSCON depends for brcmstb
power: reset: ltc2952: Remove bogus hrtimer_start() return value checks
power_supply: fix oops in collie_battery driver
power/reset: at91: fix return value check in at91_reset_platform_probe()
MAINTAINERS: Add me as maintainer of Nokia N900 power supply drivers
axp288_fuel_gauge: Add original author details
David S. Miller [Tue, 12 May 2015 22:43:56 +0000 (18:43 -0400)]
Merge branch 'switchdev_spring_cleanup'
Scott Feldman says:
====================
switchdev: spring cleanup
v7:
Address review comments:
- [Jiri] split the br_setlink and br_dellink reverts into their own patches
- [Jiri] some parameter cleanup of rocker's memory allocators
- [Jiri] pass trans mode as formal parameter rather than hanging off of
rocker_port.
v6:
Address review comments:
- [Jiri] split a couple of patches into one-logical-change per patch
- [Joe Perches] revert checkpatch -f changes for wrapped lines with long
symbols.
v5:
Address review comments:
- [Jiri] include Jiri's s/swdev/switchdev rename patches up front.
- [Jiri] squash some patches. Now setlink/dellink/getlink patches are in
three parts: new implementation, convert drivers to new, delete old impl.
- [Jiri] some minor variable renames
- [Jiri] use BUG_ON rather than WARN when COMMIT phase fails when PREPARE
phase said it was safe to come into the water.
- [Simon] rocker: fix a few transaction prepare-commit cases that were wrong.
This was the bulk of the changes in v5.
v4:
Well, it was a lot of work, but now prepare-commit transaction model is how
davem advises: if prepare fails, abort the transaction. The driver must do
resource reservations up front in prepare phase and return those resources if
aborting. Commit phase would use reserved resources. The good news is the
driver code (for rocker) now handles resource allocation failures better by not
leaving partially device or driver states. This is a side-effect of the
prepare phase where state isn't modified; only validation of inputs and
resource reservations happen in the prepare phase. Since we're supporting
setting attrs and add objs across lower devs in the stacked case, we need to
hold rtnl_lock (or ensure rtnl_lock is held) so lower devs don't move on us
during the prepare-commit transaction. DSA driver code skips the prepare phase
and goes straight for the commit phase since no up-front allocations are done
and no device failures (that could be detected in the prepare phase) can
happen.
Remove NETIF_F_HW_SWITCH_OFFLOAD from rocker and the swdev_attr_set/get
wrappers. DSA doesn't set NETIF_F_HW_SWITCH_OFFLOAD, so it can't be in
swdev_attr_set/get. rocker doesn't need it; or rather can't support
NETIF_F_HW_SWITCH_OFFLOAD being set/cleared at run-time after the device
port is already up and offloading L2/L3. NETIF_F_HW_SWITCH_OFFLOAD is still
left as a feature flag for drivers that can use it.
Drop the renaming patch for netdev_switch_notifier. Other renames are a
result of moving to the attr get/set or obj add/del model. Everything
but the netdev_switch_notifier is still prefixed with "swdev_".
v3:
Move to two-phase prepare-commit transaction model for attr set and obj add.
Driver gets a change in prepare phase to NACK transaction if lack of resources
or support in device.
v2:
Address review comments:
- [Jiri] squash a few related patches
- [Roopa] don't remove NETIF_F_HW_SWITCH_OFFLOAD
- [Roopa] address VLAN setlink/dellink
- [Ronen] print warning is attr set revert fails
Not address:
- Using something other than "swdev_" prefix
- Vendor extentions
The patch set grew a bit to not only support port attr get/set but also add
support for port obj add/del. Example of port objs are VLAN, FDB entries, and
FIB entries. The VLAN support now allows the swdev driver to get VLAN ranges
and flags like PVID and "untagged". Sridhar will be adding FDB obj support
in follow-on patch.
v1:
The main theme of this patch set is to cleanup swdev in preparation for
new features or fixes to be added soon. We have a pretty good idea now how
to handle stacked drivers in swdev, but there where some loose ends. For
example, if a set failed in the middle of walking the lower devs, we would
leave the system in an undefined state...there was no way to recover back to
the previous state. Speaking of sets, also recognize a pattern that most
swdev API accesses are gets or sets of port attributes, so go ahead and make
port attr get/set the central swdev API, and convert everything that is
set-ish/get-ish to this new API.
Features/fixes that should follow from this cleanup:
- solve the duplicate pkt forwarding issue
- get/set bridge attrs, like ageing_time, from/to device
- get/set more bridge port attrs from/to device
There are some rename cleanups tagging along at the end, to give swdev
consistent naming.
And finally, some much needed updates to the switchdev.txt documentation to
hopefully capture the state-of-the-art of swdev. Hopefully, we can do a better
job keeping this document up-to-date.
Tested with rocker, of course, to make sure nothing functional broke. There
are a couple minor tweaks to DSA code for getting switch ID and setting STP
updates to use new API, but not expecting amy breakage there.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:48:09 +0000 (09:48 -0700)]
switchdev: bring documentation up-to-date
Much need updated of switchdev documentation to cover what's been
implmented to-date. There are some XXX comments in the text for
unimplemented or broken items. I'd like to keep these in there (poor-man's
TODO list) and update the document once each issue is resolved.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:48:06 +0000 (09:48 -0700)]
switchdev: convert fib_ipv4_add/del over to switchdev_port_obj_add/del
The IPv4 FIB ops convert nicely to the switchdev objs and we're left with
only four switchdev ops: port get/set and port add/del. Other objs will
follow, such as FDB. So go ahead and convert IPv4 FIB over to switchdev
obj for consistency, anticipating more objs to come.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:48:04 +0000 (09:48 -0700)]
switchdev: add new switchdev_port_bridge_getlink
Like bridge_setlink, add switchdev wrapper to handle bridge_getlink and
call into port driver to get port attrs. For now, only BR_LEARNING and
BR_LEARNING_SYNC are returned. To add more, we'll probably want to break
away from ndo_dflt_bridge_getlink() and build the netlink skb directly in
the switchdev code.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:47:59 +0000 (09:47 -0700)]
bridge: restore br_setlink back to original
This is revert of:
commit 68e331c785b8 ("bridge: offload bridge port attributes to switch asic
if feature flag set")
Restore br_setlink back to original and don't call into SELF port driver.
rtnetlink.c:bridge_setlink() already does a call into port driver for SELF.
bridge set link cmd defaults to MASTER. From man page for bridge link set
cmd:
self link setting is configured on specified physical device
master link setting is configured on the software bridge (default)
The link setting has two values: the device-side value and the software
bridge-side value. These are independent and settable using the bridge
link set cmd by specifying some combination of [master] | [self].
Furthermore, the device-side and bridge-side settings have their own
initial value, viewable from bridge -d link show cmd.
Restoring br_setlink back to original makes rocker (the only in-kernel user
of SELF link settings) work as first implement: two-sided values.
It's true that when both MASTER and SELF are specified from the command,
two netlink notifications are generated, one for each side of the settings.
The user-space app can distiquish between the two notifications by
observing the MASTER or SELF flag.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:47:56 +0000 (09:47 -0700)]
switchdev: add new switchdev bridge setlink
Add new switchdev_port_bridge_setlink that can be used by drivers
implementing .ndo_bridge_setlink to set switchdev bridge attributes.
Basically turn the raw rtnl_bridge_setlink netlink into switchdev attr
sets. Proper netlink attr policy checking is done on the protinfo part of
the netlink msg.
Currently, for protinfo, only bridge port attrs BR_LEARNING and
BR_LEARNING_SYNC are parsed and passed to port driver.
For afspec, VLAN objs are passed so switchdev driver can set VLANs assigned
to SELF. To illustrate with iproute2 cmd, we have:
bridge vlan add vid 10 dev sw1p1 self master
To add VLAN 10 to port sw1p1 for both the bridge (master) and the device
(self).
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:47:52 +0000 (09:47 -0700)]
switchdev: introduce switchdev add/del obj ops
Like switchdev attr get/set, add new switchdev obj add/del. switchdev objs
will be things like VLANs or FIB entries, so add/del fits better for
objects than get/set used for attributes.
Use same two-phase prepare-commit transaction model as in attr set.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:47:51 +0000 (09:47 -0700)]
switchdev: convert STP update to switchdev attr set
STP update is just a settable port attribute, so convert
switchdev_port_stp_update to an attr set.
For DSA, the prepare phase is skipped and STP updates are only done in the
commit phase. This is because currently the DSA drivers don't need to
allocate any memory for STP updates and the STP update will not fail to HW
(unless something horrible goes wrong on the MDIO bus, in which case the
prepare phase wouldn't have been able to predict anyway).
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:47:50 +0000 (09:47 -0700)]
rocker: support prepare-commit transaction model
For rocker, support prepare-commit transaction model for setting attributes
(and for adding objects). This requires rocker to preallocate memory
needed for the commit up front in the prepare phase. Since rtnl_lock is
held between prepare-commit, store the allocated memory on a queue hanging
off of the rocker_port. Also, in prepare phase, do everything right up to
calling into HW. The same code paths are tranversed in the driver for both
prepare and commit phases. In some cases, any state modified in the
prepare phase must be reverted before returning so the commit phase makes
the same decisions.
As a consequence of holding rtnl_lock in process context for all attr sets
(and obj adds), all memory is GFP_KERNEL allocated and we don't need to
busy spin waiting for the device to complete the command. So the bulk of
this patch is simplifying the memory allocations to only use GFP_KERNEL and
to remove the nowait flag and busy spin loop.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:47:49 +0000 (09:47 -0700)]
switchdev: convert parent_id_get to switchdev attr get
Switch ID is just a gettable port attribute. Convert switchdev op
switchdev_parent_id_get to a switchdev attr.
Note: for sysfs and netlink interfaces, SWITCHDEV_ATTR_PORT_PARENT_ID is
called with SWITCHDEV_F_NO_RECUSE to limit switch ID user-visiblity to only
port netdevs. So when a port is stacked under bond/bridge, the user can
only query switch id via the switch ports, but not via the upper devices
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Sun, 10 May 2015 16:47:48 +0000 (09:47 -0700)]
switchdev: introduce get/set attrs ops
Add two new swdev ops for get/set switch port attributes. Most swdev
interactions on a port are gets or sets on port attributes, so rather than
adding ops for each attribute, let's define clean get/set ops for all
attributes, and then we can have clear, consistent rules on how attributes
propagate on stacked devs.
Add the basic algorithms for get/set attr ops. Use the same recusive algo
to walk lower devs we've used for STP updates, for example. For get,
compare attr value for each lower dev and only return success if attr
values match across all lower devs. For sets, set the same attr value for
all lower devs. We'll use a two-phase prepare-commit transaction model for
sets. In the first phase, the driver(s) are asked if attr set is OK. If
all OK, the commit attr set in second phase. A driver would NACK the
prepare phase if it can't set the attr due to lack of resources or support,
within it's control. RTNL lock must be held across both phases because
we'll recurse all lower devs first in prepare phase, and then recurse all
lower devs again in commit phase. If any lower dev fails the prepare
phase, we need to abort the transaction for all lower devs.
If lower dev recusion isn't desired, allow a flag SWITCHDEV_F_NO_RECURSE to
indicate get/set only work on port (lowest) device.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sun, 10 May 2015 16:47:47 +0000 (09:47 -0700)]
switchdev: s/swdev_/switchdev_/
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sun, 10 May 2015 16:47:46 +0000 (09:47 -0700)]
switchdev: s/netdev_switch_/switchdev_/ and s/NETDEV_SWITCH_/SWITCHDEV_/
Turned out that "switchdev" sticks. So just unify all related terms to use
this prefix.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ward [Sun, 10 May 2015 02:01:46 +0000 (22:01 -0400)]
net_sched: gred: add TCA_GRED_LIMIT attribute
In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop
parameters configured, then packets for the default VQ are not subjected
to RED and are only dropped if the queue is larger than the net_device's
tx_queue_len. This behavior is useful for WRED mode, since these packets
will still influence the calculated average queue length and (therefore)
the drop probability for all of the other VQs. However, for some drivers
tx_queue_len is zero. In other cases the user may wish to make the limit
the same for all VQs (including the default VQ with no drop parameters).
This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit,
in bytes, during qdisc setup. (This limit is in bytes to be consistent
with the drop parameters.) The default limit is the same as for a bfifo
queue (tx_queue_len * psched_mtu). If the drop parameters of any VQ are
configured with a smaller limit than the GRED queue limit, that VQ will
still observe the smaller limit instead.
Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
Mike Marciniszyn [Tue, 12 May 2015 17:42:42 +0000 (13:42 -0400)]
IB/qib: fix test of unsigned variable
Commit d4988623cc60 ("IB/qib: use arch_phys_wc_add()")
adjusted mtrr inititialization to use the new interface.
Unfortunately, the new interface returns a signed
value and the patch tested the unsigned wc_cookie.
Fix the issue by changing the type of wc_cookie to int. For
the success case the ret left at zero to avoid
a warning from the caller. For failure wc_cookie
is used as the ret.
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
RDMA/core: Fix for parsing netlink string attribute
The string iwpm_ulib_name is recorded in a nlmsg as a netlink attribute.
Without this fix parsing of the nlmsg by the userspace port mapper service fails
because of unknown attribute length, causing the port mapper service not to
register the client, which has sent the nlmsg.
Paul Burton [Wed, 6 May 2015 10:52:32 +0000 (11:52 +0100)]
MIPS: fix FP mode selection in lieu of .MIPS.abiflags data
Commit 46490b572544 ("MIPS: kernel: elf: Improve the overall ABI and FPU
mode checks") reworked the ELF FP ABI mode selection logic, but when
CONFIG_MIPS_O32_FP64_SUPPORT is enabled it breaks the use of binaries
which have no PT_MIPS_ABIFLAGS program header & associated
.MIPS.abiflags section.
A default mode is selected based upon whether the ELF contains MIPS32 or
MIPS64 code, but that selection is made in arch_elf_pt_proc.
arch_elf_pt_proc only executes when a PT_MIPS_ABIFLAGS program header is
found. If one is not found then arch_elf_pt_proc is never called, and no
default overall_fp_mode value is selected. When arch_check_elf is
called, both abi0 & abi1 are MIPS_ABI_FP_UNKNOWN which leads to both
prog_req & interp_req being set to none_req. none_req matches none of
the conditions for mode selection at the end of arch_check_elf, so
overall_fp_mode is left untouched. Finally once mips_set_personality_fp
is called the BUG() in the default case is then hit & the kernel likely
panics.
Fix this by moving the selection of a default overall mode to the start
of arch_check_elf, which runs once per ELF executed regardless of
whether it has a PT_MIPS_ABIFLAGS program header.
Sathya Perla [Tue, 12 May 2015 06:13:50 +0000 (02:13 -0400)]
Update be2net maintainers' email addresses
Emulex developers' email addresses are now "@avagotech" instead of
"@emulex". I'm also replacing Subbu with Padmanabh and Sriharsha in the
maintainers list. The driver's heading was outdated and did not include
some of the chip types (BE3, Lancer and Skyhawk) that the driver has
been supporting for a longtime. I've updated this too.
Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 12 May 2015 14:39:27 +0000 (10:39 -0400)]
Merge branch 'netdev_page_frags'
Alexander Duyck says:
====================
Refactor netdev page frags and move them into mm/
This patch series addresses several things.
First I found an issue in the performance of the pfmemalloc check from
build_skb. To work around it I have provided a cached copy of pfmemalloc
to be used in __netdev_alloc_skb and __napi_alloc_skb.
Second I moved the page fragment allocation logic into the mm tree and
added functionality for freeing page fragments. I had to fix igb before I
could do this as it was using a reference to NETDEV_FRAG_PAGE_MAX_SIZE
incorrectly.
Finally I went through and replaced all of the duplicate code that was
calling put_page and replaced it with calls to skb_free_frag.
With these changes in place a simple receive and drop test increased from a
packet rate of 8.9Mpps to 9.8Mpps. The gains breakdown as follows:
8.9Mpps Before 9.8Mpps After
------------------------ ------------------------
7.8% put_compound_page 9.1% __free_page_frag
3.9% skb_free_head
1.1% put_page
Alexander Duyck [Thu, 7 May 2015 04:12:20 +0000 (21:12 -0700)]
e1000: Replace e1000_free_frag with skb_free_frag
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 7 May 2015 04:12:03 +0000 (21:12 -0700)]
net: Add skb_free_frag to replace use of put_page in freeing skb->head
This change adds a function called skb_free_frag which is meant to
compliment the function netdev_alloc_frag. The general idea is to enable a
more lightweight version of page freeing since we don't actually need all
the overhead of a put_page, and we don't quite fit the model of __free_pages.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 7 May 2015 04:11:57 +0000 (21:11 -0700)]
mm/net: Rename and move page fragment handling from net/ to mm/
This change moves the __alloc_page_frag functionality out of the networking
stack and into the page allocation portion of mm. The idea it so help make
this maintainable by placing it with other page allocation functions.
Since we are moving it from skbuff.c to page_alloc.c I have also renamed
the basic defines and structure from netdev_alloc_cache to page_frag_cache
to reflect that this is now part of a different kernel subsystem.
I have also added a simple __free_page_frag function which can handle
freeing the frags based on the skb->head pointer. The model for this is
based off of __free_pages since we don't actually need to deal with all of
the cases that put_page handles. I incorporated the virt_to_head_page call
and compound_order into the function as it actually allows for a signficant
size reduction by reducing code duplication.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 7 May 2015 04:11:51 +0000 (21:11 -0700)]
net: Store virtual address instead of page in netdev_alloc_cache
This change makes it so that we store the virtual address of the page
in the netdev_alloc_cache instead of the page pointer. The idea behind
this is to avoid multiple calls to page_address since the virtual address
is required for every access, but the page pointer is only needed at
allocation or reset of the page.
While I was at it I also reordered the netdev_alloc_cache structure a bit
so that the size is always 16 bytes by dropping size in the case where
PAGE_SIZE is greater than or equal to 32KB.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 7 May 2015 04:11:45 +0000 (21:11 -0700)]
igb: Don't use NETDEV_FRAG_PAGE_MAX_SIZE in descriptor calculation
This change updates igb so that it will correctly perform the descriptor
count calculation. Previously it was taking NETDEV_FRAG_PAGE_MAX_SIZE
into account with isn't really correct since a different value is used to
determine the size of the pages used for TCP. That is actually determined
by SKB_FRAG_PAGE_ORDER.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>