Jacob Keller [Mon, 15 Jun 2015 22:00:52 +0000 (15:00 -0700)]
fm10k: remove comment about rtnl_lock around mbx operations
This comment is no longer true due to a couple of mailbox locking
refactors, and we now don't actually do any rtnl protected operations
directly in the mailbox path. Remove this comment as it is factually
incorrect and confusing.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next tree
in this 4.4 development cycle, they are:
1) Schedule ICMP traffic to IPVS instances, this introduces a new schedule_icmp
proc knob to enable/disable it. By default is off to retain the old
behaviour. Patchset from Alex Gartrell.
I'm also including what Alex originally said for the record:
"The configuration of ipvs at Facebook is relatively straightforward. All
ipvs instances bgp advertise a set of VIPs and the network prefers the
nearest one or uses ECMP in the event of a tie. For the uninitiated, ECMP
deterministically and statelessly load balances by hashing the packet
(usually a 5-tuple of protocol, saddr, daddr, sport, and dport) and using
that number as an index (basic hash table type logic).
The problem is that ICMP packets (which contain really important
information like whether or not an MTU has been exceeded) will get a
different hash value and may end up at a different ipvs instance. With no
information about where to route these packets, they are dropped, creating
ICMP black holes and breaking Path MTU discovery. Suddenly, my mom's
pictures can't load and I'm fielding midday calls that I want nothing to do
with.
To address this, this patch set introduces the ability to schedule icmp
packets which is gated by a sysctl net.ipv4.vs.schedule_icmp. If set to 0,
the old behavior is maintained -- otherwise ICMP packets are scheduled."
2) Add another proc entry to ignore tunneled packets to avoid routing loops
from IPVS, also from Alex.
3) Fifteen patches from Eric Biederman to:
* Stop passing nf_hook_ops as parameter to the hook and use the state hook
object instead all around the netfilter code, so only the private data
pointer is passed to the registered hook function.
* Now that we've got state->net, propagate the netns pointer to netfilter hook
clients to avoid its computation over and over again. A good example of how
this has been simplified is the former TEE target (now nf_dup infrastructure)
since it has killed the ugly pick_net() function.
There's another round of netns updates from Eric Biederman making the line. To
avoid the patchbomb again to almost all the networking mailing list (that is 84
patches) I'd suggest we send you a pull request with no patches or let me know
if you prefer a better way.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Stephen Rothwell [Tue, 22 Sep 2015 03:41:44 +0000 (20:41 -0700)]
drivers/net/ieee802154/at86rf230.c: seq_printf() now returns NULL
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Alexander Aring <alex.aring@gmail.com> Cc: Stefan Schmidt <stefan@osg.samsung.com> Cc: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bcmgenet: Remove duplicate test for tx_coalesce_usecs_high
We were checking twice for ec->tx_coalesce_usecs_high, remove the
duplicate test.
Reported-by: Julia Lawall <julia.lawall@lip6.fr> Reported-by: kbuild-all@01.org Fixes: 2f9130709d2c19 ("net: bcmgenet: Implement TX coalescing control knobs") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes TLP to use 1 sec timer by default when RTT is
not available due to SYN/ACK retransmission or SYN cookies.
Prior to this change, the lack of RTT prevents TLP so the first
data packets sent can only be recovered by fast recovery or RTO.
If the fast recovery fails to trigger the RTO is 3 second when
SYN/ACK is retransmitted. With this patch we can trigger fast
recovery in 1sec instead.
Note that we need to check Fast Open more properly. A Fast Open
connection could be (accepted then) closed before it receives
the final ACK of 3WHS so the state is FIN_WAIT_1. Without the
new check, TLP will retransmit FIN instead of SYN/ACK.
Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.
This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.
For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)
For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.
If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().
One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.
Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 21 Sep 2015 23:03:05 +0000 (16:03 -0700)]
Merge branch 's390-next'
Ursula Braun says:
====================
s390: qeth and iucv patches
here is version 2 of some s390 related qeth patches for net-next. The patch by
Thomas Richter adds a new feature to the qeth layer2 code; the remaining
patches are minor improvements.
Version 2 of patch 4 uses the desired indentation in function declarations
and definitions spanning multiple lines in almost all cases. Thomas run into a
conflict with the maximum number of columns once. Thus you will still see one
function definition using an earlier column before the opening paranthesis.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The iucv code uses arrays as arguments. Even though this does not
really cause a problem, it could be misleading, since the compiler
turns array arguments into just a pointer argument. To be more
precise this patch changes the array arguments into pointers.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Richter [Fri, 18 Sep 2015 14:06:51 +0000 (16:06 +0200)]
qeth: add layer 2 RX/TX checksum offloading
Checksum offloading for send and receive is already
supported for layer 3 (IP layer). This patch
adds support for RX and TX hardware checksum offloading
for layer 2 (MAC layer). The hardware calculates the checksum
for IP UDP and TCP packets.
This patch moves the hardware checksum offloading setup
to the set of common functions in qeth_core_main.c.
Layer 2 and layer 3 now simply call the same common functions.
Also note that TX checksum offloading is always enabled.
The device driver relies on the TCP/IP stack to make use of
this feature.
Signed-off-by: Thomas Richter <tmricht@linux.vnet.ibm.com> Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Reviewed-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
An OSA-Express port name was required to identify a shared OSA port.
All operating system instances that shared the port had to use the
same port name. This requirement no longer applies.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
User is not allowed to write into bridge_state sysfs file.
Fixed attribute not mislead the user
Signed-off-by: Lakhvich Dmitriy <ldmitriy@ru.ibm.com> Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Reported-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com> Reviewed-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Reviewed-by: Thomas Richter <tmricht@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Length specifier in the %pM format is not supported (at least, not
documented). Remove it, and also an extraneous '&' for the array.
Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com> Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Here's the first bluetooth-next pull request for the 4.4 kernel:
- ieee802154 cleanups & fixes
- debugfs support for the at86rf230 driver
- Support for quirky (seemingly counterfeit) CSR Bluetooth controllers
- Power management and device config improvements for Intel controllers
- Fix for devices with incorrect advertising data length
- Fix for closing HCI user channel socket
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
can: flexcan: enable interrupts atomically at the end of flexcan_chip_start()
This patch defers the writing of the interrupts bits of the CTRL register order
to enables all interrupts atomically at the the of the flexcan_chip_start()
function.
Suggested-by: Torsten Lang <torsten.lang@uweschneider.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
can: flexcan: use pointer to struct regs instead of void pointer for mmio address space
This patch renames the pointer to the mmio address space from "base" to "regs"
and changes the type from "void __iomem *" to "struct flexcan_regs __iomem *".
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch renames the "features" member of struct flexcan_devtype_data to
"quirks". The corresponding defines are renamed too, to reflect what they
actually do.
can: flexcan: flexcan_chip_start(): cleanup writing of reg_mcr
This patch changes the order the individual bits of the mcr register in
flexcan_chip_start() are or'ed together to match the datasheet. The inline
documentation is adjusted accordingly.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
David S. Miller [Mon, 21 Sep 2015 05:26:58 +0000 (22:26 -0700)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-09-17
This series contains updates to i40e and i40evf.
Shannon provides updates to i40e and i40evf to resolve an issue with the
nvmupdate utility. First renames a variable name to reduce confusion and
to differentiate it from the actual user variable. Then added the ability
to save the admin queue write back descriptor if a caller supplies a
buffer for it to be saved into. Added a new GetStatus command so that
the NVM update tool can query the current status instead of doing fake
write requests to probe for readiness. Added wait states to the NVM
update state machine to signify when waiting for an update operation to
finish, whether we are in the middle of a set of write operations, or we
are now idle but waiting. Then added a facility to run admin queue
commands through the NVM update utility in order to allow the update
tools to interact with the firmware and do special commands needed for
updates and configuration changes. Also added a facility to recover the
result of a previously run admin queue command.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace time_t type and get_seconds function which are not y2038 safe
on 32-bit systems. Function ktime_get_seconds use monotonic instead of
real time and therefore will not cause overflow.
Signed-off-by: Ksenija Stanojevic <ksenija.stanojevic@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Changes from V1:
1. Remove "inline" in C file (according to LKML comment, same in below).
2. Fix a bug about class_find_device.
3. Change the DTS pattern on hnae, restruct it to compatible with Hi1610 soc.
4. Unified hip04_mdio and hip05_mdio into hns_mdio, which is more usaul for
later SOCs.
net: add Hisilicon Network Subsystem basic ethernet support
This is to add basic ethernet support for HNS. It is one of the way to
use the HNS acceleration engine. But most of the decoding/encoding
capability of the AE cannot be used in this way.
This submit contains the basic feature as a ethernet driver. More will
be added later.
Signed-off-by: huangdaode <huangdaode@hisilicon.com> Signed-off-by: Kenneth Lee <liguozhu@huawei.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
DSAF, namely Distributed System Area Fabric, is one of the HNS
acceleration engine implementation. This patch add DSAF driver to the
system.
hns_ae_adapt: the adaptor for registering the driver to HNAE framework
hns_dsaf_mac: MAC cover interface for GE and XGE
hns_dsaf_gmac: GE (10/100/1000G Ethernet) MAC function
hns_dsaf_xgmac: XGE (10000+G Ethernet) MAC function
hns_dsaf_main: the platform device driver for the whole hardware
hns_dsaf_misc: some misc helper function, such as LED support
hns_dsaf_ppe: packet process engine function
hns_dsaf_rcb: ring buffer function
Signed-off-by: huangdaode <huangdaode@hisilicon.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: Kenneth Lee <liguozhu@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: add Hisilicon Network Subsystem hnae framework support
HNAE (Hisilicon Network Acceleration Engine) is a framework to provide a
unified ring buffer interface for Hisilicon Network Acceleration
Engines.
With the interface, upper layer can work as ethernet driver, ODP driver
or other service driver on purpose.
Signed-off-by: huangdaode <huangdaode@hisilicon.com> Signed-off-by: Kenneth Lee <liguozhu@huawei.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The MDIO support for Hisilicon Network Subsystem. It is used in Hislicon
hip04, hip05 and Hi1610 SoC to control the external PHY
Signed-off-by: huangdaode <huangdaode@hisilicon.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: Kenneth Lee <liguozhu@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: add Hisilicon Network Subsystem support (config and documents)
The Hisilicon Network Subsystem is a long term evolution IP which is
supposed to be used in Hisilicon ICT SoC. The IP, which is called hns
for short, is a TCP/IP acceleration engine, which can directly decode
TCP/IP stream and distribute them to different ring buffers.
HNS can be configured to work on different mode for different scenario.
This patch make use only some of the mode to make it as standard
ethernet NIC. The other mode will be added soon.
The whole function has 4 kernel sub-modules:
hnae: the HNS acceleration engine framework. It provides a abstract
interface between the engine and the upper layers which make use of the
engine by ring buffer.
hns_enet_drv: a standard ethernet driver that base on the ring buffer.
hns_dsaf: one of the implementation of HNS acceleration engine, which is
applied on Hililicon hip05, Hi1610 and other later-on SoCs
hns_mdio: the mdio control to the PHY, used by acceleration engine
This submit add basic config and documents
Signed-off-by: huangdaode <huangdaode@hisilicon.com> Signed-off-by: Kenneth Lee <liguozhu@huawei.com> Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
chas williams [Wed, 16 Sep 2015 20:28:25 +0000 (16:28 -0400)]
xen-netfront: always set num queues if possible
If netfront connects with two (or more) queues and then reconnects with
only one queue it fails to delete or rewrite the multi-queue-num-queues
key and netback will try to use the wrong number of queues.
Always write the num-queues field if the backend has multi-queue support.
Signed-off-by: Chas Williams <3chas3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: Pass priv instead of nf_hook_ops to netfilter hooks
Only pass the void *priv parameter out of the nf_hook_ops. That is
all any of the functions are interested now, and by limiting what is
passed it becomes simpler to change implementation details.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
act_connmark: Remember the struct net instead of guessing it.
Stop guessing the struct net instead of remember it. Guessing is just
silly and will be problematic in the future when I implement routes
between network namespaces.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- Add nft_pktinfo.pf to replace ops->pf
- Add nft_pktinfo.hook to replace ops->hooknum
This simplifies the code, makes it more readable, and likely reduces
cache line misses. Maintainability is enhanced as the details of
nft_hook_ops are of no concern to the recpients of nft_pktinfo.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
inet netfilter: Prefer state->hook to ops->hooknum
The values of nf_hook_state.hook and nf_hook_ops.hooknum must be the
same by definition.
We are more likely to access the fields in nf_hook_state over the
fields in nf_hook_ops so with a little luck this results in
fewer cache line misses, and slightly more consistent code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
inet netfilter: Remove hook from ip6t_do_table, arp_do_table, ipt_do_table
The values of ops->hooknum and state->hook are guaraneted to be equal
making the hook argument to ip6t_do_table, arp_do_table, and
ipt_do_table is unnecessary. Remove the unnecessary hook argument.
In the callers use state->hook instead of ops->hooknum for clarity and
to reduce the number of cachelines the callers touch.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: ebtables: Simplify the arguments to ebt_do_table
Nearly everything thing of interest to ebt_do_table is already present
in nf_hook_state. Simplify ebt_do_table by just passing in the skb,
nf_hook_state, and the table. This make the code easier to read and
maintenance easier.
To support this create an nf_hook_state on the stack in ebt_broute
(the only caller without a nf_hook_state already available). This new
nf_hook_state adds no new computations to ebt_broute, but does use a
few more bytes of stack.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Merge tag 'ipvs-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:
====================
IPVS Updates for v4.4
please consider these IPVS Updates for v4.4.
The updates include the following from Alex Gartrell:
* Scheduling of ICMP
* Sysctl to ignore tunneled packets; and hence some packet-looping scenarios
====================
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Bluetooth: Fix reporting incorrect EIR in device found mgmt event
Some remote devices (ie Gigaset G-Tag) misbehave with ADV data length.
This can lead to incorrect EIR format in device found event when
ADV_DATA and SCAN_RSP are merged (terminator field before SCAN_RSP
part).
Fix this by inspecting ADV_DATA and correct its length if terminator
is found.
> HCI Event: LE Meta Event (0x3e) plen 42 [hci0] 32.172182
LE Advertising Report (0x02)
Num reports: 1
Event type: Connectable undirected - ADV_IND (0x00)
Address type: Public (0x00)
Address: 7C:2F:80:94:97:5A (Gigaset Communications GmbH)
Data length: 30
Flags: 0x06
LE General Discoverable Mode
BR/EDR Not Supported
Company: Gigaset Communications GmbH (384)
Data: 021512348094975abbc5
16-bit Service UUIDs (partial): 1 entry
Battery Service (0x180f)
RSSI: -65 dBm (0xbf)
> HCI Event: LE Meta Event (0x3e) plen 27 [hci0] 32.172191
LE Advertising Report (0x02)
Num reports: 1
Event type: Scan response - SCAN_RSP (0x04)
Address type: Public (0x00)
Address: 7C:2F:80:94:97:5A (Gigaset Communications GmbH)
Data length: 15
Name (complete): Gigaset G-tag
RSSI: -59 dBm (0xc5)
Note "Data length: 30" in ADV_DATA which results in 9 extra zero bytes
after Battery Service UUID. Terminator field present in the middle of
EIR in Device Found event resulted in userspace stop parsing EIR and
skipping device name.
net: bcmgenet: Implement RX coalescing control knobs
Add support for the ethtool rx-frames coalescing parameter which allows
defining the number of RX interrupts per frames received. The RDMA
engine supports a configurable timeout with a resolution of
approximately 8.192 us.
We can no longer enable the BDONE/PDONE interrupts as those would
fire for each packet/buffer received, which would defeat the MBDONE
interrupt purpose. The MBDONE interrupt is guaranteed to correspond to a
PDONE/BDONE interrupt when the threshold is set to 1.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bcmgenet: Implement TX coalescing control knobs
Configuring the ethtool tx-frames property, which translates into N
packets before a TX interrupt is the simplest configuration scheme
because it requires no locking neither at the softare nor hardware
level, and is completely indepedent from the link speed. Since ethtool
does not allow per-tx queue coalescing parameters, we apply the same
setting to any transmit queue.
We can no longer enable the BDONE/PDONE interrupts as those would fire
for each packet/buffer received, which would defeat the MBDONE interrupt
purpose. The MBDONE interrupt is guaranteed to correspond to a
PDONE/BDONE interrupt when the threshold is set to 1, but offers
interrupt coalescing when the value is > 1.
Since the HW is configured to generate an interrupt when the ring
becomes emtpy, we have to deny any timeout/timer settings coming from
user-space to indicate we can only generate an interrupt very <N>
packets.
While we are at it, fix the DMA_INTR_THRESHOLD_MASK value which was off
by one bit (0xff vs. 0x1ff).
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thanks to Nikolay for noticing the uninitialized use amongst the maze of
gotos.
As Nikolay pointed out the second initialization is not required to fix
the oops, but rather to fix a related problem where a valid lookup should
be invalidated before creating the rth entry.
Fixes: b7503e0cdb5d ("net: Add FIB table id to rtable") Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Reported-by: Richard Alpe <richard.alpe@ericsson.com> Reported-by: Fabio Estevam <festevam@gmail.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
At plumbers we discussed different options on how to get rid of skb_clone
from bpf_clone_redirect(), the patch 2 implements the best option.
Patch 1 adds 'integrated exts' to cls_bpf to improve performance by
combining simple actions into bpf classifier.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.
Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps
and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps
To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps
For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 16 Sep 2015 06:05:42 +0000 (23:05 -0700)]
cls_bpf: introduce integrated actions
Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.
Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
/* classify arp, ip, ipv6 into different traffic classes
* and drop all other packets
*/
switch (skb->protocol) {
case htons(ETH_P_ARP):
skb->tc_classid = 1;
break;
case htons(ETH_P_IP):
skb->tc_classid = 2;
break;
case htons(ETH_P_IPV6):
skb->tc_classid = 3;
break;
default:
return TC_ACT_SHOT;
}
return TC_ACT_OK;
}
Joint work with Daniel Borkmann.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Sep 2015 22:24:20 +0000 (15:24 -0700)]
tcp: provide skb->hash to synack packets
In commit b73c3d0e4f0e ("net: Save TX flow hash in sock and set in skbuf
on xmit"), Tom provided a l4 hash to most outgoing TCP packets.
We'd like to provide one as well for SYNACK packets, so that all packets
of a given flow share same txhash, to later enable bonding driver to
also use skb->hash to perform slave selection.
Note that a SYNACK retransmit shuffles the tx hash, as Tom did
in commit 265f94ff54d62 ("net: Recompute sk_txhash on negative routing
advice") for established sockets.
This has nice effect making TCP flows resilient to some kind of black
holes, even at connection establish phase.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Mahesh Bandewar <maheshb@google.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Fri, 28 Aug 2015 21:55:51 +0000 (17:55 -0400)]
i40e/i40evf: add get AQ result command to nvmupdate utility
Add a facility to recover the result of a previously run AQ command.
Change-ID: I21afec2c20c1a5e6ba60c7fbfcbedfff78c10e45 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 28 Aug 2015 21:55:50 +0000 (17:55 -0400)]
i40e/i40evf: add exec_aq command to nvmupdate utility
Add a facility to run AQ commands through the nvmupdate utility in order
to allow the update tools to interact with the FW and do special
commands needed for updates and configuration changes.
Change-ID: I5c41523e4055b37f8e4ee479f7a0574368f4a588 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 28 Aug 2015 21:55:49 +0000 (17:55 -0400)]
i40e/i40evf: add wait states to NVM state machine
This adds wait states to the NVM update state machine to signify when
waiting for an update operation to finish, whether we're in the middle
of a set of Write operations, or we're now idle but waiting.
Change-ID: Iabe91d6579ef6a2ea560647e374035656211ab43 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 28 Aug 2015 21:55:48 +0000 (17:55 -0400)]
i40e/i40evf: add GetStatus command for nvmupdate
This adds a new GetStatus command so that the NVM update tool can query
the current status instead of doing fake write requests to probe for
readiness.
Change-ID: I671ec6ccd4dfc9dbac3a03b964589d693fda5cd8 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 28 Aug 2015 21:55:47 +0000 (17:55 -0400)]
i40e/i40evf: add handling of writeback descriptor
If the writeback descriptor buffer was previously created, this gives it
to the AQ command request to be used to save the results.
Change-ID: I8c8a1af81e6ebed6d0a15ed31697fe1a6c4e3708 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 27 Aug 2015 15:42:42 +0000 (11:42 -0400)]
i40e/i40evf: save aq writeback for future inspection
Add the ability to save the AdminQ write back descriptor if a
caller supplies a buffer for it to be saved into.
Change-ID: I3d1301d26360b39a2d66dc8569e851f54133a3af Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 23 Jul 2015 20:54:33 +0000 (16:54 -0400)]
i40e: rename variable to prevent clash of understanding
This code returns something that becomes the errno value from ethtool and
passes around a pointer to an errno variable. This patch changes the name
slightly to differentiate it from the actual user errno variable.
Change-ID: Idaa37845c069e66f4cea072e90f471bb2142454d Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Fri, 18 Sep 2015 00:18:38 +0000 (17:18 -0700)]
Merge branch 'nf_hook_netns'
Eric W. Biederman says:
====================
Passing net through the netfilter hooks
My primary goal with this patchset and it's follow ups is to cleanup the
network routing paths so that we do not look at the output device to
derive the network namespace. My plan is to pass the network namespace
of the transmitting socket through the output path, to replace code that
looks at the output network device today. Once that is done we can have
routes with output devices outside of the current network namespace.
Which should allow reception and transmission of packets in network
namespaces to be as fast as normal packet reception and transmission
with early demux disabled, because it will same code path.
Once skb_dst(skb)->dev is a little better under control I think it will
also be possible to use rcu to cleanup the ancient hack that sets
dst->dev to loopback_dev when a network device is removed.
The work to get there is a series of code cleanups. I am starting with
passing net into the netfilter hooks and into the functions that are
called after the netfilter hooks. This removes from netfilter the
need to guess which network namespace it is working on.
To get there I perform a series of minor prep patches so the big changes
at the end are possible to audit without getting lost in the noise. In
particular I have a lot of patches computing net into a local variable
and then using it through out the function.
So this patchset encompases removing dead code, sorting out the _sk
functions that were added last time someone pushed a prototype change
through the post netfilter functions. Cleaning up individual functions
use of the network namespace. Passing net into the netfilter hooks.
Passing net into the post netfilter functions. Using state->net in
the netfilter code where it is available and trivially usable.
Pablo, Dave I don't know whose tree this makes more sense to go
through. I am assuming at least initially Pablos as netfilter is
involved. From what I have seen there will be a lot of back and forth
between the netfilter code paths and the routing code paths.
The patches are also available (against 4.3-rc1) at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/net-next.git master
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: Add blank lines in callers of netfilter hooks
In code review it was noticed that I had failed to add some blank lines
in places where they are customarily used. Taking a second look at the
code I have to agree blank lines would be nice so I have added them
here.
Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This is immediately motivated by the bridge code that chains functions that
call into netfilter. Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.
As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.
To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn. For the moment dst_output_okfn
just silently drops the struct net.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of saying "net = dev_net(state->in?state->in:state->out)"
just say "state->net". As that information is now availabe,
much less confusing and much less error prone.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: Pass struct net into the netfilter hooks
Pass a network namespace parameter into the netfilter hooks. At the
call site of the netfilter hooks the path a packet is taking through
the network stack is well known which allows the network namespace to
be easily and reliabily.
This allows the replacement of magic code like
"dev_net(state->in?:state->out)" that appears at the start of most
netfilter hooks with "state->net".
In almost all cases the network namespace passed in is derived
from the first network device passed in, guaranteeing those
paths will not see any changes in practice.
The exceptions are:
xfrm/xfrm_output.c:xfrm_output_resume() xs_net(skb_dst(skb)->xfrm)
ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont() ip_vs_conn_net(cp)
ipvs/ip_vs_xmit.c:ip_vs_send_or_cont() ip_vs_conn_net(cp)
ipv4/raw.c:raw_send_hdrinc() sock_net(sk)
ipv6/ip6_output.c:ip6_xmit() sock_net(sk)
ipv6/ndisc.c:ndisc_send_skb() dev_net(skb->dev) not dev_net(dst->dev)
ipv6/raw.c:raw6_send_hdrinc() sock_net(sk)
br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev
In all cases these exceptions seem to be a better expression for the
network namespace the packet is being processed in then the historic
"dev_net(in?in:out)". I am documenting them in case something odd
pops up and someone starts trying to track down what happened.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When struct net starts being passed through the ipv4 and ipv6 fragment
routines br_nf_push_frag_xmit will need to take a net parameter.
Prepare br_nf_push_frag_xmit before that is needed and introduce
br_nf_push_frag_xmit_sk for the call sites that still need the old
calling conventions.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This is a prep work for passing struct net through ip_do_fragment and
later the netfilter okfn. Doing this independently makes the later
code changes clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
A function with weird arguments that it will never use to accomdate a
netfilter callback prototype is absolutely in the core of the
networking stack. Frankly it does not make sense and it causes a lot
of confusion as to why arguments that are never used are being passed
to the function.
As I am preparing to make a second change to arguments to the okfn even
the names stops making sense.
As I have removed the two callers of this function remove this confusion
from the networking stack.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The function dev_queue_xmit_skb_sk is unncessary and very confusing.
Introduce br_send_bpdu_finish to remove the need for dev_queue_xmit_skb_sk,
and have br_send_bpdu_finish call dev_queue_xmit.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The function dev_queue_xmit_skb_sk is unncessary and very confusing.
Introduce arp_xmit_finish to remove the need for dev_queue_xmit_skb_sk,
and have arp_xmit_finish call dev_queue_xmit.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>