Feras Daoud [Tue, 31 Oct 2017 12:57:27 +0000 (14:57 +0200)]
net/mlx5e: IPoIB, Use correct timestamp in child receive flow
The current implementation takes the child timestamp object from
the parent since the rq in mlx5i_complete_rx_cqe belongs to the parent.
This change fixes the issue by taking the correct timestamp.
Fixes: 7e7f4780c340 ("net/mlx5e: IPoIB, Use hash-table to map between QPN to child netdev") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Wed, 22 Nov 2017 19:09:05 +0000 (21:09 +0200)]
net/mlx5e: Support offloading TC NIC hairpin flows
We refer to TC NIC rule that involves forwarding as "hairpin".
All hairpin rules from the current NIC device (called "func" in
the code) to a given NIC device ("peer") are steered into the
same hairpin RQ/SQ pair.
The hairpin pair is set on demand and removed when there are no
TC rules that need it.
Here's a TC rule that matches on icmp, does header re-write of the
dst mac and hairpin from RX/enp1s2f1 to TX/enp1s2f2 (enp1s2f1/2 are
two mlx5 devices):
tc filter add dev enp1s2f1 protocol ip parent ffff: prio 2
flower skip_sw ip_proto icmp
action pedit ex munge eth dst set 10:22:33:44:55:66 pipe
action mirred egress redirect dev enp1s2f2
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Sun, 12 Nov 2017 13:52:25 +0000 (15:52 +0200)]
net/mlx5e: Basic setup of hairpin object
Add the code to do basic setup for hairpin object which
will later serve offloading TC flows.
This includes calling the mlx5 core to create/destroy the hairpin
pair object and setting the HW transport objects that will be used
for steering matched flows to go through hairpin.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Unlike conventional RQs/SQs, the memory used for the packet and descriptor
buffers is allocated by the firmware and not the driver. The driver sets
the overall data size (log).
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Yang Shi [Mon, 8 Jan 2018 19:52:54 +0000 (03:52 +0800)]
net: tipc: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by TIPC at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: Ying Xue <ying.xue@windriver.com> Tested-by: Ying Xue <ying.xue@windriver.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Shi [Mon, 8 Jan 2018 19:52:53 +0000 (03:52 +0800)]
net: ovs: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by openvswitch at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: dev@openvswitch.org Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Shi [Mon, 8 Jan 2018 19:52:52 +0000 (03:52 +0800)]
net: caif: remove unused hardirq.h
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by caif at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Cc: Dmitry Tarnyagin <dmitry.tarnyagin@lockless.no> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 7 Jan 2018 10:08:40 +0000 (12:08 +0200)]
8139cp: Replace WARN_ONCE with netdev_WARN_ONCE
Use the more appropriate netdev_WARN_ONCE instead of WARN_ONCE macro.
Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: Realtek linux nic maintainers <nic_swsd@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 7 Jan 2018 10:08:39 +0000 (12:08 +0200)]
bnx2x: Replace WARN_ONCE with netdev_WARN_ONCE
Use the more appropriate netdev_WARN_ONCE instead of WARN_ONCE macro.
Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 7 Jan 2018 10:08:38 +0000 (12:08 +0200)]
e1000: Replace WARN_ONCE with netdev_WARN_ONCE
Use the more appropriate netdev_WARN_ONCE instead of WARN_ONCE macro.
Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gal Pressman [Sun, 7 Jan 2018 10:08:35 +0000 (12:08 +0200)]
net: Fix netdev_WARN_ONCE macro
netdev_WARN_ONCE is broken (whoops..), this fix will remove the
unnecessary "condition" parameter, add the missing comma and change
"arg" to "args".
Fixes: 375ef2b1f0d0 ("net: Introduce netdev_*_once functions") Signed-off-by: Gal Pressman <galp@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for your
net-next tree:
1) Free hooks via call_rcu to speed up netns release path, from
Florian Westphal.
2) Reduce memory footprint of hook arrays, skip allocation if family is
not present - useful in case decnet support is not compiled built-in.
Patches from Florian Westphal.
3) Remove defensive check for malformed IPv4 - including ihl field - and
IPv6 headers in x_tables and nf_tables.
4) Add generic flow table offload infrastructure for nf_tables, this
includes the netlink control plane and support for IPv4, IPv6 and
mixed IPv4/IPv6 dataplanes. This comes with NAT support too. This
patchset adds the IPS_OFFLOAD conntrack status bit to indicate that
this flow has been offloaded.
5) Add secpath matching support for nf_tables, from Florian.
6) Save some code bytes in the fast path for the nf_tables netdev,
bridge and inet families.
7) Allow one single NAT hook per point and do not allow to register NAT
hooks in nf_tables before the conntrack hook, patches from Florian.
8) Seven patches to remove the struct nf_af_info abstraction, instead
we perform direct calls for IPv4 which is faster. IPv6 indirections
are still needed to avoid dependencies with the 'ipv6' module, but
these now reside in struct nf_ipv6_ops.
9) Seven patches to handle NFPROTO_INET from the Netfilter core,
hence we can remove specific code in nf_tables to handle this
pseudofamily.
10) No need for synchronize_net() call for nf_queue after conversion
to hook arrays. Also from Florian.
11) Call cond_resched_rcu() when dumping large sets in ipset to avoid
softlockup. Again from Florian.
12) Pass lockdep_nfnl_is_held() to rcu_dereference_protected(), patch
from Florian Westphal.
13) Fix matching of counters in ipset, from Jozsef Kadlecsik.
14) Missing nfnl lock protection in the ip_set_net_exit path, also
from Jozsef.
15) Move connlimit code that we can reuse from nf_tables into
nf_conncount, from Florian Westhal.
And asorted cleanups:
16) Get rid of nft_dereference(), it only has one single caller.
17) Add nft_set_is_anonymous() helper function.
18) Remove NF_ARP_FORWARD leftover chain definition in nf_tables_arp.
19) Remove unnecessary comments in nf_conntrack_h323_asn1.c
From Varsha Rao.
20) Remove useless parameters in frag_safe_skb_hp(), from Gao Feng.
21) Constify layer 4 conntrack protocol definitions, function
parameters to register/unregister these protocol trackers, and
timeouts. Patches from Florian Westphal.
22) Remove nlattr_size indirection, from Florian Westphal.
23) Add fall-through comments as -Wimplicit-fallthrough needs this,
from Gustavo A. R. Silva.
24) Use swap() macro to exchange values in ipset, patch from
Gustavo A. R. Silva.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yafang Shao [Sun, 7 Jan 2018 06:31:47 +0000 (14:31 +0800)]
net: tracepoint: exposing sk_faimily in tracepoint inet_sock_set_state
As of now, there're two sk_family are traced with sock:inet_sock_set_state,
which are AF_INET and AF_INET6.
So the sk_family are exposed as well.
Then we can conveniently use it to do the filter.
Both sk_family and sk_protocol are showed in the printk message, so we need
not expose them as tracepoint arguments.
Suggested-by: Brendan Gregg <brendan.d.gregg@gmail.com> Suggested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Song Liu <songliubraving@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Fri, 5 Jan 2018 18:47:14 +0000 (19:47 +0100)]
l2tp: adjust comments about L2TPv3 offsets
The "offset" option has been removed by
commit 900631ee6a26 ("l2tp: remove configurable payload offset").
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Acked-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Fri, 5 Jan 2018 16:07:10 +0000 (16:07 +0000)]
net: phy: fix wrong masks to phy_modify()
The mask argument for phy_modify() in several locations was inverted.
Fixes: fea23fb591cc ("net: phy: convert read-modify-write to phy_modify()") Reported-by: Heiner Kallweit <hkallweit1@gmail.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Zhu Yanjun [Fri, 5 Jan 2018 04:06:39 +0000 (23:06 -0500)]
forcedeth: remove duplicate structure member in rx
Since both first_rx and rx_ring are the head of rx ring, it not
necessary to use two structure members to statically indicate
the head of rx ring. So first_rx is removed.
CC: Srinivas Eeda <srinivas.eeda@oracle.com> CC: Joe Jin <joe.jin@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefano Brivio [Thu, 4 Jan 2018 23:38:05 +0000 (00:38 +0100)]
tcp: Split BUG_ON() in tcp_tso_should_defer() into two assertions
The two conditions triggering BUG_ON() are somewhat unrelated:
the tcp_skb_pcount() check is meant to catch TSO flaws, the
second one checks sanity of congestion window bookkeeping.
Split them into two separate BUG_ON() assertions on two lines,
so that we know which one actually triggers, when they do.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 4 Jan 2018 22:03:54 +0000 (14:03 -0800)]
net: ipv6: Allow connect to linklocal address from socket bound to vrf
Allow a process bound to a VRF to connect to a linklocal address.
Currently, this fails because of a mismatch between the scope of the
linklocal address and the sk_bound_dev_if inherited by the VRF binding:
$ ssh -6 fe80::70b8:cff:fedd:ead8%eth1
ssh: connect to host fe80::70b8:cff:fedd:ead8%eth1 port 22: Invalid argument
Relax the scope check to allow the socket to be bound to the same L3
device as the scope id.
This makes ipv6 linklocal consistent with other relaxed checks enabled
by commits 1ff23beebdd3 ("net: l3mdev: Allow send on enslaved interface")
and 7bb387c5ab12a ("net: Allow IP_MULTICAST_IF to set index to L3 slave").
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Thu, 4 Jan 2018 21:26:46 +0000 (00:26 +0300)]
sh_eth: remove sh_eth_plat_data::edmac_endian
Since the commit 888cc8c20cf ("sh_eth: remove EDMAC_BIG_ENDIAN") (geez,
I didn't realize that was 2 years ago!) the initializers in the SuperH
platform code for the 'sh_eth_plat_data::edmac_endian' stopped to matter,
so we can remove that field for good (not sure if it was ever useful --
SH7786 Ether has been reported to have the same EDMAC descriptor/register
endiannes as configured for the SuperH CPU)...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 8 Jan 2018 19:06:20 +0000 (14:06 -0500)]
Merge branch 'hns3-next'
Peng Li says:
====================
add some new features and fix some bugs for HNS3 driver
This patchset adds some new features support and fixes some bugs:
[Patch 1/20] adds support to enable/disable vlan filter with ethtool
[Patch 2/20] disables VFs change rxvlan offload status
[Patch 3/20 - 13/120 fix bugs and refine some codes for packet
statistics, support query with both ifconfig and ethtool.
[Patch 14/20 - 20/20] fix some other bugs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Fri, 5 Jan 2018 10:18:22 +0000 (18:18 +0800)]
net: hns3: fix for not setting pause parameters
Pause parameters include source address, transmit gap and pause time.
The default value of the pause source address is zero in the hardware.
Default pause parameters need to be set to the hardware. Also, when
setting new mac address, the pause source address need to be updated.
Fixes: 9dc2145d910e ("net: hns3: Add support for PFC setting in TM module") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Fri, 5 Jan 2018 10:18:21 +0000 (18:18 +0800)]
net: hns3: add MTU initialization for hardware
When initializing the MAC, the MTU vlaue need to be set to the hardware
too. Otherwise, the MTU value of software will be different from the MTU
value of hardware.
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Fri, 5 Jan 2018 10:18:20 +0000 (18:18 +0800)]
net: hns3: fix for changing MTU
when changing MTU, The new MTU must need to be set to netdevice.
Fixes: a8e8b7ff3517 ("net: hns3: Add support to change MTU in HNS3 hardware") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Fri, 5 Jan 2018 10:18:19 +0000 (18:18 +0800)]
net: hns3: fix for setting MTU
When setting MTU, actually what we do is configuring the max frame size
for the hardware. ETH_HLEN、ETH_FCS_LEN and VLAN_HLEN must need to be
considered. And the frame size which is less than the default value
should not be set to the hardware. Because in the hardware, the the max
frame size not only controls the RX packet size, but also controls the
TX packet size. the RX packets whose size are greater than the setting
value will be dropped.
This patch fixes the bug setting a error max frame size to hardware.
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fuyun Liang [Fri, 5 Jan 2018 10:18:18 +0000 (18:18 +0800)]
net: hns3: fix for updating fc_mode_last_time
commit a9c782822166 ("net: hns3: add support for set_pauseparam")
adds set_pauseparam support for ethtool cmd, but forgets to update
fc_mode_last_time when PFC mode is disabled in hclge_cfg_pauseparam().
The wrong fc_mode_last_time will be used to update flow control mode
when lldpad has been running. As a result, when using the ethtool
command "-a", user will get a wrong pause parameter.
This patch adds the fc_mode_last_time update when PFC mode is disabled.
Fixes: a9c782822166 ("net: hns3: add support for set_pauseparam") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Fri, 5 Jan 2018 10:18:14 +0000 (18:18 +0800)]
net: hns3: Fix an error macro definition of HNS3_TQP_STAT
The member "stats_offset" was designed to indicate the offset
of each member of struct ring_stats in struct hns3_enet_ring,
but forgot to add the offset of the member in struct ring_stats.
Fixes: 496d03e960a ("net: hns3: Add Ethtool support to HNS3 driver") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Fri, 5 Jan 2018 10:18:13 +0000 (18:18 +0800)]
net: hns3: Fix a loop index error of tqp statistics query
An error loop index was used while querying statistics data
of tqps, which may cause call trace.
Fixes: 496d03e960ae ("net: hns3: Add Ethtool support to HNS3 driver") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Fri, 5 Jan 2018 10:18:12 +0000 (18:18 +0800)]
net: hns3: Fix an error of total drop packet statistics
The dropped tx/rx packets number of each tqp should also
be counted into the total drop tx/rx packets numbers.
Fixes: 76ad4f0ee74 ("net: hns3: Add support of HNS3 Ethernet Driver for hip08 SoC") Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Fri, 5 Jan 2018 10:18:11 +0000 (18:18 +0800)]
net: hns3: Mask the packet statistics query when NIC is down
Update the HNS3_NIC_STATE_DOWN bit when NIC state changes.
When NIC is down, mask the packet statistics for querying
with ifconfig command. It's a common practice.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Fri, 5 Jan 2018 10:18:10 +0000 (18:18 +0800)]
net: hns3: Modify the update period of packet statistics
It takes more than 200 query response messages between
driver and IMP, while updating the packet statistics.
It's too heavy for IMP to update it per second.
Extend the update period of packet statistics data from
1 second to 300 seconds(if too long, the statistics may
overflow).
As a result, we need to update it while querying with
ifconfig tool to keep the statistics data fresh.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Fri, 5 Jan 2018 10:18:07 +0000 (18:18 +0800)]
net: hns3: Unify the strings display of packet statistics
Some members of packet statistics are named in different styles.
This patch unifies them with new internal name rules, the main
modification are below:
trans --> tx
rcv --> rx
rcb_q%d_tx --> txq#%d
rcb_q%d_rx --> rxq#%d
sw_err_cnt(tx side) --> tx_dropped
sw_err_cnt(rx side) --> rx_dropped
pkts --> packets
tx_err_cnt --> errors
rx_err_cnt --> errors
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jian Shen [Fri, 5 Jan 2018 10:18:06 +0000 (18:18 +0800)]
net: hns3: Disable VFs change rxvlan offload status
Rxvlan offload status can only be changed by PF. Initialize
the value of NETIF_F_HW_VLAN_CTAG_RX bit of hw_features for
VFS to false, make sure user can't be able to change it.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This series introduces the MAPv4 packet format for checksum
offload plus some other minor changes.
Patches 1-3 are cleanups.
Patch 4 renames the ingress format to data format so that all data
formats can be configured using this going forward.
Patch 5 uses the pacing helper to improve TCP transmit performance.
Patch 6-9 defines the the MAPv4 for checksum offload for RX and TX.
A new header and trailer format are used as part of MAPv4.
For RX checksum offload, only the 1's complement of the IP payload
portion is computed by hardware. The meta data from RX header is
used to verify the checksum field in the packet. Note that the
IP packet and its field itself is not modified by hardware.
This gives metadata to help with the RX checksum. For TX, the
required metadata is filled up so hardware can compute the
checksum.
Patch 10 enables GSO on rmnet devices
v1->v2: Fix sparse errors reported by kbuild test robot
v2->v3: Update the commit message for Patch 5 based on Eric's comments
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Real devices may support scatter gather(SG), so enable SG on rmnet
devices to use GSO. GSO reduces CPU cycles by 20% for a rate of
146Mpbs for a single stream TCP connection.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: qualcomm: rmnet: Add support for TX checksum offload
TX checksum offload applies to TCP / UDP packets which are not
fragmented using the MAPv4 checksum trailer. The following needs to be
done to have checksum computed in hardware -
1. Set the checksum start offset and inset offset.
2. Set the csum_enabled bit
3. Compute and set 1's complement of partial checksum field in
transport header.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: qualcomm: rmnet: Handle command packets with checksum trailer
When using the MAPv4 packet format in conjunction with MAP commands,
a dummy DL checksum trailer will be appended to the packet. Before
this packet is sent out as an ACK, the DL checksum trailer needs to be
removed.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: qualcomm: rmnet: Add support for RX checksum offload
When using the MAPv4 packet format, receive checksum offload can be
enabled in hardware. The checksum computation over pseudo header is
not offloaded but the rest of the checksum computation over
the payload is offloaded. This applies only for TCP / UDP packets
which are not fragmented.
rmnet validates the TCP/UDP checksum for the packet using the checksum
from the checksum trailer added to the packet by hardware. The
validation performed is as following -
1. Perform 1's complement over the checksum value from the trailer
2. Compute 1's complement checksum over IPv4 / IPv6 header and
subtracts it from the value from step 1
3. Computes 1's complement checksum over IPv4 / IPv6 pseudo header and
adds it to the value from step 2
4. Subtracts the checksum value from the TCP / UDP header from the
value from step 3.
5. Compares the value from step 4 to the checksum value from the
TCP / UDP header.
6. If the comparison in step 5 succeeds, CHECKSUM_UNNECESSARY is set
and the packet is passed on to network stack. If there is a
failure, then the packet is passed on as such without modifying
the ip_summed field.
The checksum field is also checked for UDP checksum 0 as per RFC 768
and for unexpected TCP checksum of 0.
If checksum offload is disabled when using MAPv4 packet format in
receive path, the packet is queued as is to network stack without
the validations above.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: qualcomm: rmnet: Define the MAPv4 packet formats
The MAPv4 packet format adds support for RX / TX checksum offload.
For a bi-directional UDP stream at a rate of 570 / 146 Mbps, roughly
10% CPU cycles are saved.
For receive path, there is a checksum trailer appended to the end of
the MAP packet. The valid field indicates if hardware has computed
the checksum. csum_start_offset indicates the offset from the start
of the IP header from which hardware has computed checksum.
csum_length is the number of bytes over which the checksum was
computed and the resulting value is csum_value.
In the transmit path, a header is appended between the end of the MAP
header and the start of the IP packet. csum_start_offset is the offset
in bytes from which hardware will compute the checksum if the
csum_enabled bit is set. udp_ip4_ind indicates if the checksum
value of 0 is valid or not. csum_insert_offset is the offset from the
csum_start_offset where hardware will insert the computed checksum.
The use of this additional packet format for checksum offload is
explained in subsequent patches.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
The real device over which the rmnet devices are installed also
aggregate multiple IP packets and sends them as a single large
aggregate frame to the hardware. This causes degraded throughput
for TCP TX due to bufferbloat.
To overcome this problem, pacing shift value of 8 is set using the
sk_pacing_shift_update() helper. This value was determined based
on experiments with a single stream TCP TX using iperf for a
duration of 30s.
Pacing shift | Observed data rate (Mbps)
10 | 9
9 | 140
8 | 146 (Max link rate)
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: qualcomm: rmnet: Remove invalid condition while stamping mux id
rmnet devices cannot have a mux id of 255. This is validated when
assigning the mux id to the rmnet devices. As a result, checking for
mux id 255 does not apply in egress path.
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: ipset: use swap macro instead of _manually_ swapping values
Make use of the swap macro and remove unnecessary variables tmp.
This makes the code easier to read and maintain.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add new instruction for the nf_tables VM that allows us to specify what
flows are offloaded into a given flow table via name. This new
instruction creates the flow entry and adds it to the flow table.
Only established flows, ie. we have seen traffic in both directions, are
added to the flow table. You can still decide to offload entries at a
later stage via packet counting or checking the ct status in case you
want to offload assured conntracks.
This new extension depends on the conntrack subsystem.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds the IPv4 flow table type, that implements the datapath
flow table to forward IPv4 traffic. Rationale is:
1) Look up for the packet in the flow table, from the ingress hook.
2) If there's a hit, decrement ttl and pass it on to the neighbour layer
for transmission.
3) If there's a miss, packet is passed up to the classic forwarding
path.
This patch also supports layer 3 source and destination NAT.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch defines the API to interact with flow tables, this allows to
add, delete and lookup for entries in the flow table. This also adds the
generic garbage code that removes entries that have expired, ie. no
traffic has been seen for a while.
Users of the flow table infrastructure can delete entries via
flow_offload_dead(), which sets the dying bit, this signals the garbage
collector to release an entry from user context.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch introduces a netlink control plane to create, delete and dump
flow tables. Flow tables are identified by name, this name is used from
rules to refer to an specific flow table. Flow tables use the rhashtable
class and a generic garbage collector to remove expired entries.
This also adds the infrastructure to add different flow table types, so
we can add one for each layer 3 protocol family.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The timer of such conntrack entries look like stopped from userspace.
In practise, to make sure the conntrack entry does not go away, the
conntrack timer is periodically set to an arbitrary large value that
gets refreshed on every iteration from the garbage collector, so it
never expires- and they display no internal state in the case of TCP
flows. This allows us to save a bitcheck from the packet path via
nf_ct_is_expired().
Conntrack entries that have been offloaded to the flow table
infrastructure cannot be deleted/flushed via ctnetlink. The flow table
infrastructure is also responsible for releasing this conntrack entry.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: remove defensive check on malformed packets from raw sockets
Users cannot forge malformed IPv4/IPv6 headers via raw sockets that they
can inject into the stack. Specifically, not for IPv4 since 55888dfb6ba7
("AF_RAW: Augment raw_send_hdrinc to expand skb to fit iphdr->ihl
(v2)"). IPv6 raw sockets also ensure that packets have a well-formed
IPv6 header available in the skbuff.
At quick glance, br_netfilter also validates layer 3 headers and it
drops malformed both IPv4 and IPv6 packets.
Therefore, let's remove this defensive check all over the place.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: move reroute indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_reroute() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define reroute indirection in nf_ipv6_ops where this really
belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: move route indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_route() because that would result
in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define route indirection in nf_ipv6_ops where this really
belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: remove saveroute indirection in struct nf_afinfo
This is only used by nf_queue.c and this function comes with no symbol
dependencies with IPv6, it just refers to structure layouts. Therefore,
we can replace it by a direct function call from where it belongs.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: move checksum_partial indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_checksum_partial() because that
would result in autoloading the 'ipv6' module because of symbol
dependencies. Therefore, define checksum_partial indirection in
nf_ipv6_ops where this really belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: move checksum indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_checksum() because that would
result in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define checksum indirection in nf_ipv6_ops where this really
belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: connlimit: split xt_connlimit into front and backend
This allows to reuse xt_connlimit infrastructure from nf_tables.
The upcoming nf_tables frontend can just pass in an nftables register
as input key, this allows limiting by any nft-supported key, including
concatenations.
For xt_connlimit, pass in the zone and the ip/ipv6 address.
netfilter: nf_tables: remove multihook chains and families
Since NFPROTO_INET is handled from the core, we don't need to maintain
extra infrastructure in nf_tables to handle the double hook
registration, one for IPv4 and another for IPv6.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: explicit nft_set_pktinfo() call from hook path
Instead of calling this function from the family specific variant, this
reduces the code size in the fast path for the netdev, bridge and inet
families. After this change, we must call nft_set_pktinfo() upfront from
the chain hook indirection.
Before:
text data bss dec hex filename
2145 208 0 2353 931 net/netfilter/nf_tables_netdev.o
After:
text data bss dec hex filename
2125 208 0 2333 91d net/netfilter/nf_tables_netdev.o
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: core: only allow one nat hook per hook point
The netfilter NAT core cannot deal with more than one NAT hook per hook
location (prerouting, input ...), because the NAT hooks install a NAT null
binding in case the iptables nat table (iptable_nat hooks) or the
corresponding nftables chain (nft nat hooks) doesn't specify a nat
transformation.
Null bindings are needed to detect port collsisions between NAT-ed and
non-NAT-ed connections.
This causes nftables NAT rules to not work when iptable_nat module is
loaded, and vice versa because nat binding has already been attached
when the second nat hook is consulted.
The netfilter core is not really the correct location to handle this
(hooks are just hooks, the core has no notion of what kinds of side
effects a hook implements), but its the only place where we can check
for conflicts between both iptables hooks and nftables hooks without
adding dependencies.
So add nat annotation to hook_ops to describe those hooks that will
add NAT bindings and then make core reject if such a hook already exists.
The annotation fills a padding hole, in case further restrictions appar
we might change this to a 'u8 type' instead of bool.
iptables error if nft nat hook active:
iptables -t nat -A POSTROUTING -j MASQUERADE
iptables v1.4.21: can't initialize iptables table `nat': File exists
Perhaps iptables or your kernel needs to be upgraded.
nftables error if iptables nat table present:
nft -f /etc/nftables/ipv4-nat
/usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists
table nat {
^^
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: xtables: add and use xt_request_find_table_lock
currently we always return -ENOENT to userspace if we can't find
a particular table, or if the table initialization fails.
Followup patch will make nat table init fail in case nftables already
registered a nat hook so this change makes xt_find_table_lock return
an ERR_PTR to return the errno value reported from the table init
function.
Add xt_request_find_table_lock as try_then_request_module replacement
and use it where needed.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: don't allocate space for arp/bridge hooks unless needed
no need to define hook points if the family isn't supported.
Because we need these hooks for either nftables, arp/ebtables
or the 'call-iptables' hack we have in the bridge layer add two
new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
users select them.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
which store the hook entry point locations for the various protocol
families and the hooks.
Using array results in compact c code when doing accesses, i.e.
x = rcu_dereference(net->nf.hooks[pf][hook]);
but its also wasting a lot of memory, as most families are
not used.
So split the array into those families that are used, which
are only 5 (instead of 13). In most cases, the 'pf' argument is
constant, i.e. gcc removes switch statement.
Florian Westphal [Thu, 30 Nov 2017 23:21:04 +0000 (00:21 +0100)]
netfilter: core: free hooks with call_rcu
Giuseppe Scrivano says:
"SELinux, if enabled, registers for each new network namespace 6
netfilter hooks."
Cost for this is high. With synchronize_net() removed:
"The net benefit on an SMP machine with two cores is that creating a
new network namespace takes -40% of the original time."
This patch replaces synchronize_net+kvfree with call_rcu().
We store rcu_head at the tail of a structure that has no fixed layout,
i.e. we cannot use offsetof() to compute the start of the original
allocation. Thus store this information right after the rcu head.
We could simplify this by just placing the rcu_head at the start
of struct nf_hook_entries. However, this structure is used in
packet processing hotpath, so only place what is needed for that
at the beginning of the struct.
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 30 Nov 2017 23:21:03 +0000 (00:21 +0100)]
netfilter: core: remove synchronize_net call if nfqueue is used
since commit 960632ece6949b ("netfilter: convert hook list to an array")
nfqueue no longer stores a pointer to the hook that caused the packet
to be queued. Therefore no extra synchronize_net() call is needed after
dropping the packets enqueued by the old rule blob.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 30 Nov 2017 23:21:02 +0000 (00:21 +0100)]
netfilter: core: make nf_unregister_net_hooks simple wrapper again
This reverts commit d3ad2c17b4047
("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").
Nothing wrong with it. However, followup patch will delay freeing of hooks
with call_rcu, so all synchronize_net() calls become obsolete and there
is no need anymore for this batching.
This revert causes a temporary performance degradation when destroying
network namespace, but its resolved with the upcoming call_rcu conversion.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 30 Nov 2017 20:08:05 +0000 (21:08 +0100)]
netfilter: ipset: add resched points during set listing
When sets are extremely large we can get softlockup during ipset -L.
We could fix this by adding cond_resched_rcu() at the right location
during iteration, but this only works if RCU nesting depth is 1.
At this time entire variant->list() is called under under rcu_read_lock_bh.
This used to be a read_lock_bh() but as rcu doesn't really lock anything,
it does not appear to be needed, so remove it (ipset increments set
reference count before this, so a set deletion should not be possible).
Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>