David S. Miller [Mon, 13 Jan 2014 19:23:02 +0000 (11:23 -0800)]
Merge branch 'ip_forward_pmtu'
Hannes Frederic Sowa says:
====================
path mtu hardening patches
After a lot of back and forth I want to propose these changes regarding
path mtu hardening and give an outline why I think this is the best way
how to proceed:
This set contains the following patches:
* ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing
* ipv6: introduce ip6_dst_mtu_forward and protect forwarding path with it
* ipv4: introduce hardened ip_no_pmtu_disc mode
The first one switches the forwarding path of IPv4 to use the interface
mtu by default and ignore a possible discovered path mtu. It provides
a sysctl to switch back to the original behavior (see discussion below).
The second patch does the same thing unconditionally for IPv6. I don't
provide a knob for IPv6 to switch to original behavior (please see
below).
The third patch introduces a hardened pmtu mode, where only pmtu
information are accepted where the protocol is able to do more stringent
checks on the icmp piggyback payload (please see the patch commit msg
for further details).
Why is this change necessary?
First of all, RFC 1191 4. Router specification says:
"When a router is unable to forward a datagram because it exceeds the
MTU of the next-hop network and its Don't Fragment bit is set, the
router is required to return an ICMP Destination Unreachable message
to the source of the datagram, with the Code indicating
"fragmentation needed and DF set". ..."
For some time now fragmentation has been considered problematic, e.g.:
* http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-87-3.pdf
* http://tools.ietf.org/search/rfc4963
Most of them seem to agree that fragmentation should be avoided because
of efficiency, data corruption or security concerns.
Recently it was shown possible that correctly guessing IP ids could lead
to data injection on DNS packets:
<https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf>
While we can try to completly stop fragmentation on the end host
(this is e.g. implemented via IP_PMTUDISC_INTERFACE), we cannot stop
fragmentation completly on the forwarding path. On the end host the
application has to deal with MTUs and has to choose fallback methods
if fragmentation could be an attack vector. This is already the case for
most DNS software, where a maximum UDP packet size can be configured. But
until recently they had no control over local fragmentation and could
thus emit fragmented packets.
On the forwarding path we can just try to delay the fragmentation to
the last hop where this is really necessary. Current kernel already does
that but only because routers don't receive feedback of path mtus, these are
only send back to the end host system. But it is possible to maliciously
insert path mtu inforamtion via ICMP packets which have an icmp echo_reply
payload, because we cannot validate those notifications against local
sockets. DHCP clients which establish an any-bound RAW-socket could also
start processing unwanted fragmentation-needed packets.
Why does IPv4 has a knob to revert to old behavior while IPv6 doesn't?
IPv4 does fragmentation on the path while IPv6 does always respond with
packet-too-big errors. The interface MTU will always be greater than
the path MTU information. So we would discard packets we could actually
forward because of malicious information. After this change we would
let the hop, which really could not forward the packet, notify the host
of this problem.
IPv4 allowes fragmentation mid-path. In case someone does use a software
which tries to discover such paths and assumes that the kernel is handling
the discovered pmtu information automatically. This should be an extremly
rare case, but because I could not exclude the possibility this knob is
provided. Also this software could insert non-locked mtu information
into the kernel. We cannot distinguish that from path mtu information
currently. Premature fragmentation could solve some problems in wrongly
configured networks, thus this switch is provided.
One frag-needed packet could reduce the path mtu down to 522 bytes
(route/min_pmtu).
Misc:
IPv6 neighbor discovery could advertise mtu information for an
interface. These information update the ipv6-specific interface mtu and
thus get used by the forwarding path.
Tunnel and xfrm output path will still honour path mtu and also respond
with Packet-too-Big or fragmentation-needed errors if needed.
Changelog for all patches:
v2)
* enabled ip_forward_use_pmtu by default
* reworded
v3)
* disabled ip_forward_use_pmtu by default
* reworded
v4)
* renamed ip_dst_mtu_secure to ip_dst_mtu_maybe_forward
* updated changelog accordingly
* removed unneeded !!(... & ...) double negations
v2)
* by default we honour pmtu information
3)
* only honor interface mtu
* rewritten and simplified
* no knob to fall back to old mode any more
v2)
* reworded Documentation
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
to be honored by protocols which do more stringent validation on the
ICMP's packet payload. This knob is useful for people who e.g. want to
run an unmodified DNS server in a namespace where they need to use pmtu
for TCP connections (as they are used for zone transfers or fallback
for requests) but don't want to use possibly spoofed UDP pmtu information.
Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
if the returned packet is in the window or if the association is valid.
Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Suggested-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: introduce ip6_dst_mtu_forward and protect forwarding path with it
In the IPv6 forwarding path we are only concerend about the outgoing
interface MTU, but also respect locked MTUs on routes. Tunnel provider
or IPSEC already have to recheck and if needed send PtB notifications
to the sending host in case the data does not fit into the packet with
added headers (we only know the final header sizes there, while also
using path MTU information).
The reason for this change is, that path MTU information can be injected
into the kernel via e.g. icmp_err protocol handler without verification
of local sockets. As such, this could cause the IPv6 forwarding path to
wrongfully emit Packet-too-Big errors and drop IPv6 packets.
Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing
While forwarding we should not use the protocol path mtu to calculate
the mtu for a forwarded packet but instead use the interface mtu.
We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
introduced for multicast forwarding. But as it does not conflict with
our usage in unicast code path it is perfect for reuse.
I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
dependencies because of IPSKB_FORWARDED.
Because someone might have written a software which does probe
destinations manually and expects the kernel to honour those path mtus
I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
can disable this new behaviour. We also still use mtus which are locked on a
route for forwarding.
The reason for this change is, that path mtus information can be injected
into the kernel via e.g. icmp_err protocol handler without verification
of local sockets. As such, this could cause the IPv4 forwarding path to
wrongfully emit fragmentation needed notifications or start to fragment
packets along a path.
Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
won't be set and further fragmentation logic will use the path mtu to
determine the fragmentation size. They also recheck packet size with
help of path mtu discovery and report appropriate errors.
Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: John Heffner <johnwheffner@gmail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Thu, 9 Jan 2014 06:42:25 +0000 (22:42 -0800)]
qlcnic: Convert vmalloc/memset to kcalloc
vmalloc is a limited resource. Don't use it unnecessarily.
It seems this allocation should work with kcalloc.
Remove unnecessary memset(,0,) of buf as it's completely
overwritten as the previously only unset field in
struct qlcnic_pci_func_cfg is now set to 0.
Use kfree instead of vfree.
Use ETH_ALEN instead of 6.
Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
That code has been around for ages without being used.
CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bonding: convert 3ad to use pr_warn instead of pr_warning
CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
It's a huge mess currently, that is really hard to read. This cleanup
doesn't touch the logic at all, it only breaks easy-to-fix long lines and
updates comment styles.
CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 12 Jan 2014 04:53:03 +0000 (20:53 -0800)]
Merge branch 'alx_stats'
Sabrina Dubroca says:
====================
alx: add statistics
Currently, the alx driver doesn't support statistics [1,2]. The
original alx driver [3] that Johannes Berg modified provided
statistics. This patch is an adaptation of the statistics code from
the original driver to the alx driver included in the kernel.
v4:
- modified the assignements of hw stats to netstats (Ben Hutchings)
- added comments to describe the stats fields (copied from atlx)
v3:
- renamed __alx_update_hw_stats to alx_update_hw_stats (Stephen Hemminger)
v2:
- use u64 instead of unsigned long (Ben Hutchings)
- implement ndo_get_stats64 instead of ndo_get_stats (Ben Hutchings)
- use EINVAL instead of ENOTSUPP (Ben Hutchings)
- add BUILD_BUG_ON to check the size of the stats (Johannes Berg, Ben
Hutchings)
- add a comment regarding persistence of the stats (Stephen Hemminger)
- align assignments in __alx_update_hw_stats
Sabrina Dubroca [Thu, 9 Jan 2014 09:09:31 +0000 (10:09 +0100)]
alx: add stats to ethtool
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Thu, 9 Jan 2014 09:09:30 +0000 (10:09 +0100)]
alx: add alx_get_stats64 operation
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 12 Jan 2014 04:51:10 +0000 (20:51 -0800)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e and now i40evf.
Most notable is Jacob's patch to add PTP support to i40e.
Mitch cleans up additional memcpy's and use struct assignment instead.
Then fixes long lines to appease checkpatch.pl. Mitch then provides
a fix to keep us from spamming the log with confusing errors. If you
use ip to change the MAC address of a VF while the VF driver is loaded,
closing the VF interface or unloading the VF driver will cause the VF
driver to remove the MAC filter for its original (now invalid) MAC
address.
Jesse cleans up macros which are no longer needed or used.
I (Jeff) cleanup function header comments to ensure Doxygen/kdoc works
correctly to generate documentation without warnings.
Anjali fixes a bug where ethtool set-channels would return failure when
configuring only one Rx queue. Then fixes a bug where the driver was
erroneously exiting the driver unload path if one part of the unload
failed.
Shannon fixes if the IPV6EXADD but is set in the Rx descriptor status,
there was an optional extension header with an alternate IP address
detected and the hardware checksum was not handling the alternate IP
address correctly. Then adjusts the ITR max and min values to match
the hardware max value and recommended min value. Shannon makes sure
to clear the PXE mode after the adminq is initialized.
v2:
- fix patch 14 "i40e: enable PTP" to address Richard Cochran's spelling
catch and Ben Hutchings Kconfig, SIOCGHWTSTAMP and sizeof() suggestions
- added Paul Gortmaker's i40evf fix patch
v3:
- fix patch 14 "i40e: enable PTP" to address Ben Hutchings concerns about
a race with PTP init and cleanup and i40e_get_ts_info().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/intel/i40evf/i40e_txrx.c: In function 'i40e_clean_rx_irq':
drivers/net/ethernet/intel/i40evf/i40e_txrx.c:818:3: error: implicit declaration of function 'prefetch'
make[5]: *** [drivers/net/ethernet/intel/i40evf/i40e_txrx.o] Error 1
due to an implicit assumption that the prototype from linux/prefetch.h
will be present.
Cc: Mitch Williams <mitch.a.williams@intel.com> Cc: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Sat, 21 Dec 2013 05:44:49 +0000 (05:44 +0000)]
i40e: call clear_pxe after adminq is initialized
In the latest firmware the clear_pxe_mode function will use the
AdminQ request, so call this after AdminQ is set up rather than
relying on i40e_pf_reset() to clear the PXE mode.
Change-ID: Ice8cba2e9cbc3c7bde0a0bcf8eaf5009abef040b Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Sat, 21 Dec 2013 05:44:47 +0000 (05:44 +0000)]
i40e: adjust ITR max and min values
Set the ITR max and min values to match the hardware max value
and the recommended min value. These values are shifted right
one bit because the register counts in 2 usec units, so leave
a comment to explain.
Change-ID: I289c27955cf6c566a6d21b95c3110b88cbb15dad Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Sat, 21 Dec 2013 05:44:46 +0000 (05:44 +0000)]
i40e: check for possible incorrect ipv6 checksum
If the IPV6EXADD bit is set in the Rx descriptor status, there
was an optional extension header with an alternate IP address
detected. The HW checksum offload doesn't handle the alternate
IP address correctly so likely comes up with the wrong answer.
Thus, if the bit is set we ignore the checksum offload value.
Change-ID: I70ff8d38cdcddccf44107691cae13d0c07c284c8 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Sat, 21 Dec 2013 05:44:45 +0000 (05:44 +0000)]
i40e: allow VF to remove any MAC filter
If you use ip to change the MAC address of a VF while the VF
driver is loaded, closing the VF interface or unloading the VF
driver will cause the VF driver to remove the MAC filter for its
original (now invalid) MAC address. This would cause the PF
driver to kick an error message to the log, and back to the VF
driver.
Since the VF driver has not really done anything naughty, let's
not punish it. Don't check for MAC address overrides on the
delete operation, just make sure it's a valid address. This keeps
us from spamming the log with confusing errors.
Change-ID: I1f051bd4014e50855457d928c9ee8b0766981b2f Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
i40e: do not bail when disabling if Tx queue disable fails
Fix a bug where the driver was erroneously exiting the driver unload
path if one part of the unload failed. Instead of the original way
the driver should always continue when disabling and be sure to disable
all queues.
David S. Miller [Fri, 10 Jan 2014 22:59:34 +0000 (17:59 -0500)]
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Included changes:
- substitute FSF address with URL
- deselect current bat-GW when GW-client mode gets deactivated
- send every DHCP packet using bat-unicast messages when GW-client mode is
enabled
- implement the Extended Isolation mechanism (it is an enhancement of the
already existing batman-AP-isolation). This mechanism allows the user to drop
packets exchanged by selected clients by using netfilter marks.
- fix typ0 in header guard
- minor code cleanups
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Jan 2014 22:38:33 +0000 (17:38 -0500)]
Merge branch 'tcp_metrics_saddr'
Christoph Paasch says:
====================
Make tcp-metrics source-address aware
Currently tcp-metrics only stores per-destination addresses. This brings
problems, when a host has multiple interfaces (e.g., a smartphone having
WiFi/3G):
For example, a host contacting a server over WiFi will store the tcp-metrics
per destination IP. If then the host contacts the same server over 3G, the
same tcp-metrics will be used, although the path-characteristics are completly
different (e.g., the ssthresh is probably not the same).
In case of TFO this is not a problem, as the server will provide us a new cookie
once he saw our SYN+DATA with an incorrect cookie.
It may be (in case of carrier-grade NAT), that we keep the same public IP but
have a different private IP. Thus, we better reuse the old cookie even if our
source-IP has changed. However, this scenario is probably very uncommon, as
carriers try to provide the same src-IP to the clients behind their CGN.
Patches 1 + 2 add the source-IP to the tcp metrics.
Patches 3 to 5 modify the netlink-api to support the source-IP. From now on,
when using the command "ip tcp_metrics delete address ADDRESS" all entries
which match this destination IP will be deleted.
Today's iproute2 will complain when doing "ip tcp_metrics flush PREFIX" if
several entries are present for the same destination-IP but with different
source-IPs:
root@client:~/test# ip tcp_metrics
10.2.1.2 age 3.640sec rtt 16250us rttvar 15000us cwnd 10
10.2.1.2 age 4.030sec rtt 18750us rttvar 15000us cwnd 10
root@client:~/test# ip tcp_metrics flush 10.2.1.2/16
Failed to send flush request
: No such process
Follow-up patches will modify iproute2 to handle this correctly and allow
specifying the source-IP in the get/del commands.
v2: Added the patch that allows to selectively get/del of tcp-metrics based
on src-IP and moved the patch that adds the new netlink attribute before
the other patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: metrics: Allow selective get/del of tcp-metrics based on src IP
We want to be able to get/del tcp-metrics based on the src IP. This
patch adds the necessary parsing of the netlink attribute and if the
source address is set, it will match on this one too.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
We add the source-address to the tcp-metrics, so that different metrics
will be used per source/destination-pair. We use the destination-hash to
store the metric inside the hash-table. That way, deleting and dumping
via "ip tcp_metrics" is easy.
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Jan 2014 19:53:33 +0000 (14:53 -0500)]
Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:
====================
Please pull these updates for the 3.14 stream!
For the mac80211 bits, Johannes says:
"Felix adds some helper functions for P2P NoA software tracking, Joe
fixes alignment (but as this apparently never caused issues I didn't
send it to 3.13), Kyeyoon/Jouni add QoS-mapping support (a Hotspot 2.0
feature), Weilong fixed a bunch of checkpatch errors and I get to play
fire-fighter or so and clean up other people's locking issues. I also
added nl80211 vendor-specific events, as we'd discussed at the wireless
summit."
For the iwlwifi bits, Emmanuel says:
"I have here a rework of the interrupt handling to meet RT kernel
requirements - basically we don't take any lock in the primary interrupt
handler. This gave me a good reason to clean things up a bit on the way.
There is also a fix of the QoS mapping along with a few workarounds for
hardware / firmware issues that are hard to hit.
Three fixes suggested by static analyzers, and other various stuff.
Most importantly, I update the Copyright note to include the new year."
For the bluetooth bits, Gustavo says:
"More patches to 3.14. The bulk of changes here is the 6LoWPAN support for
Bluetooth LE Devices. The commits that touches net/ieee802154/ are already
acked by David Miller. Other than that we have some RFCOMM fixes and
improvements plus fixes and clean ups all over the tree."
Beyond that, ath9k, brcmfmac, mwifiex, and wil6210 get their usual
level of attention. The wl1251 driver gets a number of updates,
and there are a handful of other bits here and there.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Durrant [Wed, 8 Jan 2014 12:41:58 +0000 (12:41 +0000)]
xen-netback: stop vif thread spinning if frontend is unresponsive
The recent patch to improve guest receive side flow control (ca2f09f2) had a
slight flaw in the wait condition for the vif thread in that any remaining
skbs in the guest receive side netback internal queue would prevent the
thread from sleeping. An unresponsive frontend can lead to a permanently
non-empty internal queue and thus the thread will spin. In this case the
thread should really sleep until the frontend becomes responsive again.
This patch adds an extra flag to the vif which is set if the shared ring
is full and cleared when skbs are drained into the shared ring. Thus,
if the thread runs, finds the shared ring full and can make no progress the
flag remains set. If the flag remains set then the thread will sleep,
regardless of a non-empty queue, until the next event from the frontend.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huqiu reported current sh_sir driver doesn't
call free_irq() in spite of using request_irq().
This patch replaces request_irq() into devm_request_irq()
to solve this issue
Reported-by: Huqiu Liu<huqiuliu@gmail.com> Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huqiu reported current sh_irda driver doesn't
call free_irq() in spite of using request_irq().
This patch replaces request_irq() into devm_request_irq()
to solve this issue
Reported-by: Huqiu Liu<huqiuliu@gmail.com> Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Jan 2014 02:36:01 +0000 (21:36 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables
Pablo Neira Ayuso says:
====================
nf_tables updates for net-next
The following patchset contains the following nf_tables updates,
mostly updates from Patrick McHardy, they are:
* Add the "inet" table and filter chain type for this new netfilter
family: NFPROTO_INET. This special table/chain allows IPv4 and IPv6
rules, this should help to simplify the burden in the administration
of dual stack firewalls. This also includes several patches to prepare
the infrastructure for this new table and a new meta extension to
match the layer 3 and 4 protocol numbers, from Patrick McHardy.
* Load both IPv4 and IPv6 conntrack modules in nft_ct if the rule is used
in NFPROTO_INET, as we don't certainly know which one would be used,
also from Patrick McHardy.
* Do not allow to delete a table that contains sets, otherwise these
sets become orphan, from Patrick McHardy.
* Hold a reference to the corresponding nf_tables family module when
creating a table of that family type, to avoid the module deletion
when in use, from Patrick McHardy.
* Update chain counters before setting the chain policy to ensure that
we don't leave the chain in inconsistent state in case of errors (aka.
restore chain atomicity). This also fixes a possible leak if it fails
to allocate the chain counters if no counters are passed to be restored,
from Patrick McHardy.
* Don't check for overflows in the table counter if we are just renaming
a chain, from Patrick McHardy.
* Replay the netlink request after dropping the nfnl lock to load the
module that supports provides a chain type, from Patrick.
* Fix chain type module references, from Patrick.
* Several cleanups, function renames, constification and code
refactorizations also from Patrick McHardy.
* Add support to set the connmark, this can be used to set it based on
the meta mark (similar feature to -j CONNMARK --restore), from
Kristian Evensen.
* A couple of fixes to the recently added meta/set support and nft_reject,
and fix missing chain type unregistration if we fail to register our
the family table/filter chain type, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
James Chapman [Mon, 6 Jan 2014 10:17:08 +0000 (10:17 +0000)]
netfilter: introduce l2tp match extension
Introduce an xtables add-on for matching L2TP packets. Supports L2TPv2
and L2TPv3 over IPv4 and IPv6. As well as filtering on L2TP tunnel-id
and session-id, the filtering decision can also include the L2TP
packet type (control or data), protocol version (2 or 3) and
encapsulation type (UDP or IP).
The most common use for this will likely be to filter L2TP data
packets of individual L2TP tunnels or sessions. While a u32 match can
be used, the L2TP protocol headers are such that field offsets differ
depending on bits set in the header, making rules for matching generic
L2TP connections cumbersome. This match extension takes care of all
that.
Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Thu, 9 Jan 2014 20:13:12 +0000 (15:13 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e only.
Anjali provides a fix where interrupts were not being re-enabled on ICR0
even though they were auto masked by hardware. Then provides a fix to
cleanup RSS initialization because it was doing some extra work, so
remove the extra work and any bugs it created when managing number of
queues. Since hardware requires a full packet template to be pointed to
when adding hardware flow filters, add the template and use it for
programming filters.
Jesse provides a fix to replace the use of driver specific defines with
kernel ETH_ALEN defines. Then disables packet split because with the
use of GRO, we do not need the extra bus overhead. Fixes spelling
error in code comment.
Kamil provides a fix for the driver where the hardware expects the MAC
address in a very specific format and the driver was filing the data
incorrectly.
Mitch provides a fix to resolve a panic on reset by adding checks to
VSI->rx_rings. Then shortens alloc_rx_buff_failed and
alloc_rx_page_failed variables since both part of an RX specific
structure so just remove the _rx part of the name. Then fixes
badly formatted lines, long lines and mis-formatted lines.
Shannon provides a fix to call AQ to release any reservation held by this
PF on the NVM resource lock on startup, in order to clear anything that
might have been left over from a previous run. Then removes interrupt on
AQ error since nearly everything we do is synchronous, using the
interrupt-on-error bit is unnecessary and causing unneeded interrupts.
Adds code to handle the ability to send messages among the physical
function interfaces by the admin queue.
Catherine sets the MFP flag earlier in software init and uses that flag
to decide if other hardware work-arounds are required which turns
off flow director in MFP mode.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Wed, 8 Jan 2014 10:13:14 +0000 (18:13 +0800)]
openvswitch: Use kmem_cache_free() instead of kfree()
memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().
Fixes: e298e5057006 ('openvswitch: Per cpu flow stats.') Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Thu, 9 Jan 2014 18:42:43 +0000 (18:42 +0000)]
netfilter: nf_tables: rename nft_do_chain_pktinfo() to nft_do_chain()
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Thu, 9 Jan 2014 18:42:38 +0000 (18:42 +0000)]
netfilter: nf_tables: minor nf_chain_type cleanups
Minor nf_chain_type cleanups:
- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Thu, 9 Jan 2014 18:42:34 +0000 (18:42 +0000)]
netfilter: nf_tables: fix chain type module reference handling
The chain type module reference handling makes no sense at all: we take
a reference immediately when the module is registered, preventing the
module from ever being unloaded.
Fix by taking a reference when we're actually creating a chain of the
chain type and release the reference when destroying the chain.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chain counter validation is performed after the chain policy has
potentially been changed. Move counter validation/setting before
changing of the chain policy to fix this.
Additionally fix a memory leak if chain counter allocation fails
for new chains, remove an unnecessary free_percpu() and move
counter allocation for new chains
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Thu, 9 Jan 2014 18:42:31 +0000 (18:42 +0000)]
netfilter: nf_tables: split chain policy validation from actually setting it
Currently nf_tables_newchain() atomicity is broken because of having
validation of some netlink attributes performed after changing attributes
of the chain. The chain policy is (currently) fine, but split it up as
preparation for the following fixes and to avoid future mistakes.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nft_meta: fix lack of validation of the input register
We have to validate that the input register is in the range of
allowed registers, otherwise we can take a incorrect register
value as input that may lead us to a crash.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nft_ct: Add support to set the connmark
This patch adds kernel support for setting properties of tracked
connections. Currently, only connmark is supported. One use-case
for this feature is to provide the same functionality as
-j CONNMARK --save-mark in iptables.
Some restructuring was needed to implement the set op. The new
structure follows that of nft_meta.
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fast Channel Change across bands was enabled for
AR9462 recently, but this is causing baseband issues.
Disable it until this feature is tested well. Also,
remove the feature bit for AR9565 since it is
a single-band card and doesn't support this feature.
Cc: stable@vger.kernel.org Reported-by: Oleksij Rempel <linux@rempel-privat.de> Signed-off-by: Sujith Manoharan <c_manoha@qca.qualcomm.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Accessing the current channel definition in mac80211
when processing RX packets is problematic because it
could have been updated when a scan is issued. Since a
channel change involves flushing the existing packets
in the RX queue before a chip-reset is done, they would
be processed using the wrong band/channel information.
To avoid this, use the current channel information
maintained in the driver.
Cc: stable@vger.kernel.org Reported-by: Oleksij Rempel <linux@rempel-privat.de> Signed-off-by: Sujith Manoharan <c_manoha@qca.qualcomm.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Bing Zhao [Wed, 8 Jan 2014 23:45:56 +0000 (15:45 -0800)]
mwifiex: fix potential buffer overflow in dt configuration
If cfgdata length exceeds the command buffer size we will end up
getting buffer overflow problem. Fix it by checking the buffer
size less the command header length.
Reviewed-by: Paul Stewart <pstew@chromium.org> Signed-off-by: Bing Zhao <bzhao@marvell.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
As soon as skb is ready to be reaped, prefetch 1-st cache line.
This accelerates data access that is performed later, during the
packet classification by the driver and IP stack.
Signed-off-by: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Use hardware capabilities to limit IRQ generation to about 15 per msec
It corresponds to about 7 packets/IRQ when running iperf with default
parameters at 1.3Gbps
Do not enable this feature in the sniffer (monitor) mode, because
interrupt moderation cause timestamp accuracy deterioration.
For the sniffer flow, it is important to get precise timestamp.
Signed-off-by: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Pavel Machek [Tue, 7 Jan 2014 12:13:28 +0000 (13:13 +0100)]
wl1251: fix NULL pointer dereference
wl1251: fix NULL pointer dereference
Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Reported-by: Felipe Contreras <felipe.contreras@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:11:27 +0000 (13:11 +0100)]
wl1251: enforce changed hw encryption support on monitor state change
The firmware doesn't support per packet encryption selection, so disable hw
encryption support completely while a monitor interface is present to support
injection of packets (which shouldn't get encrypted by hw).
To enforce the changed hw encryption support force a disassociation on
non-monitor interfaces.
For disassociation a workaround using hw connection monitor is employed,
which temporary enables hw connection manager flag.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:10:51 +0000 (13:10 +0100)]
wl1251: disable retry and ACK policy for injected packets
Set the retry limit to 0 and disable the ACK policy for injected packets.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:10:14 +0000 (13:10 +0100)]
wl1251: enable tx path in monitor mode if necessary for packet injection
If necessary enable the tx path in monitor mode for packet injection using
the JOIN command with BSS_TYPE_STA_BSS and zero BSSID.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:09:42 +0000 (13:09 +0100)]
wl1251: fix channel switching in monitor mode
Use the ENABLE_RX command for channel switching when no interface is present
(monitor mode only).
The advantage of ENABLE_RX is that it leaves the tx data path disabled in
firmware, whereas the usual JOIN command seems to transmit some frames at
firmware level.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:09:06 +0000 (13:09 +0100)]
wl1251: disable power saving in monitor mode
Force power saving off while monitor interface is present.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Port multicast address filtering from wl1271 driver.
It sets up the hardware multicast address filter in configure_filter() with
addresses supplied through prepare_multicast().
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:07:52 +0000 (13:07 +0100)]
wl1251: configure hardware en-/decryption for monitor mode
Disable hardware encryption (DF_ENCRYPTION_DISABLE) and decryption
(DF_SNIFF_MODE_ENABLE) via wl1251_acx_feature_cfg while monitor interface is
present.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:06:58 +0000 (13:06 +0100)]
wl1251: split RX and TX data path initialisation
Split up data path initialisation into RX and TX data path initialisation
functions. This change is required for channel switching in monitor mode.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:06:24 +0000 (13:06 +0100)]
wl1251: implement hardware ARP filtering
Update hardware ARP filter configuration on BSS_CHANGED_ARP_FILTER
notification from mac80211.
Ported from wl1271 driver.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:05:45 +0000 (13:05 +0100)]
wl1251: retry power save entry
Port of the power save entry retry code from wl1251 driver version included
in the Maemo Fremantle kernel.
This tries to enable power save mode up to 3 times before failing.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
David Gnedt [Tue, 7 Jan 2014 12:04:53 +0000 (13:04 +0100)]
wl1251: fix scan behaviour while not associated
With a dissasociated card I often encoutered very long scan delays.
My guess is that it has something to do with the cards DTIM handling and
another firmware bug mentioned in the TI WLAN driver, which is described as
the card may never end scanning if the channel is overloaded because it
can't send probe requests. I think the firmware somehow also tries to
receive DTIM messages when the BSSID is not set. Therefore most of the time
it waits for DTIM messages and can't do scanning work.
Anyway we can workaround this misbehaviour by setting the HIGH_PRIORITY
bit for scans in disassociated state.
Signed-off-by: David Gnedt <david.gnedt@davizone.at> Signed-off-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Jesse Brandeburg [Wed, 18 Dec 2013 13:46:02 +0000 (13:46 +0000)]
i40e: Add a dummy packet template
The hardware requires a full packet template to be pointed to when adding
hardware flow filters. This patch adds the template and uses it for
programming filters.
Mitch Williams [Wed, 18 Dec 2013 13:45:59 +0000 (13:45 +0000)]
i40e: shorten wordy fields
The alloc_rx_buff_failed and alloc_rx_page_failed variables
are both part of an rx specific structure so just remove
the _rx part of the name. No functional changes.
Change-ID: Icffa2f5d13c6f2b1e09cf45b9472b83c9dae8fc6 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 18 Dec 2013 13:45:57 +0000 (13:45 +0000)]
i40e: remove interrupt on AQ error
Since nearly everything we do is synchronous, using the interrupt-on-error bit
is unnecessary and causing unneeded interrupts. If anyone wants to use the bit
they can turn it on in individual AQ requests using the cmd_details parameter.
Change-ID: I4690a9c561d3e0836aeadb4f88f8a8702b1d1366 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 18 Dec 2013 13:45:56 +0000 (13:45 +0000)]
i40e: release NVM resource reservation on startup
Call AQ to release any reservation held by this PF on the NVM resource
lock on startup, in order to clear anything that might have been left over from
a previous run. The lock is only cleared by the requestor calling for it to be
cleared, on power-on, or firmware reset. This should help limit the need for
rebooting a customer machine if something goes wrong on a firmware update or
some other action.
Change-ID: I8c8473e601d4ef512dda7baa77a6e75f2e5fea49 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>