vxlan: Avoid creating fdb entry with NULL destination
Commit afbd8bae9c798c5cdbe4439d3a50536b5438247c
vxlan: add implicit fdb entry for default destination
creates an implicit fdb entry for default destination. This results
in an invalid fdb entry if default destination is not specified.
For ex:
ip link add vxlan1 type vxlan id 100
creates the following fdb entry
00:00:00:00:00:00 dev vxlan1 dst 0.0.0.0 self permanent
This patch fixes this issue by creating an fdb entry only if a
valid default destination is specified.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1b7fdd2ab5852 ("tcp: do not use cached RTT for RTT estimation")
did not correctly account for the fact that crtt is the RTT shifted
left 3 bits. Fix the calculation to consistently reflect this fact.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-By: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
It has recently turned up that we have a number of long standing bugs
in the network stack cleanup code with use of the loopback device
after it has been freed that have not turned up because in most cases
the storage allocated to the loopback device is not reused, when those
accesses happen.
Set looback_dev to NULL to trigger oopses instead of silent data corrupt
when we hit this class of bug.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 17 Sep 2013 01:42:37 +0000 (21:42 -0400)]
Merge branch 'sfc-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc
Ben Hutchings says:
====================
Some bug fixes and future-proofing for the recently added SFC9120
support:
1. Minimal support for the 40G configuration.
2. Disable the incomplete PTP/hardware timestamping support.
3. Reset MAC stats properly after a firmware upgrade.
4. Re-check the datapath firmware capabilities after the controller is
reset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 16 Sep 2013 10:36:02 +0000 (12:36 +0200)]
net: sctp: rfc4443: do not report ICMP redirects to user space
Adapt the same behaviour for SCTP as present in TCP for ICMP redirect
messages. For IPv6, RFC4443, section 2.4. says:
...
(e) An ICMPv6 error message MUST NOT be originated as a result of
receiving the following:
...
(e.2) An ICMPv6 redirect message [IPv6-DISC].
...
Therefore, do not report an error to user space, just invoke dst's redirect
callback and leave, same for IPv4 as done in TCP as well. The implication
w/o having this patch could be that the reception of such packets would
generate a poll notification and in worst case it could even tear down the
whole connection. Therefore, stop updating sk_err on redirects.
Reported-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Suggested-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: usb: cdc_ether: use usb.h macros whenever possible
Use USB_DEVICE_AND_INTERFACE_INFO and USB_VENDOR_AND_INTERFACE_INFO
macros to reduce boilerplate.
Signed-off-by: Fabio Porcedda <fabio.porcedda@gmail.com> Acked-by: Oliver Neukum <oliver@neukum.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: usb: cdc_ether: fix checkpatch errors and warnings
Signed-off-by: Fabio Porcedda <fabio.porcedda@gmail.com> Acked-by: Oliver Neukum <oliver@neukum.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: usb: cdc_ether: Use wwan interface for Telit modules
Signed-off-by: Fabio Porcedda <fabio.porcedda@gmail.com> Cc: <stable@vger.kernel.org> # 3.0+ as far back as it applies cleanly Acked-by: Oliver Neukum <oliver@neukum.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_tunnels: raddr and laddr are inverted in nl msg
IFLA_IPTUN_LOCAL and IFLA_IPTUN_REMOTE were inverted.
Introduced by c075b13098b3 (ip6tnl: advertise tunnel param via rtnl).
Signed-off-by: Ding Zhi <zhi.ding@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bgmac: implement unaligned addressing for DMA rings that support it
This is important patch for new devices that support unaligned
addressing. That devices suffer from the backward-compatibility bug in
DMA engine. In theory we should be able to use old mechanism, but in
practice DMA address seems to be randomly copied into status register
when hardware reaches end of a ring. This breaks reading slot number
from status register and we can't use DMA anymore.
Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Without this patch it is impossible to read et_swtype, because the 1
byte space is needed for the terminating null byte. The max expected
value is 0xF, so now it should be possible to read decimal form ("15")
and hex form ("0xF").
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Some devices (BCM4749, BCM5357, BCM53572) have internal switch that
requires initialization. We already have code for this, but because
of the typo in code it was never working. This resulted in network not
working for some routers and possibility of soft-bricking them.
Use correct bit for switch initialization and fix typo in the define.
Signed-off-by: Rafał Miłecki <zajec5@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Olaf Hering [Fri, 13 Sep 2013 21:52:01 +0000 (14:52 -0700)]
drivers/net/ethernet/ibm/ehea/ehea_main.c: add alias entry for portN properties
Use separate table for alias entries in the ehea module, otherwise the
probe() function will operate on the separate ports instead of the
lhea-"root" entry of the device-tree
[ Thadeu notes that: "... this issue might happen with the generation of
initrd, when the scripts check for /sys/class/net/eth0/device/modalias,
which links to the port device at
/sys/devices/ibmebus/23c00400.lhea/port0/" ]
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Olaf Hering <ohering@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Smith [Mon, 16 Sep 2013 18:30:57 +0000 (20:30 +0200)]
netfilter: ipset: Fix serious failure in CIDR tracking
This fixes a serious bug affecting all hash types with a net element -
specifically, if a CIDR value is deleted such that none of the same size
exist any more, all larger (less-specific) values will then fail to
match. Adding back any prefix with a CIDR equal to or more specific than
the one deleted will fix it.
Steps to reproduce:
ipset -N test hash:net
ipset -A test 1.1.0.0/16
ipset -A test 2.2.2.0/24
ipset -T test 1.1.1.1 #1.1.1.1 IS in set
ipset -D test 2.2.2.0/24
ipset -T test 1.1.1.1 #1.1.1.1 IS NOT in set
This is due to the fact that the nets counter was unconditionally
decremented prior to the iteration that shifts up the entries. Now, we
first check if there is a proceeding entry and if not, decrement it and
return. Otherwise, we proceed to iterate and then zero the last element,
which, in most cases, will already be zero.
Signed-off-by: Oliver Smith <oliver@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
netfilter: ipset: Consistent userspace testing with nomatch flag
The "nomatch" commandline flag should invert the matching at testing,
similarly to the --return-nomatch flag of the "set" match of iptables.
Until now it worked with the elements with "nomatch" flag only. From
now on it works with elements without the flag too, i.e:
# ipset n test hash:net
# ipset a test 10.0.0.0/24 nomatch
# ipset t test 10.0.0.1
10.0.0.1 is NOT in set test.
# ipset t test 10.0.0.1 nomatch
10.0.0.1 is in set test.
# ipset a test 192.168.0.0/24
# ipset t test 192.168.0.1
192.168.0.1 is in set test.
# ipset t test 192.168.0.1 nomatch
192.168.0.1 is NOT in set test.
Before the patch the results were
...
# ipset t test 192.168.0.1
192.168.0.1 is in set test.
# ipset t test 192.168.0.1 nomatch
192.168.0.1 is in set test.
Wei Yang [Sun, 15 Sep 2013 13:53:00 +0000 (21:53 +0800)]
cxgb4: remove workqueue when driver registration fails
When driver registration fails, we need to clean up the resources allocated
before. cxgb4 missed to destroy the workqueue allocated at the very beginning.
This patch destroies the workqueue when registration fails.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Horman [Fri, 13 Sep 2013 15:05:33 +0000 (11:05 -0400)]
bonding: Make alb learning packet interval configurable
running bonding in ALB mode requires that learning packets be sent periodically,
so that the switch knows where to send responding traffic. However, depending
on switch configuration, there may not be any need to send traffic at the
default rate of 3 packets per second, which represents little more than wasted
data. Allow the ALB learning packet interval to be made configurable via sysfs
Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Acked-by: Veaceslav Falico <vfalico@redhat.com> CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Fri, 13 Sep 2013 15:00:58 +0000 (18:00 +0300)]
atm: nicstar: fix regression made by previous patch
The commit 8390f814 "atm: nicstar: re-use native mac_pton() helper" did a
usefull thing. However, mac_pton() returns 1 in the case of the successfully
parsed input. This patch fixes a typo.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes sparse warnings when incorrectly handling the port number
and using int instead of unsigned int iterating through &vn->sock_list[].
Keeping the port as __be16 also makes things clearer wrt endianess.
Also, it was pointed out that vxlan_get_rx_port() had unnecessary checks
which got removed.
Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
o At the time of firmware hang "adapter->need_fw_reset" variable gets
set but after re-initialization of firmware OR at the time of VF
re-initialization that variable was not getting cleared which
was leading to failure in VF reset recovery.Fix it by clearing
this variable before re-initializing VF
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Hong Zhiguo [Sat, 14 Sep 2013 14:42:28 +0000 (22:42 +0800)]
bridge: fix NULL pointer deref of br_port_get_rcu
The NULL deref happens when br_handle_frame is called between these
2 lines of del_nbp:
dev->priv_flags &= ~IFF_BRIDGE_PORT;
/* --> br_handle_frame is called at this time */
netdev_rx_handler_unregister(dev);
In br_handle_frame the return of br_port_get_rcu(dev) is dereferenced
without check but br_port_get_rcu(dev) returns NULL if:
!(dev->priv_flags & IFF_BRIDGE_PORT)
Eric Dumazet pointed out the testing of IFF_BRIDGE_PORT is not necessary
here since we're in rcu_read_lock and we have synchronize_net() in
netdev_rx_handler_unregister. So remove the testing of IFF_BRIDGE_PORT
and by the previous patch, make sure br_port_get_rcu is called in
bridging code.
Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Mason [Wed, 11 Sep 2013 18:22:40 +0000 (11:22 -0700)]
tg3: Use pci_dev pm_cap
Use the already existing pm_cap variable in struct pci_dev for
determining the power management offset. This saves the driver from
having to keep track of an extra variable.
Signed-off-by: Jon Mason <jdmason@kudzu.us> Cc: Nithin Nayak Sujir <nsujir@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Mason [Wed, 11 Sep 2013 18:22:39 +0000 (11:22 -0700)]
bnx2x: Use pci_dev pm_cap
Use the already existing pm_cap variable in struct pci_dev for
determining the power management offset. This saves the driver from
having to keep track of an extra variable.
Signed-off-by: Jon Mason <jdmason@kudzu.us> Cc: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yijing Wang [Wed, 11 Sep 2013 02:07:00 +0000 (10:07 +0800)]
alx: remove redundant D0 power state set
Pci_enable_device_mem() will set device power state to D0,
so it's no need to do it again in alx_probe().
Also remove redundant PM Cap find code, because pci core
has been saved the pci device pm cap value.
Signed-off-by: Yijing Wang <wangyijing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid unneeded local string buffers for constructing debug output. Also
cleans up debug calls that contain a single parameter so that they cannot
be accidentally parsed as format strings.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Libo Chen <libo.chen@huawei.com> Cc: Chas Williams <chas@cmf.nrl.navy.mil> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 13 Sep 2013 23:35:24 +0000 (19:35 -0400)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to ixgbe and e1000e.
Jacob provides a ixgbe patch to fix the configure_rx patch to properly
disable RSC hardware logic when a user disables it. Previously we only
disabled RSC in the queue settings, but this does not fully disable
hardware RSC logic which can lead to unexpected performance issues.
Emil provides three fixes for ixgbe. First fixes the ethtool loopback
test when DCB is enabled, where the frames may be modified on Tx
(by adding VLAN tag) which will fail the check on receive. Then a fix
for QSFP+ modules, limit the speed setting to advertise only one speed
at a time since the QSFP+ modules do not support auto negotiation.
Lastly, resolve an issue where the driver will display incorrect info
for QSFP+ modules that were inserted after the driver has been loaded.
David Ertman provides to fixes for e1000e, one removes a comparison to
the boolean value true where evaluating the lvalue will produce the
same result. The other fixes an error in the calculation of the
rar_entry_count, which causes a write of unkown/undefined register
space in the MAC to unknown/undefined register space in the PHY.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ertman [Thu, 5 Sep 2013 04:24:25 +0000 (04:24 +0000)]
e1000e: fix overrun of PHY RAR array
When copying the MAC RAR registers to PHY there is an error in the
calculation of the rar_entry_count, which causes a write of unknown/
undefined register space in the MAC to unknown/undefined register space in
the PHY.
This patch fixes the overrun with writing to the PHY RAR and also fixes the
ethtool offline register tests so that the correctly addressed registers
have the appropriate bitmasks for R/W and RO bits for affected parts.
Shawn Rader gets credit for finding and fixing the register overrun.
Signed-off-by: Dave Ertman <davidx.m.ertman@intel.com> CC: Shawn Rader <shawn.t.rader@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David Ertman [Fri, 30 Aug 2013 05:45:25 +0000 (05:45 +0000)]
e1000e: cleanup boolean comparison to true
Removing a comparison to the boolean value true where simply interrogating
the lvalue will produce the same result.
Signed-off-by: David Ertman <davidx.m.ertman@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Sat, 31 Aug 2013 03:08:20 +0000 (03:08 +0000)]
ixgbe: fix ethtool reporting of supported links for SFP modules
This patch resolves an issue where the driver will display incorrect info
for Q/SFP+ modules that were inserted after the driver has been loaded.
This patch adds a call to identify_phy() in ixgbe_get_settings() prior to
calling get_link_capabilities() which needs the PHY data in order to
determine the correct settings.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Fri, 30 Aug 2013 07:55:24 +0000 (07:55 +0000)]
ixgbe: limit setting speed to only one at a time for QSFP modules
QSFP+ modules do not support auto negotiation and should advertise only
one speed at a time.
This patch adds logic in ethtool to allow setting and reporting the
advertised speed at either 1Gbps or 10Gbps, but not both. Also limits
the speed set in ixgbe_sfp_link_config_subtask() to highest supported.
Previously the link was set to whatever the supported speeds were.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Tue, 23 Jul 2013 01:56:58 +0000 (01:56 +0000)]
ixgbe: fix ethtool loopback diagnostic with DCB enabled
This patch disables DCB prior to running the loopback test.
When DCB is enabled the frames may be modified on Tx (by adding vlan tag)
which will fail the check on Rx.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Tested-by: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 17 Jul 2013 02:53:23 +0000 (02:53 +0000)]
ixgbe: fully disable hardware RSC logic when disabling RSC
This patch modifies the configure_rx path in order to properly disable RSC
hardware logic when the user disables it. Previously we only disabled RSC in the
queue settings, but this does not fully disable hardware RSC logic which can
lead to some unexpected performance issues.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Phil Oester [Fri, 13 Sep 2013 01:04:16 +0000 (18:04 -0700)]
netfilter: nf_nat_proto_icmpv6:: fix wrong comparison in icmpv6_manip_pkt
In commit 58a317f1 (netfilter: ipv6: add IPv6 NAT support), icmpv6_manip_pkt
was added with an incorrect comparison of ICMP codes to types. This causes
problems when using NAT rules with the --random option. Correct the
comparison.
This closes netfilter bugzilla #851, reported by Alexander Neumann.
Signed-off-by: Phil Oester <kernel@linuxace.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Michal Kubeček [Wed, 11 Sep 2013 08:17:27 +0000 (10:17 +0200)]
netfilter: nf_conntrack: use RCU safe kfree for conntrack extensions
Commit 68b80f11 (netfilter: nf_nat: fix RCU races) introduced
RCU protection for freeing extension data when reallocation
moves them to a new location. We need the same protection when
freeing them in nf_ct_ext_free() in order to prevent a
use-after-free by other threads referencing a NAT extension data
via bysource list.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
net/mlx4_en: Check device state when setting coalescing
When the device is down, CQs are freed. We must check the device state
to avoid issuing firmware commands on non existing CQs.
CC: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Thu, 12 Sep 2013 07:12:05 +0000 (17:12 +1000)]
bridge: Clamp forward_delay when enabling STP
At some point limits were added to forward_delay. However, the
limits are only enforced when STP is enabled. This created a
scenario where you could have a value outside the allowed range
while STP is disabled, which then stuck around even after STP
is enabled.
This patch fixes this by clamping the value when we enable STP.
I had to move the locking around a bit to ensure that there is
no window where someone could insert a value outside the range
while we're in the middle of enabling STP.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Cheers, Signed-off-by: David S. Miller <davem@davemloft.net>
This changes the message_age_timer calculation to use the BPDU's max age as
opposed to the local bridge's max age. This is in accordance with section
8.6.2.3.2 Step 2 of the 802.1D-1998 sprecification.
With the current implementation, when running with very large bridge
diameters, convergance will not always occur even if a root bridge is
configured to have a longer max age.
Tested successfully on bridge diameters of ~200.
Signed-off-by: Chris Healy <cphealy@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch proposes to remove the IRQF_DISABLED flag from
drivers/net/ethernet/dec/tulip/de4x5.c
It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Acked-by: Grant Grundler <grundler@parisc-linux.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch proposes to remove the IRQF_DISABLED flag from
drivers/net/ethernet/ibm/ehea/ehea_main.c
It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch proposes to remove the IRQF_DISABLED flag from
drivers/net/ethernet/adi/bfin_mac.c.
It's a NOOP since 2.6.35 and it will be removed one day.
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Reviewed-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Vrabel [Wed, 11 Sep 2013 13:52:48 +0000 (14:52 +0100)]
xen-netback: count number required slots for an skb more carefully
When a VM is providing an iSCSI target and the LUN is used by the
backend domain, the generated skbs for direct I/O writes to the disk
have large, multi-page skb->data but no frags.
With some lengths and starting offsets, xen_netbk_count_skb_slots()
would be one short because the simple calculation of
DIV_ROUND_UP(skb_headlen(), PAGE_SIZE) was not accounting for the
decisions made by start_new_rx_buffer() which does not guarantee
responses are fully packed.
For example, a skb with length < 2 pages but which spans 3 pages would
be counted as requiring 2 slots but would actually use 3 slots.
Miscounting the number of slots means netback may push more responses
than the number of available requests. This will cause the frontend
to get very confused and report "Too many frags/slots". The frontend
never recovers and will eventually BUG.
Fix this by counting the number of required slots more carefully. In
xen_netbk_count_skb_slots(), more closely follow the algorithm used by
xen_netbk_gop_skb() by introducing xen_netbk_count_frag_slots() which
is the dry-run equivalent of netbk_gop_frag_copy().
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 989038e217e94161862a959e82f9a1ecf8dda152 ("tg3: Don't turn off
led on 5719 serdes port 0") added code to skip turning led off on port
0 of the 5719 since it powered down other ports. This workaround needs
to be enabled on the 5720 as well.
Cc: stable@vger.kernel.org Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 11 Sep 2013 14:58:36 +0000 (16:58 +0200)]
net: sctp: fix ipv6 ipsec encryption bug in sctp_v6_xmit
Alan Chester reported an issue with IPv6 on SCTP that IPsec traffic is not
being encrypted, whereas on IPv4 it is. Setting up an AH + ESP transport
does not seem to have the desired effect:
SCTP + IPv4:
22:14:20.809645 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 116)
192.168.0.2 > 192.168.0.5: AH(spi=0x00000042,sumlen=16,seq=0x1): ESP(spi=0x00000044,seq=0x1), length 72
22:14:20.813270 IP (tos 0x2,ECT(0), ttl 64, id 0, offset 0, flags [DF], proto AH (51), length 340)
192.168.0.5 > 192.168.0.2: AH(spi=0x00000043,sumlen=16,seq=0x1):
This problem was seen with both Racoon and Racoon2. Other people have seen
this with OpenSwan. When IPsec is configured to encrypt all upper layer
protocols the SCTP connection does not initialize. After using Wireshark to
follow packets, this is because the SCTP packet leaves Box A unencrypted and
Box B believes all upper layer protocols are to be encrypted so it drops
this packet, causing the SCTP connection to fail to initialize. When IPsec
is configured to encrypt just SCTP, the SCTP packets are observed unencrypted.
In fact, using `socat sctp6-listen:3333 -` on one end and transferring "plaintext"
string on the other end, results in cleartext on the wire where SCTP eventually
does not report any errors, thus in the latter case that Alan reports, the
non-paranoid user might think he's communicating over an encrypted transport on
SCTP although he's not (tcpdump ... -X):
Only in /proc/net/xfrm_stat we can see XfrmInTmplMismatch increasing on the
receiver side. Initial follow-up analysis from Alan's bug report was done by
Alexey Dobriyan. Also thanks to Vlad Yasevich for feedback on this.
SCTP has its own implementation of sctp_v6_xmit() not calling inet6_csk_xmit().
This has the implication that it probably never really got updated along with
changes in inet6_csk_xmit() and therefore does not seem to invoke xfrm handlers.
SCTP's IPv4 xmit however, properly calls ip_queue_xmit() to do the work. Since
a call to inet6_csk_xmit() would solve this problem, but result in unecessary
route lookups, let us just use the cached flowi6 instead that we got through
sctp_v6_get_dst(). Since all SCTP packets are being sent through sctp_packet_transmit(),
we do the route lookup / flow caching in sctp_transport_route(), hold it in
tp->dst and skb_dst_set() right after that. If we would alter fl6->daddr in
sctp_v6_xmit() to np->opt->srcrt, we possibly could run into the same effect
of not having xfrm layer pick it up, hence, use fl6_update_dst() in sctp_v6_get_dst()
instead to get the correct source routed dst entry, which we assign to the skb.
Also source address routing example from 625034113 ("sctp: fix sctp to work with
ipv6 source address routing") still works with this patch! Nevertheless, in RFC5095
it is actually 'recommended' to not use that anyway due to traffic amplification [1].
So it seems we're not supposed to do that anyway in sctp_v6_xmit(). Moreover, if
we overwrite the flow destination here, the lower IPv6 layer will be unable to
put the correct destination address into IP header, as routing header is added in
ipv6_push_nfrag_opts() but then probably with wrong final destination. Things aside,
result of this patch is that we do not have any XfrmInTmplMismatch increase plus on
the wire with this patch it now looks like:
This fixes Kernel Bugzilla 24412. This security issue seems to be present since
2.6.18 kernels. Lets just hope some big passive adversary in the wild didn't have
its fun with that. lksctp-tools IPv6 regression test suite passes as well with
this patch.
Reported-by: Alan Chester <alan.chester@tekelec.com> Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
- memory of tun security were leaked
- use after free since the flow gc timer was not deleted and the tfile
were not detached
This patch solves the above issues.
Reported-by: Wannes Rombouts <wannes.rombouts@epitech.eu> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
igb: Read flow control for i350 from correct EEPROM section
Flow control is defined in the four EEPROM sections but the driver only reads
from section 0.
Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
igb: Add additional get_phy_id call for i354 devices
This patch fixes a problem where some ports can fail to initialize on a
cold boot. This patch adds an additional call to read the PHY id for i354
devices in order workaround the hardware problem.
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 12 Sep 2013 07:45:43 +0000 (03:45 -0400)]
Merge tag 'master-2013-09-09' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless
John W. Linville says:
====================
This is a pull request for a few early fixes for the 3.12 stream.
Alexey Khoroshilov corrects a use-after-free issue on rtl8187 found
by the Linux Driver Verification project.
Arend van Spriel provides a brcmfmac patch to fix a build issue
reported by Randy Dunlap.
Hauke Mehrtens offers a bcma fix to properly account for the storage
width of error code values before checking them.
Solomon Peachy brings a pair of cw1200 fixes to avoid hangs in that
driver with SPI devices. One avoids transfers in interrupt context,
the other fixes a locking issue.
Stanislaw Gruszka changes the initialization of the rt2800 driver to
avoid a freeze, addressing a bug in the Red Hat bugzilla.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
RTL_GIGA_MAC_VER_36 - 8168f as well - has not been reported to behave the
same.
Tested-by: David R <david@unsolicited.net> Tested-by: Frédéric Leroy <fredo@starox.org> Cc: Hayes Wang <hayeswang@realtek.com> Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts the Linux for Workgroups thing. And no, before somebody
asks, we're not doing Linux95. Not for a few years, at least.
Sure, the flag added some color to the logo, and could have remained as
a testament to my leet gimp skills. But no. And I'll do this early, to
avoid the chance of forgetting when I'm doing the actual rc1 release on
the road.
Merge tag 'ecryptfs-3.12-rc1-crypt-ctx' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs fixes from Tyler Hicks:
"Two small fixes to the code that initializes the per-file crypto
contexts"
* tag 'ecryptfs-3.12-rc1-crypt-ctx' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
ecryptfs: avoid ctx initialization race
ecryptfs: remove check for if an array is NULL
Merge branch 'for-v3.12-fix' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
Pull DMA-mapping fix from Marek Szyprowski:
"A build bugfix for the device tree support for reserved memory
regions. Due to superfluous include the common code failed to build
on ARM64 and MIPS architectures.
The patch that caused the build break has lived at linux-next for
about two weeks and noone noticed the issue, what convinced me that
everything was ok"
* 'for-v3.12-fix' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
drivers: of: fix build break if asm/dma-contiguous.h is missing
Merge tag 'for-3.12' of git://git.linaro.org/people/sumitsemwal/linux-dma-buf
Pull dma-buf updates from Sumit Semwal:
"Yet another small one - dma-buf framework now supports size discovery
of the buffer via llseek"
* tag 'for-3.12' of git://git.linaro.org/people/sumitsemwal/linux-dma-buf:
dma-buf: Expose buffer size to userspace (v2)
dma-buf: Check return value of anon_inode_getfile
Merge first patch-bomb from Andrew Morton:
- Some pidns/fork/exec tweaks
- OCFS2 updates
- Most of MM - there remain quite a few memcg parts which depend on
pending core cgroups changes. Which might have been already merged -
I'll check tomorrow...
- Various misc stuff all over the place
- A few block bits which I never got around to sending to Jens -
relatively minor things.
- MAINTAINERS maintenance
- A small number of lib/ updates
- checkpatch updates
- epoll
- firmware/dmi-scan
- Some kprobes work for S390
- drivers/rtc updates
- hfsplus feature work
- vmcore feature work
- rbtree upgrades
- AOE updates
- pktcdvd cleanups
- PPS
- memstick
- w1
- New "inittmpfs" feature, which does the obvious
- More IPC work from Davidlohr.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (303 commits)
lz4: fix compression/decompression signedness mismatch
ipc: drop ipc_lock_check
ipc, shm: drop shm_lock_check
ipc: drop ipc_lock_by_ptr
ipc, shm: guard against non-existant vma in shmdt(2)
ipc: document general ipc locking scheme
ipc,msg: drop msg_unlock
ipc: rename ids->rw_mutex
ipc,shm: shorten critical region for shmat
ipc,shm: cleanup do_shmat pasta
ipc,shm: shorten critical region for shmctl
ipc,shm: make shmctl_nolock lockless
ipc,shm: introduce shmctl_nolock
ipc: drop ipcctl_pre_down
ipc,shm: shorten critical region in shmctl_down
ipc,shm: introduce lockless functions to obtain the ipc object
initmpfs: use initramfs if rootfstype= or root= specified
initmpfs: make rootfs use tmpfs when CONFIG_TMPFS enabled
initmpfs: move rootfs code from fs/ramfs/ to init/
initmpfs: move bdi setup from init_rootfs to init_ramfs
...
LZ4 compression and decompression functions require different in
signedness input/output parameters: unsigned char for compression and
signed char for decompression.
Change decompression API to require "(const) unsigned char *".
After previous cleanups and optimizations, this function is no longer
heavily used and we don't have a good reason to keep it. Update the few
remaining callers and get rid of it.
As suggested by Andrew, add a generic initial locking scheme used
throughout all sysv ipc mechanisms. Documenting the ids rwsem, how rcu
can be enough to do the initial checks and when to actually acquire the
kern_ipc_perm.lock spinlock.
I found that adding it to util.c was generic enough.
Clean up some of the messy do_shmat() spaghetti code, getting rid of
out_free and out_put_dentry labels. This makes shortening the critical
region of this function in the next patch a little easier to do and read.
With the *_INFO, *_STAT, IPC_RMID and IPC_SET commands already optimized,
deal with the remaining SHM_LOCK and SHM_UNLOCK commands. Take the
shm_perm lock after doing the initial auditing and security checks. The
rest of the logic remains unchanged.
While the INFO cmd doesn't take the ipc lock, the STAT commands do acquire
it unnecessarily. We can do the permissions and security checks only
holding the rcu lock.
Similar to semctl and msgctl, when calling msgctl, the *_INFO and *_STAT
commands can be performed without acquiring the ipc object.
Add a shmctl_nolock() function and move the logic of *_INFO and *_STAT out
of msgctl(). Since we are just moving functionality, this change still
takes the lock and it will be properly lockless in the next patch.
ipc,shm: introduce lockless functions to obtain the ipc object
This is the third and final patchset that deals with reducing the amount
of contention we impose on the ipc lock (kern_ipc_perm.lock). These
changes mostly deal with shared memory, previous work has already been
done for semaphores and message queues:
With these patches applied, a custom shm microbenchmark stressing shmctl
doing IPC_STAT with 4 threads a million times, reduces the execution
time by 50%. A similar run, this time with IPC_SET, reduces the
execution time from 3 mins and 35 secs to 27 seconds.
Patches 1-8: replaces blindly taking the ipc lock for a smarter
combination of rcu and ipc_obtain_object, only acquiring the spinlock
when updating.
Patch 9: renames the ids rw_mutex to rwsem, which is what it already was.
Patch 10: is a trivial mqueue leftover cleanup
Patch 11: adds a brief lock scheme description, requested by Andrew.
This patch:
Add shm_obtain_object() and shm_obtain_object_check(), which will allow us
to get the ipc object without acquiring the lock. Just as with other
forms of ipc, these functions are basically wrappers around
ipc_obtain_object*().
Rob Landley [Wed, 11 Sep 2013 21:26:13 +0000 (14:26 -0700)]
initmpfs: use initramfs if rootfstype= or root= specified
Command line option rootfstype=ramfs to obtain old initramfs behavior, and
use ramfs instead of tmpfs for stub when root= defined (for cosmetic
reasons).
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Rob Landley <rob@landley.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jim Cromie <jim.cromie@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rob Landley [Wed, 11 Sep 2013 21:26:10 +0000 (14:26 -0700)]
initmpfs: move rootfs code from fs/ramfs/ to init/
When the rootfs code was a wrapper around ramfs, having them in the same
file made sense. Now that it can wrap another filesystem type, move it in
with the init code instead.
This also allows a subsequent patch to access rootfstype= command line
arg.
Signed-off-by: Rob Landley <rob@landley.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jim Cromie <jim.cromie@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rob Landley [Wed, 11 Sep 2013 21:26:08 +0000 (14:26 -0700)]
initmpfs: move bdi setup from init_rootfs to init_ramfs
Even though ramfs hasn't got a backing device, commit e0bf68ddec4f ("mm:
bdi init hooks") added one anyway, and put the initialization in
init_rootfs() since that's the first user, leaving it out of init_ramfs()
to avoid duplication.
But initmpfs uses init_tmpfs() instead, so move the init into the
filesystem's init function, add a "once" guard to prevent duplicate
initialization, and call the filesystem init from rootfs init.
This goes part of the way to allowing ramfs to be built as a module.
[akpm@linux-foundation.org; using bit 1 was odd] Signed-off-by: Rob Landley <rob@landley.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Stephen Warren <swarren@nvidia.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jim Cromie <jim.cromie@gmail.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 11 Sep 2013 21:26:05 +0000 (14:26 -0700)]
lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interrupt
With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
one such possible user), the following race can happen:
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
<interrupt>
...
radix_tree_preload()
...
radix_tree_insert()
radix_tree_node_alloc()
if (rtp->nr) {
ret = rtp->nodes[rtp->nr - 1];
And we give out one radix tree node twice. That clearly results in radix
tree corruption with different results (usually OOPS) depending on which
two users of radix tree race.
We fix the problem by making radix_tree_node_alloc() always allocate fresh
radix tree nodes when in interrupt. Using preloading when in interrupt
doesn't make sense since all the allocations have to be atomic anyway and
we cannot steal nodes from process-context users because some users rely
on radix_tree_insert() succeeding after radix_tree_preload().
in_interrupt() check is somewhat ugly but we cannot simply key off passed
gfp_mask as that is acquired from root_gfp_mask() and thus the same for
all preload users.
Another part of the fix is to avoid node preallocation in
radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
preallocation in such case doesn't make sense and when preallocation would
happen in interrupt we could possibly leak some allocated nodes. However,
some users of radix_tree_preload() require following radix_tree_insert()
to succeed. To avoid unexpected effects for these users,
radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
and we provide a new function radix_tree_maybe_preload() for those users
which get different gfp mask from different call sites and which are
prepared to handle radix_tree_insert() failure.
Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The driver core clears the driver data to NULL after device_release or on
probe failure. Thus, it is not needed to manually clear the device driver
data to NULL.