Recently many changes have been done inside the driver
so this patch updates the driver's doc for example reviewing
information for the rx and tx processes that are managed
by napi method, adding new information for missing glue-logic files
etc.
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 18 Nov 2014 07:06:20 +0000 (23:06 -0800)]
tcp: make connect() mem charging friendly
While working on sk_forward_alloc problems reported by Denys
Fedoryshchenko, we found that tcp connect() (and fastopen) do not call
sk_wmem_schedule() for SYN packet (and/or SYN/DATA packet), so
sk_forward_alloc is negative while connect is in progress.
We can fix this by calling regular sk_stream_alloc_skb() both for the
SYN packet (in tcp_connect()) and the syn_data packet in
tcp_send_syn_data()
Then, tcp_send_syn_data() can avoid copying syn_data as we simply
can manipulate syn_data->cb[] to remove SYN flag (and increment seq)
Instead of open coding memcpy_fromiovecend(), simply use this helper.
This leaves in socket write queue clean fast clone skbs.
This was tested against our fastopen packetdrill tests.
Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Tue, 18 Nov 2014 05:20:41 +0000 (13:20 +0800)]
tun: return NET_XMIT_DROP for dropped packets
After commit 5d097109257c03a71845729f8db6b5770c4bbedc
("tun: only queue packets on device"), NETDEV_TX_OK was returned for
dropped packets. This will confuse pktgen since dropped packets were
counted as sent ones.
Fixing this by returning NET_XMIT_DROP to let pktgen count it as error
packet.
Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rick Jones [Mon, 17 Nov 2014 22:04:29 +0000 (14:04 -0800)]
icmp: Remove some spurious dropped packet profile hits from the ICMP path
If icmp_rcv() has successfully processed the incoming ICMP datagram, we
should use consume_skb() rather than kfree_skb() because a hit on the likes
of perf -e skb:kfree_skb is not called-for.
Signed-off-by: Rick Jones <rick.jones2@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 18 Nov 2014 20:24:53 +0000 (15:24 -0500)]
Merge tag 'linux-can-next-for-3.19-20141117' of git://gitorious.org/linux-can/linux-can-next
Marc Kleine-Budde says:
====================
this is a pull request of 9 patches for net-next/master.
All 9 patches are by Roger Quadros and update the c_can platform
driver. First by improving the initialization sequence of the message
RAM, making use of syscon/regmap. In the later patches support for
various TI SoCs is added.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series is a followup to:
<1415350967-2238-1-git-send-email-LW@KARO-electronics.de>
[PATCHv4 1/1] net: fec: fix regression on i.MX28 introduced by rx_copybreak support
to apply the cleanup patches that were originally sent along with the
bugfix patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lothar Waßmann [Mon, 17 Nov 2014 09:51:22 +0000 (10:51 +0100)]
net: fec: use swab32s() instead of cpu_to_be32()
when swap_buffer() is being called, we know for sure, that we need to
byte swap the data. Furthermore, this function is called for swapping
data in both directions. Thus cpu_to_be32() is semantically not
correct for all use cases. Use swab32s() to reflect this.
Signed-off-by: Lothar Waßmann <LW@KARO-electronics.de> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 18 Nov 2014 18:44:06 +0000 (13:44 -0500)]
Merge branch 'ebpf_maps'
Alexei Starovoitov says:
====================
implementation of eBPF maps
v1->v2:
renamed flags for MAP_UPDATE_ELEM command to be more concise,
clarified commit logs and improved comments in patches 1,3,7
per discussions with Daniel
Old v1 cover:
this set of patches adds implementation of HASH and ARRAY types of eBPF maps
which were described in manpage in commit b4fc1a460f30("Merge branch 'bpf-next'")
The difference vs previous version of these patches from August:
- added 'flags' attribute to BPF_MAP_UPDATE_ELEM
- in HASH type implementation removed per-map kmem_cache.
I was doing kmem_cache_create() for every map to enable selective slub
debugging to check for overflows and leaks. Now it's not needed, so just
use normal kmalloc() for map elements.
- added ARRAY type which was mentioned in manpage, but wasn't public yet
- added map testsuite and removed temporary bits from test_stubs
Note, eBPF programs cannot be attached to events yet.
It will come in the next set.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
. check error conditions and sanity of hash and array map APIs
. check large maps (that kernel gracefully switches to vmalloc from kmalloc)
. check multi-process parallel access and stress test
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
fix errno of BPF_MAP_LOOKUP_ELEM command as bpf manpage
described it in commit b4fc1a460f30("Merge branch 'bpf-next'"):
-----
BPF_MAP_LOOKUP_ELEM
int bpf_lookup_elem(int fd, void *key, void *value)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
.value = ptr_to_u64(value),
};
return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
bpf() syscall looks up an element with given key in a map fd.
If element is found it returns zero and stores element's value
into value. If element is not found it returns -1 and sets
errno to ENOENT.
and further down in manpage:
ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that
element with given key was not found.
-----
In general all BPF commands return ENOENT when map element is not found
(including BPF_MAP_GET_NEXT_KEY and BPF_MAP_UPDATE_ELEM with
flags == BPF_MAP_UPDATE_ONLY)
Subsequent patch adds a testsuite to check return values for all of
these combinations.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
add new map type BPF_MAP_TYPE_ARRAY and its implementation
- optimized for fastest possible lookup()
. in the future verifier/JIT may recognize lookup() with constant key
and optimize it into constant pointer. Can optimize non-constant
key into direct pointer arithmetic as well, since pointers and
value_size are constant for the life of the eBPF program.
In other words array_map_lookup_elem() may be 'inlined' by verifier/JIT
while preserving concurrent access to this map from user space
- two main use cases for array type:
. 'global' eBPF variables: array of 1 element with key=0 and value is a
collection of 'global' variables which programs can use to keep the state
between events
. aggregation of tracing events into fixed set of buckets
- all array elements pre-allocated and zero initialized at init time
- key as an index in array and can only be 4 byte
- map_delete_elem() returns EINVAL, since elements cannot be deleted
- map_update_elem() replaces elements in an non-atomic way
(for atomic updates hashtable type should be used instead)
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
add new map type BPF_MAP_TYPE_HASH and its implementation
- maps are created/destroyed by userspace. Both userspace and eBPF programs
can lookup/update/delete elements from the map
- eBPF programs can be called in_irq(), so use spin_lock_irqsave() mechanism
for concurrent updates
- key/value are opaque range of bytes (aligned to 8 bytes)
- user space provides 3 configuration attributes via BPF syscall:
key_size, value_size, max_entries
- map takes care of allocating/freeing key/value pairs
- map_update_elem() must fail to insert new element when max_entries
limit is reached to make sure that eBPF programs cannot exhaust memory
- map_update_elem() replaces elements in an atomic way
- optimized for speed of lookup() which can be called multiple times from
eBPF program which itself is triggered by high volume of events
. in the future JIT compiler may recognize lookup() call and optimize it
further, since key_size is constant for life of eBPF program
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command
the current meaning of BPF_MAP_UPDATE_ELEM syscall command is:
either update existing map element or create a new one.
Initially the plan was to add a new command to handle the case of
'create new element if it didn't exist', but 'flags' style looks
cleaner and overall diff is much smaller (more code reused), so add 'flags'
attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
#define BPF_ANY 0 /* create new element or update existing */
#define BPF_NOEXIST 1 /* create new element if it didn't exist */
#define BPF_EXIST 2 /* update existing element */
bpf_update_elem(fd, key, value, BPF_NOEXIST) call can fail with EEXIST
if element already exists.
bpf_update_elem(fd, key, value, BPF_EXIST) can fail with ENOENT
if element doesn't exist.
Userspace will call it as:
int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
{
union bpf_attr attr = {
.map_fd = fd,
.key = ptr_to_u64(key),
.value = ptr_to_u64(value),
.flags = flags;
};
David S. Miller [Tue, 18 Nov 2014 18:37:29 +0000 (13:37 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2014-11-18
This series contains updates to i40e only.
Shannon provides a patch to clean up the driver to only warn once that
PTP is not supported when linked at 100Mbps.
Mitch provides a fix for i40e where the VF interrupt processing takes
a long time and it is possible that we could lose a VFLR event if it
happens while processing a VFLR on another VF. To correct this situation,
we enable the VFLR interrupt cause before we begin processing any pending
resets.
Neerav provides several patches to update DCB support in i40e. When
there are DCB configuration changes based on DCBx, the firmware suspends
the port's Tx and generates an event to the PF. The PF is then
responsible to reconfigure the PF VSIs and switching topology as per the
updated DCB configuration and then resume the port's Tx by calling the
"Resume Port Tx" AQ command, so add this call to the flow that handles
DCB re-configuration in the PF. Allow the driver to query and use DCB
configuration from firmware when firmware DCBx agent is in CEE mode.
Add a check whether LLDP Agent's default AdminStatus is enabled or
disabled on a given port, and sets DCBx status to disabled if the
status is disabled. Fix an issue when the port TC configuration
changes as a result of DCBx and the driver modifies the enabled TCs for
the VEBs it manages but does not update the enabled_tc value that
was cached on a per VEB basis. Add a new PF state so that if a port's
Tx is in suspended state the Tx queue disable flow would just put the
request for the queue to be disabled and return without waiting for the
queue to be actually disabled. Allows the driver to enable/disable
the XPS based on the number of TCs being enabled for the given VSI.
v2: Dropped patch "i40e: Handle a single mss packet with more than 8 frags"
while we rework the patch after we test a bit more based on feedback from
Eric Dumazet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Kirjanov [Mon, 17 Nov 2014 20:07:41 +0000 (23:07 +0300)]
PPC: bpf_jit_comp: Unify BPF_MOD | BPF_X and BPF_DIV | BPF_X
Reduce duplicated code by unifying
BPF_ALU | BPF_MOD | BPF_X and BPF_ALU | BPF_DIV | BPF_X
CC: Alexei Starovoitov<alexei.starovoitov@gmail.com> CC: Daniel Borkmann<dborkman@redhat.com> CC: Philippe Bergheaud<felix@linux.vnet.ibm.com> Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Neerav Parikh [Wed, 12 Nov 2014 00:19:02 +0000 (00:19 +0000)]
i40e: Set XPS bit mask to zero in DCB mode
Due to DCBX configuration change if the VSI needs to use more than 1 TC;
it needs to disable the XPS maps that were set when operating in 1 TC mode.
Without disabling XPS the netdev layer will select queues based on those
settings and not use the TC queue mapping to make the queue selection.
This patch allows the driver to enable/disable the XPS based on the number
of TCs being enabled for the given VSI.
Change-ID: Idc4dec47a672d2a509f6d7fe11ed1ee65b4f0e08 Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-By: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Wed, 12 Nov 2014 00:18:57 +0000 (00:18 +0000)]
i40e: Prevent link flow control settings when PFC is enabled
When PFC is enabled we should not proceed with setting the link flow control
parameters. Also, always report the link flow Tx/Rx settings as off when
PFC is enabled.
Change-ID: Ib09ec58afdf0b2e587ac9d8851a5c80ad58206c4 Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-By: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Wed, 12 Nov 2014 00:18:51 +0000 (00:18 +0000)]
i40e: Do not disable/enable FCoE VSI with DCB reconfig
FCoE VSI Tx queue disable times out when reconfiguring as a result of
DCB TC configuration change event.
The hardware allows us to skip disabling and enabling of Tx queues for
VSIs with single TC enabled. As FCoE VSI is configured to have only
single TC we skip it from disable/enable flow.
Change-ID: Ia73ff3df8785ba2aa3db91e6f2c9005e61ebaec2 Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-By: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Wed, 12 Nov 2014 00:18:46 +0000 (00:18 +0000)]
i40e: Modify Tx disable wait flow in case of DCB reconfiguration
When DCB TC configuration changes the firmware suspends the port's Tx.
Now, as DCB TCs may have changed the PF driver tries to reconfigure the
TC configuration of the VSIs it manages. As part of this process it disables
the VSI queues but the Tx queue disable will not complete as the port's
Tx has been suspended. So, waiting for Tx queues to go to disable state
in this flow may lead to detection of Tx queue disable timeout errors.
Hence, this patch adds a new PF state so that if a port's Tx is in
suspended state the Tx queue disable flow would just put the request for
the queue to be disabled and return without waiting for the queue to be
actually disabled.
Once the VSI(s) TC reconfiguration has been done and driver has called
firmware AQC "Resume PF Traffic" the driver checks the Tx queues requested
to be disabled are actually disabled before re-enabling them again.
Change-ID: If3e03ce4813a4e342dbd5a1eb1d2861e952b7544 Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-By: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Wed, 12 Nov 2014 00:18:41 +0000 (00:18 +0000)]
i40e: Update VEB's enabled_tc after reconfiguration
When the port TC configuration changes as a result of DCBx the driver
modifies the enabled TCs for the VEBs it manages. But, in the process
it did not update the enabled_tc value that it caches on a per VEB basis.
So, when the next reconfiguration event occurs where the number of TC
value is same as the value cached in enabled_tc for a given VEB; driver
does not modify it's TC configuration by calling appropriate AQ command
believing it is running with the same configuration as requested.
Now, as the VEB is not actually enabled for the TCs that are there any
TC configuration command for VSI attached to that VEB with TCs that are
not enabled for the VEB fails.
This patch fixes this issue.
Change-ID: Ife5694469b05494228e0d850429ea1734738cf29 Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-By: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Wed, 12 Nov 2014 00:18:30 +0000 (00:18 +0000)]
i40e: Check for LLDP AdminStatus before querying DCBX
This patch adds a check whether LLDP Agent's default AdminStatus is
enabled or disabled on a given port. If it is disabled then it sets
the DCBX status to disabled as well; and would not query firmware for
any DCBX configuration data.
Change-ID: I73c0b9f0adbf4cae177d14914b20a48c9a8f50fd Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-By: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Wed, 12 Nov 2014 00:18:25 +0000 (00:18 +0000)]
i40e: Add support to firmware CEE DCBX mode
This patch allows i40e driver to query and use DCB configuration from
firmware when firmware DCBX agent is in CEE mode.
Change-ID: I30f92a67eb890f0f024f35339696e6e83d49a274 Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-By: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Wed, 12 Nov 2014 00:18:20 +0000 (00:18 +0000)]
i40e: Resume Port Tx after DCB event
When there are DCB configuration changes based on DCBX the firmware suspends
the port's Tx and generates an event to the PF. The PF is then responsible
to reconfigure the PF VSIs and switching topology as per the updated DCB
configuration and then resume the port's Tx by calling the "Resume Port Tx"
AQ command.
This patch adds this call to the flow that handles DCB re-configuration in
the PF.
Change-ID: I5b860ad48abfbf379b003143c4d3453e2ed5cc1c Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-By: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Tue, 11 Nov 2014 03:15:04 +0000 (03:15 +0000)]
i40e: re-enable VFLR interrupt sooner
VF interrupt processing takes a looooong time, and it's possible that we
could lose a VFLR event if it happens while we're processing a VFLR on
another VF. This would leave the VF in a semi-permanent reset state,
which would not be cleared until yet another VF experiences a VFLR.
To correct this situation, we enable the VFLR interrupt cause before we
begin processing any pending resets. This means that any VFLR that
occurs during reset processing will generate another interrupt and this
routine will get called again.
This change may cause a spurious interrupt when multiple VFLRs occur
very close together in time. If this happens, then this routine will be
called again and it will detect no outstanding VFLR events and do
nothing. No harm, no foul.
Change-ID: Id0451f3e6e73a2cf6db1668296c71e129b59dc19 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Tue, 11 Nov 2014 03:15:03 +0000 (03:15 +0000)]
i40e: only warn once of PTP nonsupport in 100Mbit speed
Only warn once that PTP is not supported when linked at 100Mbit.
Yes, using a static this way means that this once-only message is not
port specific, but once only for the life of the driver, regardless of
the number of ports. That should be plenty.
Change-ID: Ie6476530056df408452e195ef06afd4f57caa4b2 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Roger Quadros [Mon, 17 Nov 2014 12:09:03 +0000 (14:09 +0200)]
net: can: c_can: Add support for TI am4372 DCAN
AM4372 SoC has 2 DCAN modules. Add compatible id and
raminit driver data for it. The driver data is same as AM3352
but this gives us flexibility to add AM4372 specific quirks
if required later.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Roger Quadros [Fri, 14 Nov 2014 15:40:13 +0000 (17:40 +0200)]
can: c_can: Disable pins when CAN interface is down
DRA7 CAN IP suffers from a problem which causes it to be prevented
from fully turning OFF (i.e. stuck in transition) if the module was
disabled while there was traffic on the CAN_RX line.
To work around this issue we select the SLEEP pin state by default
on probe and use the DEFAULT pin state on CAN up and back to the
SLEEP pin state on CAN down.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Roger Quadros [Fri, 7 Nov 2014 14:49:19 +0000 (16:49 +0200)]
can: c_can: Add support for START pulse in RAMINIT sequence
Some SoCs e.g. (TI DRA7xx) need a START pulse to start the
RAMINIT sequence i.e. START bit must be set and cleared before
checking for the DONE bit status.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Roger Quadros [Fri, 14 Nov 2014 15:37:39 +0000 (17:37 +0200)]
can: c_can: Add syscon/regmap RAMINIT mechanism
Some TI SoCs like DRA7 have a RAMINIT register specification
different from the other AMxx SoCs and as expected by the
existing driver.
To add more insanity, this register is shared with other
IPs like DSS, PCIe and PWM.
Provides a more generic mechanism to specify the RAMINIT
register location and START/DONE bit position and use the
syscon/regmap framework to access the register.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Roger Quadros [Fri, 7 Nov 2014 14:49:16 +0000 (16:49 +0200)]
can: c_can: Introduce c_can_driver_data structure
We want to have more data than just can_dev_id to be present
in the driver data e.g. TI platforms need RAMINIT register
description. Introduce the c_can_driver_data structure and move
the can_dev_id into it.
Tidy up the way it is used on probe().
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Roger Quadros [Fri, 7 Nov 2014 14:49:15 +0000 (16:49 +0200)]
can: c_can: Add timeout to c_can_hw_raminit_ti()
TI's RAMINIT DONE mechanism is buggy on AM43xx SoC and may not always
be set after the START bit is set. Although it seems to work fine even
in that case. So add a timeout mechanism to c_can_hw_raminit_wait_ti().
Don't bail out in that failure case but just print an error message.
Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
David S. Miller [Sun, 16 Nov 2014 20:59:19 +0000 (15:59 -0500)]
Merge branch 'rss_key_fill'
Eric Dumazet says:
====================
net: provide common RSS key infrastructure
RSS (Receive Side Scaling) uses a 40 bytes key to provide hash for incoming
packets to select appropriate incoming queue on NIC.
Hash algo (Toeplitz) is also well known and documented by Microsoft
(search for "Verifying the RSS Hash Calculation")
Problem is that some drivers use a well known key.
It makes very easy for attackers to target one particular RX queue,
knowing that number of RX queues is a power of two, or at least some
small number.
Other drivers use a random value per port, making difficult
tuning on bonding setups.
Lets add a common infrastructure, so that host gets an unique
RSS key, and drivers do not have to worry about this.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 Nov 2014 14:23:15 +0000 (06:23 -0800)]
ixgbe: use netdev_rss_key_fill() helper
Use of well known RSS key increases attack surface.
Switch to a random one, using generic helper so that all
ports share a common key.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 Nov 2014 14:23:14 +0000 (06:23 -0800)]
igb: use netdev_rss_key_fill() helper
Use of well known RSS key increases attack surface.
Switch to a random one, using generic helper so that all
ports share a common key.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 Nov 2014 14:23:13 +0000 (06:23 -0800)]
i40e: use netdev_rss_key_fill() helper
Use of well known RSS key increases attack surface.
Switch to a random one, using generic helper so that all
ports share a common key.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 Nov 2014 14:23:12 +0000 (06:23 -0800)]
fm10k: use netdev_rss_key_fill() helper
Use of well known RSS key increases attack surface.
Switch to a random one, using generic helper so that all
ports share a common key.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 Nov 2014 14:23:11 +0000 (06:23 -0800)]
e100e: use netdev_rss_key_fill() helper
Use of well known RSS key increases attack surface.
Switch to a random one, using generic helper so that all
ports share a common key.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 Nov 2014 14:23:10 +0000 (06:23 -0800)]
be2net:use netdev_rss_key_fill() helper
Use netdev_rss_key_fill() helper, as it provides better support for some
bonding setups.
Rename rss_hkey local variable to rss_key to have consistent name among
drivers.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sathya Perla <sathya.perla@emulex.com> Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com> Cc: Ajit Khaparde <ajit.khaparde@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 Nov 2014 14:23:08 +0000 (06:23 -0800)]
tg3: use netdev_rss_key_fill() helper
Use of well known RSS key increases attack surface.
Switch to a random one, using generic helper so that all
ports share a common key.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Prashant Sreedharan <prashant@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 Nov 2014 14:23:05 +0000 (06:23 -0800)]
net: provide a per host RSS key generic infrastructure
RSS (Receive Side Scaling) typically uses Toeplitz hash and a 40 or 52 bytes
RSS key.
Some drivers use a constant (and well known key), some drivers use a random
key per port, making bonding setups hard to tune. Well known keys increase
attack surface, considering that number of queues is usually a power of two.
This patch provides infrastructure to help drivers doing the right thing.
netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
even if they provide ethtool -X support to let user redefine the key later.
A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
RSS key even for drivers not providing ethtool -x support, in case some
applications want to precisely setup flows to match some RX queues.
David S. Miller [Sun, 16 Nov 2014 20:48:01 +0000 (15:48 -0500)]
Merge branch 'mv88e6171_temps'
Andrew Lunn says:
====================
Add temperature reading and registers dump to mv88e6171
These patches centralize the temperature sensor reading code, and then
make use of it with the mv88e6171 which has a compatible
sensor. Additionally, support is added for reading the mv88e6171 via
ethtool.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 15 Nov 2014 21:24:52 +0000 (22:24 +0100)]
net: dsa: mv88e6171: Add support for reading the temperature
This chip also has a temperature sensor which can be read using the
common code. In order to use it, add the needed mutex protection for
accessing registers via the shared code.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Sat, 15 Nov 2014 21:24:51 +0000 (22:24 +0100)]
net: dsa: Centralise code for reading the temperature sensor
The method to read the temperature used in the mve6123_61_65 driver
can also be used for other chips. Move the code into the shared code
base of mv88e6xxx.c.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Holger Brunck [Fri, 14 Nov 2014 17:33:19 +0000 (18:33 +0100)]
tipc: allow one link per bearer to neighboring nodes
There is no reason to limit the amount of possible links to a
neighboring node to 2. If we have more then two bearers we can also
establish more links.
Signed-off-by: Holger Brunck <holger.brunck@keymile.com> Reviewed-By: Jon Maloy <jon.maloy@ericsson.com>
cc: Ying Xue <ying.xue@windriver.com>
cc: Erik Hugne <erik.hugne@ericsson.com>
cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e5a2c899 introduced an alternative_call, arch_fast_hash2,
that selects between __jhash2 and __intel_crc4_2_hash based on the
X86_FEATURE_XMM4_2.
Unfortunately, the alternative_call system does not appear to be
suitable for use with C functions, as register usage is not handled
properly for the called functions. The __jhash2 function in particular
clobbers registers that are not preserved when called via
alternative_call, resulting in a panic for direct callers of
arch_fast_hash2 on older CPUs lacking sse4_2. It is possible that
__intel_crc4_2_hash works merely by chance because it uses fewer
registers.
This commit was suggested as the source of the problem by Jesse
Gross <jesse@nicira.com>.
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
1) sunhme driver lacks DMA mapping error checks, based upon a report by
Meelis Roos.
2) Fix memory leak in mvpp2 driver, from Sudip Mukherjee.
3) DMA memory allocation sizes are wrong in systemport ethernet driver,
fix from Florian Fainelli.
4) Fix use after free in mac80211 defragmentation code, from Johannes
Berg.
5) Some networking uapi headers missing from Kbuild file, from Stephen
Hemminger.
6) TUN driver gets csum_start offset wrong when VLAN accel is enabled,
and macvtap has a similar bug, from Herbert Xu.
7) Adjust several tunneling drivers to set dev->iflink after registry,
because registry sets that to -1 overwriting whatever we did. From
Steffen Klassert.
8) Geneve forgets to set inner tunneling type, causing GSO segmentation
to fail on some NICs. From Jesse Gross.
9) Fix several locking bugs in stmmac driver, from Fabrice Gasnier and
Giuseppe CAVALLARO.
10) Fix spurious timeouts with NewReno on low traffic connections, from
Marcelo Leitner.
11) Fix descriptor updates in enic driver, from Govindarajulu
Varadarajan.
12) PPP calls bpf_prog_create() with locks held, which isn't kosher.
Fix from Takashi Iwai.
13) Fix NULL deref in SCTP with malformed INIT packets, from Daniel
Borkmann.
14) psock_fanout selftest accesses past the end of the mmap ring, fix
from Shuah Khan.
15) Fix PTP timestamping for VLAN packets, from Richard Cochran.
16) netlink_unbind() calls in netlink pass wrong initial argument, from
Hiroaki SHIMODA.
17) vxlan socket reuse accidently reuses a socket when the address
family is different, so we have to explicitly check this, from
Marcelo Lietner.
18) Fix missing include in nft_reject_bridge.c breaking the build on ppc
and other architectures, from Guenter Roeck.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (75 commits)
vxlan: Do not reuse sockets for a different address family
smsc911x: power-up phydev before doing a software reset.
lib: rhashtable - Remove weird non-ASCII characters from comments
net/smsc911x: Fix delays in the PHY enable/disable routines
net/smsc911x: Fix rare soft reset timeout issue due to PHY power-down mode
netlink: Properly unbind in error conditions.
net: ptp: fix time stamp matching logic for VLAN packets.
cxgb4 : dcb open-lldp interop fixes
selftests/net: psock_fanout seg faults in sock_fanout_read_ring()
net: bcmgenet: apply MII configuration in bcmgenet_open()
net: bcmgenet: connect and disconnect from the PHY state machine
net: qualcomm: Fix dependency
ixgbe: phy: fix uninitialized status in ixgbe_setup_phy_link_tnx
net: phy: Correctly handle MII ioctl which changes autonegotiation.
ipv6: fix IPV6_PKTINFO with v4 mapped
net: sctp: fix memory leak in auth key management
net: sctp: fix NULL pointer dereference in af->from_addr_param on malformed packet
net: ppp: Don't call bpf_prog_create() in ppp_lock
net/mlx4_en: Advertize encapsulation offloads features only when VXLAN tunnel is set
cxgb4 : Fix bug in DCB app deletion
...
Linus Torvalds [Fri, 14 Nov 2014 00:57:25 +0000 (16:57 -0800)]
Merge branch 'akpm' (fixes from Andrew Morton)
Merge misc fixes from Andrew Morton:
"15 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add IIO include files
kernel/panic.c: update comments for print_tainted
mem-hotplug: reset node present pages when hot-adding a new pgdat
mem-hotplug: reset node managed pages when hot-adding a new pgdat
mm/debug-pagealloc: correct freepage accounting and order resetting
fanotify: fix notification of groups with inode & mount marks
mm, compaction: prevent infinite loop in compact_zone
mm: alloc_contig_range: demote pages busy message from warn to info
mm/slab: fix unalignment problem on Malta with EVA due to slab merge
mm/page_alloc: restrict max order of merging on isolated pageblock
mm/page_alloc: move freepage counting logic to __free_one_page()
mm/page_alloc: add freepage on isolate pageblock to correct buddy list
mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
mm/compaction: skip the range until proper target pageblock is met
zram: avoid kunmap_atomic() of a NULL pointer
Linus Torvalds [Fri, 14 Nov 2014 00:36:42 +0000 (16:36 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
"There is an overflow bug fix for cephfs from Zheng, a fix for handling
large authentication ticket buffers in libceph from Ilya, and a few
fixes for the request handling code from Ilya that affect RBD volumes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: change from BUG to WARN for __remove_osd() asserts
libceph: clear r_req_lru_item in __unregister_linger_request()
libceph: unlink from o_linger_requests when clearing r_osd
libceph: do not crash on large auth tickets
ceph: fix flush tid comparision
Linus Torvalds [Fri, 14 Nov 2014 00:19:14 +0000 (16:19 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID fixes from Jiri Kosina:
- fix for an oops in HID core upon repeated subdriver insertion/removal
under certain circumstances, by Benjamin Tissoires
- quirk for another Elan Touchscreen device, by Adel Gadllah
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: core: cleanup .claimed field on disconnect
HID: usbhid: enable always-poll quirk for Elan Touchscreen 0103
Xie XiuQi [Thu, 13 Nov 2014 23:19:44 +0000 (15:19 -0800)]
kernel/panic.c: update comments for print_tainted
Commit 69361eef9056 ("panic: add TAINT_SOFTLOCKUP") added the 'L' flag,
but failed to update the comments for print_tainted(). So, update the
comments.
Tang Chen [Thu, 13 Nov 2014 23:19:41 +0000 (15:19 -0800)]
mem-hotplug: reset node present pages when hot-adding a new pgdat
When memory is hot-added, all the memory is in offline state. So clear
all zones' present_pages because they will be updated in online_pages()
and offline_pages(). Otherwise, /proc/zoneinfo will corrupt:
Tang Chen [Thu, 13 Nov 2014 23:19:39 +0000 (15:19 -0800)]
mem-hotplug: reset node managed pages when hot-adding a new pgdat
In free_area_init_core(), zone->managed_pages is set to an approximate
value for lowmem, and will be adjusted when the bootmem allocator frees
pages into the buddy system.
But free_area_init_core() is also called by hotadd_new_pgdat() when
hot-adding memory. As a result, zone->managed_pages of the newly added
node's pgdat is set to an approximate value in the very beginning.
Even if the memory on that node has node been onlined,
/sys/device/system/node/nodeXXX/meminfo has wrong value:
This patch fixes this problem by reset node managed pages to 0 after
hot-adding a new node.
1. Move reset_managed_pages_done from reset_node_managed_pages() to
reset_all_zones_managed_pages()
2. Make reset_node_managed_pages() non-static
3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat
is initialized
Joonsoo Kim [Thu, 13 Nov 2014 23:19:36 +0000 (15:19 -0800)]
mm/debug-pagealloc: correct freepage accounting and order resetting
One thing I did in this patch is fixing freepage accounting. If we
clear guard page and link it onto isolate buddy list, we should not
increase freepage count. This patch adds conditional branch to skip
counting in this case. Without this patch, this overcounting happens
frequently if guard order is set and CMA is used.
Another thing fixed in this patch is the target to reset order. In
__free_one_page(), we check the buddy page whether it is a guard page or
not. And, if so, we should clear guard attribute on the buddy page and
reset order of it to 0. But, current code resets original page's order
rather than buddy one's. Maybe, this doesn't have any problem, because
whole merged page's order will be re-assigned soon. But, it is better
to correct code.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Gioh Kim <gioh.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 13 Nov 2014 23:19:33 +0000 (15:19 -0800)]
fanotify: fix notification of groups with inode & mount marks
fsnotify() needs to merge inode and mount marks lists when notifying
groups about events so that ignore masks from inode marks are reflected
in mount mark notifications and groups are notified in proper order
(according to priorities).
Currently the sorting of the lists done by fsnotify_add_inode_mark() /
fsnotify_add_vfsmount_mark() and fsnotify() differed which resulted
ignore masks not being used in some cases.
Fix the problem by always using the same comparison function when
sorting / merging the mark lists.
Thanks to Heinrich Schuchardt for improvements of my patch.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=87721 Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Thu, 13 Nov 2014 23:19:30 +0000 (15:19 -0800)]
mm, compaction: prevent infinite loop in compact_zone
Several people have reported occasionally seeing processes stuck in
compact_zone(), even triggering soft lockups, in 3.18-rc2+.
Testing a revert of commit e14c720efdd7 ("mm, compaction: remember
position within pageblock in free pages scanner") fixed the issue,
although the stuck processes do not appear to involve the free scanner.
Finally, by code inspection, the bug was found in isolate_migratepages()
which uses a slightly different condition to detect if the migration and
free scanners have met, than compact_finished(). That has not been a
problem until commit e14c720efdd7 allowed the free scanner position
between individual invocations to be in the middle of a pageblock.
In a relatively rare case, the migration scanner position can end up at
the beginning of a pageblock, with the free scanner position in the
middle of the same pageblock. If it's the migration scanner's turn,
isolate_migratepages() exits immediately (without updating the
position), while compact_finished() decides to continue compaction,
resulting in a potentially infinite loop. The system can recover only
if another process creates enough high-order pages to make the watermark
checks in compact_finished() pass.
This patch fixes the immediate problem by bumping the migration
scanner's position to meet the free scanner in isolate_migratepages(),
when both are within the same pageblock. This causes compact_finished()
to terminate properly. A more robust check in compact_finished() is
planned as a cleanup for better future maintainability.
Fixes: e14c720efdd73 ("mm, compaction: remember position within pageblock in free pages scanner) Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: P. Christeas <xrg@linux.gr> Tested-by: P. Christeas <xrg@linux.gr> Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2 Reported-by: Norbert Preining <preining@logic.at> Tested-by: Norbert Preining <preining@logic.at> Link: https://lkml.org/lkml/2014/11/4/904 Reported-by: Pavel Machek <pavel@ucw.cz> Link: https://lkml.org/lkml/2014/11/7/164 Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: alloc_contig_range: demote pages busy message from warn to info
Having test_pages_isolated failure message as a warning confuses users
into thinking that it is more serious than it really is. In reality, if
called via CMA, allocation will be retried so a single
test_pages_isolated failure does not prevent allocation from succeeding.
Demote the warning message to an info message and reformat it such that
the text "failed" does not appear and instead a less worrying "PFNS
busy" is used.
This message is trivially reproducible on a 10GB x86 machine on 3.16.y
kernels configured with CONFIG_DMA_CMA.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 13 Nov 2014 23:19:25 +0000 (15:19 -0800)]
mm/slab: fix unalignment problem on Malta with EVA due to slab merge
Unlike SLUB, sometimes, object isn't started at the beginning of the
slab in SLAB. This causes the unalignment problem after slab merging is
supported by commit 12220dea07f1 ("mm/slab: support slab merge").
Following is the report from Markos that fail to boot on Malta with EVA.
alloc_unbound_pwq() allocates slab object from pool_workqueue. This
kmem_cache requires 256 bytes alignment, but, current merging code
doesn't honor that, and merge it with kmalloc-256. kmalloc-256 requires
only cacheline size alignment so that above failure occurs. However, in
x86, kmalloc-256 is luckily aligned in 256 bytes, so the problem didn't
happen on it.
To fix this problem, this patch introduces alignment mismatch check in
find_mergeable(). This will fix the problem.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Markos Chandras <Markos.Chandras@imgtec.com> Tested-by: Markos Chandras <Markos.Chandras@imgtec.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 13 Nov 2014 23:19:21 +0000 (15:19 -0800)]
mm/page_alloc: restrict max order of merging on isolated pageblock
Current pageblock isolation logic could isolate each pageblock
individually. This causes freepage accounting problem if freepage with
pageblock order on isolate pageblock is merged with other freepage on
normal pageblock. We can prevent merging by restricting max order of
merging to pageblock order if freepage is on isolate pageblock.
A side-effect of this change is that there could be non-merged buddy
freepage even if finishing pageblock isolation, because undoing
pageblock isolation is just to move freepage from isolate buddy list to
normal buddy list rather than to consider merging. So, the patch also
makes undoing pageblock isolation consider freepage merge. When
un-isolation, freepage with more than pageblock order and it's buddy are
checked. If they are on normal pageblock, instead of just moving, we
isolate the freepage and free it in order to get merged.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Heesub Shin <heesub.shin@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Ritesh Harjani <ritesh.list@gmail.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 13 Nov 2014 23:19:18 +0000 (15:19 -0800)]
mm/page_alloc: move freepage counting logic to __free_one_page()
All the caller of __free_one_page() has similar freepage counting logic,
so we can move it to __free_one_page(). This reduce line of code and
help future maintenance.
This is also preparation step for "mm/page_alloc: restrict max order of
merging on isolated pageblock" which fix the freepage counting problem
on freepage with more than pageblock order.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Heesub Shin <heesub.shin@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Ritesh Harjani <ritesh.list@gmail.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 13 Nov 2014 23:19:14 +0000 (15:19 -0800)]
mm/page_alloc: add freepage on isolate pageblock to correct buddy list
In free_pcppages_bulk(), we use cached migratetype of freepage to
determine type of buddy list where freepage will be added. This
information is stored when freepage is added to pcp list, so if
isolation of pageblock of this freepage begins after storing, this
cached information could be stale. In other words, it has original
migratetype rather than MIGRATE_ISOLATE.
There are two problems caused by this stale information.
One is that we can't keep these freepages from being allocated.
Although this pageblock is isolated, freepage will be added to normal
buddy list so that it could be allocated without any restriction. And
the other problem is incorrect freepage accounting. Freepages on
isolate pageblock should not be counted for number of freepage.
Following is the code snippet in free_pcppages_bulk().
/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
__free_one_page(page, page_to_pfn(page), zone, 0, mt);
trace_mm_page_pcpu_drain(page, 0, mt);
if (likely(!is_migrate_isolate_page(page))) {
__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
if (is_migrate_cma(mt))
__mod_zone_page_state(zone, NR_FREE_CMA_PAGES, 1);
}
As you can see above snippet, current code already handle second
problem, incorrect freepage accounting, by re-fetching pageblock
migratetype through is_migrate_isolate_page(page).
But, because this re-fetched information isn't used for
__free_one_page(), first problem would not be solved. This patch try to
solve this situation to re-fetch pageblock migratetype before
__free_one_page() and to use it for __free_one_page().
In addition to move up position of this re-fetch, this patch use
optimization technique, re-fetching migratetype only if there is isolate
pageblock. Pageblock isolation is rare event, so we can avoid
re-fetching in common case with this optimization.
This patch also correct migratetype of the tracepoint output.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Heesub Shin <heesub.shin@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Ritesh Harjani <ritesh.list@gmail.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 13 Nov 2014 23:19:11 +0000 (15:19 -0800)]
mm/page_alloc: fix incorrect isolation behavior by rechecking migratetype
Before describing bugs itself, I first explain definition of freepage.
1. pages on buddy list are counted as freepage.
2. pages on isolate migratetype buddy list are *not* counted as freepage.
3. pages on cma buddy list are counted as CMA freepage, too.
Now, I describe problems and related patch.
Patch 1: There is race conditions on getting pageblock migratetype that
it results in misplacement of freepages on buddy list, incorrect
freepage count and un-availability of freepage.
Patch 2: Freepages on pcp list could have stale cached information to
determine migratetype of buddy list to go. This causes misplacement of
freepages on buddy list and incorrect freepage count.
Patch 4: Merging between freepages on different migratetype of
pageblocks will cause freepages accouting problem. This patch fixes it.
Without patchset [3], above problem doesn't happens on my CMA allocation
test, because CMA reserved pages aren't used at all. So there is no
chance for above race.
With patchset [3], I did simple CMA allocation test and get below
result:
- Virtual machine, 4 cpus, 1024 MB memory, 256 MB CMA reservation
- run kernel build (make -j16) on background
- 30 times CMA allocation(8MB * 30 = 240MB) attempts in 5 sec interval
- Result: more than 5000 freepage count are missed
With patchset [3] and this patchset, I found that no freepage count are
missed so that I conclude that problems are solved.
On my simple memory offlining test, these problems also occur on that
environment, too.
This patch (of 4):
There are two paths to reach core free function of buddy allocator,
__free_one_page(), one is free_one_page()->__free_one_page() and the
other is free_hot_cold_page()->free_pcppages_bulk()->__free_one_page().
Each paths has race condition causing serious problems. At first, this
patch is focused on first type of freepath. And then, following patch
will solve the problem in second type of freepath.
In the first type of freepath, we got migratetype of freeing page
without holding the zone lock, so it could be racy. There are two cases
of this race.
1. pages are added to isolate buddy list after restoring orignal
migratetype
CPU1 CPU2
get migratetype => return MIGRATE_ISOLATE
call free_one_page() with MIGRATE_ISOLATE
grab the zone lock
unisolate pageblock
release the zone lock
grab the zone lock
call __free_one_page() with MIGRATE_ISOLATE
freepage go into isolate buddy list,
although pageblock is already unisolated
This may cause two problems. One is that we can't use this page anymore
until next isolation attempt of this pageblock, because freepage is on
isolate buddy list. The other is that freepage accouting could be wrong
due to merging between different buddy list. Freepages on isolate buddy
list aren't counted as freepage, but ones on normal buddy list are
counted as freepage. If merge happens, buddy freepage on normal buddy
list is inevitably moved to isolate buddy list without any consideration
of freepage accouting so it could be incorrect.
2. pages are added to normal buddy list while pageblock is isolated.
It is similar with above case.
This also may cause two problems. One is that we can't keep these
freepages from being allocated. Although this pageblock is isolated,
freepage would be added to normal buddy list so that it could be
allocated without any restriction. And the other problem is same as
case 1, that it, incorrect freepage accouting.
This race condition would be prevented by checking migratetype again
with holding the zone lock. Because it is somewhat heavy operation and
it isn't needed in common case, we want to avoid rechecking as much as
possible. So this patch introduce new variable, nr_isolate_pageblock in
struct zone to check if there is isolated pageblock. With this, we can
avoid to re-check migratetype in common case and do it only if there is
isolated pageblock or migratetype is MIGRATE_ISOLATE. This solve above
mentioned problems.
Changes from v3:
Add one more check in free_one_page() that checks whether migratetype is
MIGRATE_ISOLATE or not. Without this, abovementioned case 1 could happens.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Heesub Shin <heesub.shin@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Ritesh Harjani <ritesh.list@gmail.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Thu, 13 Nov 2014 23:19:07 +0000 (15:19 -0800)]
mm/compaction: skip the range until proper target pageblock is met
Commit 7d49d8868336 ("mm, compaction: reduce zone checking frequency in
the migration scanner") has a side-effect that changes the iteration
range calculation. Before the change, block_end_pfn is calculated using
start_pfn, but now it blindly adds pageblock_nr_pages to the previous
value.
This causes the problem that isolation_start_pfn is larger than
block_end_pfn when we isolate the page with more than pageblock order.
In this case, isolation would fail due to an invalid range parameter.
To prevent this, this patch implements skipping the range until a proper
target pageblock is met. Without this patch, CMA with more than
pageblock order always fails but with this patch it will succeed.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Weijie Yang [Thu, 13 Nov 2014 23:19:05 +0000 (15:19 -0800)]
zram: avoid kunmap_atomic() of a NULL pointer
zram could kunmap_atomic() a NULL pointer in a rare situation: a zram
page becomes a full-zeroed page after a partial write io. The current
code doesn't handle this case and performs kunmap_atomic() on a NULL
pointer, which panics the kernel.
This patch fixes this issue.
Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Weijie Yang <weijie.yang.kh@gmail.com> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Dumazet [Thu, 13 Nov 2014 17:45:22 +0000 (09:45 -0800)]
tcp: limit GSO packets to half cwnd
In DC world, GSO packets initially cooked by tcp_sendmsg() are usually
big, as sk_pacing_rate is high.
When network is congested, cwnd can be smaller than the GSO packets
found in socket write queue. tcp_write_xmit() splits GSO packets
using the available cwnd, and we end up sending a single GSO packet,
consuming all available cwnd.
With GRO aggregation on the receiver, we might handle a single GRO
packet, sending back a single ACK.
1) This single ACK might be lost
TLP or RTO are forced to attempt a retransmit.
2) This ACK releases a full cwnd, sender sends another big GSO packet,
in a ping pong mode.
This behavior does not fill the pipes in the best way, because of
scheduling artifacts.
Make sure we always have at least two GSO packets in flight.
This allows us to safely increase GRO efficiency without risking
spurious retransmits.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>