git.proxmox.com Git - mirror

tests: Make OVS_WAIT_UNTIL and OVS_WAIT_WHILE failures easier to debug.

Until now, when OVS_WAIT_UNTIL or OVS_WAIT_WHILE ran, little information
was available: usually nothing at all in the log, unless the wait failed,
in which case there was a line number. This commit always adds a note
saying what is being waited for, and adds a message saying that the wait
failed if it does.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

ovs-router: fix router entry cast

The offsetof(struct ovs_router_entry, cr) should always be 0,
thus the else statement should never be reached.
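A minimal sketch of why the offset-0 case is the only one that matters. It assumes a simplified stand-in for struct ovs_router_entry in which the embedded classifier rule `cr` is the first member. The `struct cls_rule` here is a placeholder, and the cast helper is hypothetical. A compile-time check pins the layout, so a branch handling a nonzero offset can never be taken:

```c
#include <stddef.h>

/* Placeholder for the classifier rule type; not the real OVS definition. */
struct cls_rule {
    int priority;
};

/* Simplified stand-in for struct ovs_router_entry: 'cr' must come first. */
struct ovs_router_entry {
    struct cls_rule cr;
    int plen;
};

/* Fails to compile if 'cr' is ever moved away from offset 0. */
_Static_assert(offsetof(struct ovs_router_entry, cr) == 0,
               "cr must be the first member");

/* Converts a classifier rule back to its enclosing router entry.  Because
 * 'cr' is at offset 0, the rule's address is the entry's address, so a
 * plain cast suffices (the same result a container-of macro would give). */
static struct ovs_router_entry *
ovs_router_entry_cast(const struct cls_rule *cr)
{
    return (struct ovs_router_entry *) cr;
}
```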

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

doc: Added OVS Conntrack tutorial

OVS supports connection-tracker-related match fields and actions.
This commit adds a tutorial that demonstrates basic use cases for some of
these match fields and actions.

Signed-off-by: Ashish Varma <ashishvarma.ovs@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Add unixctl option for ovn-northd

Signed-off-by: Venkata Anil <vkommadi@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

learn: improve test case

The current learn test cases use only ovs-ofctl add/del flows.
This patch adds a new test case for the learn action with the
delete_learned and limit options enabled.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Merge branch 'dpdk_merge' of https://github.com/istokes/ovs into HEAD

xlate: fix packets loopback caused by duplicate read of xcfgp.

Some functions, such as xlate_normal_mcast_send_mrouters(), test xbundle
pointer equality to avoid sending a packet back to its input bundle.
However, xbundle pointers for the same port taken from different xcfgp
snapshots are not equal, which may cause the packet to loop back.

This commit stores xcfgp in ctx once, at the start, and always uses that
same xcfgp while processing a single packet.
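A minimal sketch of the failure mode and the fix. It assumes a simplified model in which each configuration snapshot owns its own copy of every bundle. The `struct xcfg`, `struct xbundle`, `struct xlate_ctx`, and helper names here are hypothetical simplifications. Looking up the same port through two snapshots yields two different pointers, so the equality test only works when one snapshot is pinned for the whole packet:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified bundle and configuration snapshot types. */
struct xbundle {
    char name[16];
};

struct xcfg {
    struct xbundle bundles[4];
    size_t n_bundles;
};

/* Looks up a bundle by name within one snapshot. */
static struct xbundle *
xcfg_lookup(struct xcfg *cfg, const char *name)
{
    for (size_t i = 0; i < cfg->n_bundles; i++) {
        if (!strcmp(cfg->bundles[i].name, name)) {
            return &cfg->bundles[i];
        }
    }
    return NULL;
}

/* Per-packet context that records the snapshot once, up front. */
struct xlate_ctx {
    struct xcfg *xcfg;
};

/* True if 'out' is the packet's input bundle 'in', in which case the packet
 * must not be sent back out.  Both lookups use ctx->xcfg, so the same port
 * always yields the same pointer. */
static bool
is_input_bundle(struct xlate_ctx *ctx, const char *in, const char *out)
{
    struct xbundle *in_b = xcfg_lookup(ctx->xcfg, in);
    struct xbundle *out_b = xcfg_lookup(ctx->xcfg, out);
    return in_b && in_b == out_b;
}
```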

Signed-off-by: Huanle Han <hanxueluo@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ofctrl: Remove unused declaration.

Signed-off-by: Han Zhou <zhouhan@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-nbctl: update manpage for lsp-set-type.

Signed-off-by: Han Zhou <zhouhan@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovs-vswitchd: Avoid or suppress memory leak warning for glibc aio.

The asynchronous IO library in glibc starts threads that show up as memory
leaks in valgrind. This commit attempts to avoid the warnings by flushing
all the asynchronous I/O to the log file before exiting. This only does
part of the job for glibc since it keeps the threads around for some
undefined idle time before killing them, so in addition this commit adds a
valgrind suppression to stop displaying these warnings in any case.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: William Tu <u9012063@gmai.com>

ovs-vswitchd: Fire RCU callbacks before exit to reduce memory leak warnings.

ovs-vswitchd makes extensive use of RCU to defer freeing memory past the
latest time that it could be in use by a thread.  Until now, ovs-vswitchd
has not waited for RCU callbacks to fire before exiting.  This meant that
in many cases, when ovs-vswitchd exits, many blocks of memory are stuck in
RCU callback queues, which valgrind often reports as "possible" memory
leaks.

This commit adds a new function ovsrcu_exit() that waits and fires as many
RCU callbacks as it reasonably can.  It can only do so for the thread that
calls it and the thread that calls the callbacks, but generally speaking
ovs-vswitchd shuts down other threads before it exits anyway, so this is
pretty good.
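A minimal single-threaded sketch of "fire what is queued before exit". It assumes a simple linked queue of postponed callbacks, and `defer_call()` and `defer_exit()` are hypothetical names. This is not OVS's RCU implementation, which has per-thread queues and grace periods. Draining the queue runs the pending frees, so a leak checker no longer sees those blocks as outstanding at exit:

```c
#include <stdlib.h>

/* One postponed callback, e.g. a free() deferred past a grace period. */
struct deferred {
    void (*fn)(void *);
    void *arg;
    struct deferred *next;
};

static struct deferred *pending;  /* Callbacks queued but not yet fired. */

/* Queues 'fn(arg)' to run later instead of immediately. */
static void
defer_call(void (*fn)(void *), void *arg)
{
    struct deferred *d = malloc(sizeof *d);
    if (!d) {
        abort();
    }
    d->fn = fn;
    d->arg = arg;
    d->next = pending;
    pending = d;
}

/* Fires every queued callback and returns how many ran.  It loops until the
 * queue stays empty, because a callback may itself defer more work. */
static size_t
defer_exit(void)
{
    size_t n = 0;
    while (pending) {
        struct deferred *batch = pending;
        pending = NULL;
        while (batch) {
            struct deferred *next = batch->next;
            batch->fn(batch->arg);
            free(batch);
            batch = next;
            n++;
        }
    }
    return n;
}
```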

In my testing this eliminates most valgrind warnings for tests that run
ovs-vswitchd.  This ought to make it easier to distinguish new leaks that
are real from existing non-leaks.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: William Tu <u9012063@gmai.com>

util: Document and rely on ovs_assert() always evaluating its argument.

The ovs_assert() macro always evaluates its argument, even when NDEBUG is
defined (in which case a failure is ignored). This behavior wasn't documented, and
thus a lot of code didn't rely on it. This commit documents the behavior
and simplifies bits of code that heretofore didn't rely on it.
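A minimal sketch of an assertion macro with the documented property. `my_assert` is a hypothetical stand-in, not the actual ovs_assert() definition. The condition is evaluated in every build and only the failure handling depends on NDEBUG, so a call with side effects can safely sit inside the argument:

```c
#include <stdio.h>
#include <stdlib.h>

#ifndef NDEBUG
/* Debug builds: evaluate COND and abort with a message if it is false. */
#define my_assert(COND)                                             \
    do {                                                            \
        if (!(COND)) {                                              \
            fprintf(stderr, "%s:%d: assertion %s failed\n",         \
                    __FILE__, __LINE__, #COND);                     \
            abort();                                                \
        }                                                           \
    } while (0)
#else
/* NDEBUG builds: still evaluate COND for its side effects, ignore result. */
#define my_assert(COND) ((void) (COND))
#endif

static int counter;

/* Has a side effect, so it must run whether or not NDEBUG is defined. */
static int
bump(void)
{
    return ++counter;
}
```

With this guarantee, `my_assert(bump() == 1)` can replace the longer pattern of calling `bump()`, storing the result, and asserting on the stored value.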

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

Support accepting and displaying table names in OVS tools.

OpenFlow has little-known support for naming tables.  Open vSwitch has
supported table names for ages, but it has never used or displayed them
outside of commands dedicated to table manipulation.  This commit adds
support for table names in ovs-ofctl.  When a table has a name, it displays
that name in flows and actions, so that, for example, the following:
    table=1, arp, actions=resubmit(,2)
might become:
    table=ingress_acl, arp, actions=resubmit(,mac_learning)
given appropriately named tables.

For backward compatibility, by default only interactive ovs-ofctl commands
display table names; to display them in scripts, use the new --names
option.

This feature was inspired by a talk that Kei Nohguchi presented at the
Open vSwitch 2017 Fall Conference.

CC: Kei Nohguchi <kei@nohguchi.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Mark Michelson <mmichels@redhat.com>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

ofp-util: New data structure for mapping between table names and numbers.

This shares the infrastructure for mapping port names and numbers. It will
be used in an upcoming commit.
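A minimal sketch of a two-way table name/number mapping. It assumes a small fixed array, and the type and function names are hypothetical. The real ofp-util structure reuses the port-name infrastructure, which this sketch does not reproduce. Formatting falls back to the number for an unnamed table, and parsing accepts either form:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical name/number pair for one OpenFlow table. */
struct table_name {
    unsigned char id;
    const char *name;
};

/* Example map; table numbers without an entry have no name. */
static const struct table_name table_map[] = {
    { 1, "ingress_acl" },
    { 2, "mac_learning" },
};
#define N_TABLE_NAMES (sizeof table_map / sizeof table_map[0])

/* Returns the name of table 'id', or NULL if it is unnamed. */
static const char *
table_name_from_id(unsigned char id)
{
    for (size_t i = 0; i < N_TABLE_NAMES; i++) {
        if (table_map[i].id == id) {
            return table_map[i].name;
        }
    }
    return NULL;
}

/* Parses 's' as a table name or a decimal table number (0-254).
 * Returns the table number, or -1 if 's' is neither. */
static int
table_id_from_string(const char *s)
{
    for (size_t i = 0; i < N_TABLE_NAMES; i++) {
        if (!strcmp(table_map[i].name, s)) {
            return table_map[i].id;
        }
    }
    int id;
    char extra;
    if (sscanf(s, "%d%c", &id, &extra) == 1 && id >= 0 && id <= 254) {
        return id;
    }
    return -1;
}

/* Formats table 'id' as its name if it has one, else as its number. */
static void
format_table(unsigned char id, char *buf, size_t size)
{
    const char *name = table_name_from_id(id);
    if (name) {
        snprintf(buf, size, "%s", name);
    } else {
        snprintf(buf, size, "%u", (unsigned) id);
    }
}
```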

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
Acked-by: Mark Michelson <mmichels@redhat.com>

ofp-actions: Make formatting and parsing functions take a struct argument.

An upcoming commit will add another parameter for parsing and formatting
actions. It is much easier to add these parameters if they are
encapsulated in a struct, so this commit first makes that change.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
Acked-by: Mark Michelson <mmichels@redhat.com>

classifier: Refactor interface for classifier_remove().

Until now, classifier_remove() returned either null or the classifier rule
passed to it, which is an unusual interface. This commit changes it to
return true if it succeeds or false on failure.

In addition, most of classifier_remove()'s callers know ahead of time that
it must succeed, even though most of them didn't bother with an assertion,
so this commit adds a classifier_remove_assert() function as a helper.
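A minimal sketch of the new interface shape. It assumes a toy classifier that is just an array of rule pointers, and `toy_classifier` and its functions are hypothetical. The sketch mirrors only the conventions described above: removal returns true on success, and an asserting wrapper serves callers that know removal must succeed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rule {
    int priority;
};

/* Toy classifier: an unordered array of rule pointers. */
struct toy_classifier {
    const struct rule *rules[8];
    size_t n;
};

/* Removes 'rule'.  Returns true if it was present, false otherwise. */
static bool
toy_classifier_remove(struct toy_classifier *cls, const struct rule *rule)
{
    for (size_t i = 0; i < cls->n; i++) {
        if (cls->rules[i] == rule) {
            cls->rules[i] = cls->rules[--cls->n];  /* Swap in the last rule. */
            return true;
        }
    }
    return false;
}

/* For callers that know 'rule' is in 'cls', so removal cannot fail.
 * The call sits outside assert() so that it still runs under NDEBUG. */
static void
toy_classifier_remove_assert(struct toy_classifier *cls,
                             const struct rule *rule)
{
    bool removed = toy_classifier_remove(cls, rule);
    assert(removed);
    (void) removed;
}
```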

Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

netdev-dpdk: Add support for vHost dequeue zero copy (experimental)

Zero copy is disabled by default. To enable it, set the 'dq-zero-copy'
option to 'true' when configuring the Interface:

    ovs-vsctl set Interface dpdkvhostuserclient0 \
        options:vhost-server-path=/tmp/dpdkvhostuserclient0 \
        options:dq-zero-copy=true

When packets from a vHost device with zero copy enabled are destined for
a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
must be set to a smaller value. 128 is recommended. This can be achieved
like so:

    ovs-vsctl set Interface dpdkport options:n_txq_desc=128

Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
to should not exceed 128. Due to this requirement, the feature is
considered 'experimental'.

Testing of the patch showed a ~8% improvement when switching 512B
packets between vHost devices on different VMs on the same host when
zero copy was enabled on the transmitting device.

Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

classifier: Fix typo in comment.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

Merge branch 'dpdk_merge' of https://github.com/istokes/ovs into HEAD

ovs-ofctl: Fix typo in comment.

Signed-off-by: Ben Pfaff <blp@ovn.org>

ovs-ofctl: Add "compose-packet" command for testing flow_compose().

I don't feel obligated to add a bunch of automatic tests for
flow_compose(), but this is handy for manual testing or for simple packet
generation.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

flow: Add some L7 payload data to most L4 protocols that accept it.

This makes traffic generated by flow_compose() look slightly more
realistic. It requires lots of updates to tests, but at least the tests
themselves should be slightly more realistic too.

At the same time, add --l7 and --l7-len options to ofproto/trace to allow
users to specify the amount or contents of payloads that they want.

Suggested-by: Brad Cowie <brad@cowie.nz>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

flow: Simplify flow_compose_l4().

Each of the cases in flow_compose_l4() separately tracked the number of
bytes of L4 data added to the packet. This commit makes the function do
that in a single place without per-protocol bookkeeping.
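A minimal sketch of single-point bookkeeping. It assumes a hypothetical append-only buffer (`struct buf`) with minimal header sizes (TCP 20 bytes, UDP 8, ICMP 4), and it assumes the buffer is large enough. The function records the buffer size before the per-protocol switch and computes the L4 length once afterward. The individual cases do no byte counting:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical append-only packet buffer; callers keep it large enough. */
struct buf {
    unsigned char data[256];
    size_t size;
};

/* Appends 'n' zero bytes and returns a pointer to them. */
static void *
buf_put_zeros(struct buf *b, size_t n)
{
    void *p = &b->data[b->size];
    memset(p, 0, n);
    b->size += n;
    return p;
}

enum l4_proto { L4_TCP, L4_UDP, L4_ICMP };

/* Appends an L4 header plus 'l7_len' payload bytes and returns the total
 * number of L4 bytes added, computed in one place after the switch. */
static size_t
compose_l4(struct buf *b, enum l4_proto proto, size_t l7_len)
{
    size_t start = b->size;

    switch (proto) {
    case L4_TCP:
        buf_put_zeros(b, 20);   /* Minimal TCP header. */
        break;
    case L4_UDP:
        buf_put_zeros(b, 8);    /* UDP header. */
        break;
    case L4_ICMP:
        buf_put_zeros(b, 4);    /* ICMP type, code, checksum. */
        break;
    }
    buf_put_zeros(b, l7_len);   /* L7 payload. */

    return b->size - start;     /* Single point of bookkeeping. */
}
```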

Signed-off-by: Ben Pfaff <blp@ovn.org>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

ofproto-dpif-trace: Generalize syntax for ofproto/trace.

ofproto/trace takes a bunch of options that have weird placement and
syntax. This commit changes the syntax so that the options can be placed
anywhere and consistently use a double-dash option prefix. For
compatibility, the previous syntax is also supported.

An upcoming commit will add new options and this change allows that
upcoming commit to be less confusing.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Yifeng Sun <pkusunyifeng@gmail.com>
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>

ovs-vsctl, vtep-ctl: Free 'args' string on exit.

This avoids a memory leak warning from valgrind.

ovn-sbctl and ovn-nbctl already followed this pattern.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: William Tu <u9012063@gmail.com>

ofproto: Avoid use-after-free on error path in ofproto_flow_mod_learn().

In the case where the learned flow limit has been reached (below_limit ==
false), ofproto_flow_mod_uninit() would unref ofm->temp_rule (which is
also in the 'rule' local variable) before dereferencing rule->flow_cookie
for the log message. This fixes the problem.

(The greatest likely consequence of this bug was logging the wrong cookie
value.)
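A minimal sketch of the bug pattern and its fix. It uses a hypothetical refcounted rule type and a hypothetical `reject_learned_rule()` error path. Any field still needed, here the cookie, is copied into a local before dropping what may be the last reference:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical refcounted rule with a cookie. */
struct rule {
    int ref_cnt;
    uint64_t flow_cookie;
};

/* Drops one reference and frees the rule when the last one goes away. */
static void
rule_unref(struct rule *rule)
{
    if (rule && --rule->ref_cnt == 0) {
        free(rule);
    }
}

/* Error path when the learned-flow limit is reached.  The cookie is read
 * into a local before the unref, because the unref may free 'rule'. */
static uint64_t
reject_learned_rule(struct rule *rule)
{
    uint64_t cookie = rule->flow_cookie;  /* Read while 'rule' is valid. */
    rule_unref(rule);                     /* 'rule' may be freed here. */
    fprintf(stderr, "learned flow limit reached (cookie %#llx)\n",
            (unsigned long long) cookie);
    return cookie;
}
```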

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: William Tu <u9012063@gmail.com>

checkpatch.py: Fix Python style.

Fixes the following warnings:

../utilities/checkpatch.py:219:1: E302 expected 2 blank lines, found 1
../utilities/checkpatch.py:224:1: E302 expected 2 blank lines, found 1
../utilities/checkpatch.py:228:1: E302 expected 2 blank lines, found 1

CC: Justin Pettit <jpettit@ovn.org>
Fixes: 4e99b70dfae0 ("checkpatch.py: Add check for "xxx" in comments.")
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>

netdev-dpdk: Fix xstats leak on port destruction.

CC: Michal Weglicki <michalx.weglicki@intel.com>
Fixes: 971f4b394c6e ("netdev: Custom statistics.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

netdev-dpdk: Fix memory leak in netdev_dpdk_configure_xstats().

CC: Michal Weglicki <michalx.weglicki@intel.com>
Fixes: 971f4b394c6e ("netdev: Custom statistics.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

netdev-dpdk: Fix memory leak in netdev_dpdk_get_custom_stats().

CC: Michal Weglicki <michalx.weglicki@intel.com>
Fixes: 971f4b394c6e ("netdev: Custom statistics.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

vswitchd: show DPDK version

Show DPDK version if Open vSwitch is compiled with DPDK support.
Version can be retrieved with `ovs-vswitchd --version` or from OVS logs.
This commit also makes a small change to ovs-ctl so that it does not
break because of the changed output.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

netdev-dpdk: fix port addition for ports sharing same PCI id

Some NICs have only one PCI address associated with multiple ports. This
patch extends the dpdk-devargs option's format to cater for such devices.

To achieve that, this patch uses a new syntax that will be adopted and
implemented in a future DPDK release (likely v18.05):
    http://dpdk.org/ml/archives/dev/2017-December/084234.html

Since parsing the complete, full syntax is DPDK's job, and this patch is
more likely to serve as an interim workaround, it takes a simpler and
shorter subset of that syntax (note that providing only one category is
allowed):
    class=eth,mac=00:11:22:33:44:55

Backward compatibility is preserved: users can still add a port by its
PCI ID if that is enough for them, so this patch does not break
anything.

This patch is largely based on the one from Ciara:
    https://mail.openvswitch.org/pipermail/ovs-dev/2017-October/339496.html
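A minimal sketch of recognizing the shortened `class=eth,mac=...` form. It handles only that shape and none of the rest of the full DPDK syntax. `parse_eth_mac_devargs()` is a hypothetical helper, not an OVS or DPDK function. Any string that does not match is left for the caller to treat as a PCI address, in line with the compatibility note:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parses "class=eth,mac=XX:XX:XX:XX:XX:XX" into 'mac'.  Returns false for
 * any other string, which the caller would then treat as a PCI address. */
static bool
parse_eth_mac_devargs(const char *devargs, uint8_t mac[6])
{
    unsigned int b[6];
    char extra;

    /* Exactly six conversions: a seventh (%c) means trailing characters. */
    if (sscanf(devargs, "class=eth,mac=%2x:%2x:%2x:%2x:%2x:%2x%c",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &extra) != 6) {
        return false;
    }
    for (int i = 0; i < 6; i++) {
        mac[i] = (uint8_t) b[i];
    }
    return true;
}
```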

Cc: Loftus Ciara <ciara.loftus@intel.com>
Cc: Thomas Monjalon <thomas@monjalon.net>
Cc: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Yuanhan Liu <yliu@fridaylinux.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

netdev-dpdk: Fix requested MTU size validation.

This commit replaces MTU_TO_FRAME_LEN(mtu) with MTU_TO_MAX_FRAME_LEN(mtu)
in netdev_dpdk_set_mtu(), in order to determine if the total length of
the L2 frame with an MTU of ’mtu’ exceeds NETDEV_DPDK_MAX_PKT_LEN.

When setting an MTU, netdev_dpdk_set_mtu() first checks whether the
requested total frame length (which includes the associated L2 overhead)
will exceed the maximum supported frame length. MTU_TO_FRAME_LEN calculates
the frame length as MTU + ETHER_HDR_LEN + ETHER_CRC_LEN. The MTU
for the device is set at a later stage in dpdk_eth_dev_init() using
rte_eth_dev_set_mtu(mtu).

However, in rte_eth_dev_set_mtu(mtu), the calculation used to check that
the frame does not exceed the device's maximum frame length varies
between DPDK device drivers. For example, the ixgbe driver calculates the
frame length for a given MTU as

    mtu + ETHER_HDR_LEN + ETHER_CRC_LEN

the i40e driver calculates it as

    mtu + ETHER_HDR_LEN + ETHER_CRC_LEN + I40E_VLAN_TAG_SIZE * 2

and the em driver calculates it as

    mtu + ETHER_HDR_LEN + ETHER_CRC_LEN + VLAN_TAG_SIZE

Currently, it is possible to set an MTU for a netdev_dpdk device that
exceeds the upper MTU limit of that device's DPDK driver, which leads to a
segfault. This is because the current frame length comparison does not
take into account the VLAN tag overhead that the drivers expect. The
netdev_dpdk_set_mtu() call incorrectly succeeds, but the subsequent
dpdk_eth_dev_init() fails before the queues have been created for the
DPDK device. Combined with assumptions about the netdev's reconfiguration
requirements, this leads to a segfault when the rxq is polled for this
device.

A simple way to avoid this is to use MTU_TO_MAX_FRAME_LEN(mtu) when
validating a requested MTU in netdev_dpdk_set_mtu().
MTU_TO_MAX_FRAME_LEN(mtu) is equivalent to the following:

    mtu + ETHER_HDR_LEN + ETHER_CRC_LEN + (2 * VLAN_HEADER_LEN)

By using MTU_TO_MAX_FRAME_LEN at the netdev_dpdk_set_mtu() stage, OVS
now takes into account the maximum L2 overhead that a DPDK driver could
allow for in its frame size calculation. This lets OVS, rather than the
DPDK driver, flag an error if the frame length exceeds the maximum DPDK
frame length. OVS can then fail gracefully and continue configuring the
port with the default MTU of 1500.

Note: this fix is a workaround. A better approach would be for DPDK
devices to report, on a per-device basis, the maximum MTU value that can
be requested, but this capability is not currently available. A downside
of this patch is that the upper MTU limit is reduced by 8 bytes for DPDK
devices whose drivers do not account for VLAN tags in their frame length
calculations. For example, from the OVS point of view, the upper MTU
limit of ixgbe devices is reduced from 9710 to 9702.

CC: Mark Kavanagh <mark.b.kavanagh@intel.com>
Fixes: 0072e931 ("netdev-dpdk: add support for jumbo frames")
Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Co-authored-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>

ofproto: Fix double-unref of temporary rule when learning.

When ofproto_flow_mod_init() accepts a rule, it takes ownership of it and
either unrefs it on error or transfers ownership to the struct it
initializes on success, but ofproto_flow_mod_init_for_learn() was unref-ing
it a second time if it reported an error.
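A minimal sketch of the ownership rule, with hypothetical names throughout. The init function consumes the caller's reference whether it succeeds or fails. A caller that unrefs again on error would therefore release the object twice. The fixed caller just propagates the error:

```c
#include <stdlib.h>

/* Hypothetical refcounted rule. */
struct rule {
    int ref_cnt;
};

static int n_freed;  /* Counts frees, so a double release is observable. */

static void
rule_unref(struct rule *rule)
{
    if (rule && --rule->ref_cnt == 0) {
        n_freed++;
        free(rule);
    }
}

struct flow_mod {
    struct rule *temp_rule;
};

/* Takes ownership of 'rule': on error it unrefs 'rule', on success it
 * stores the reference in 'fm'.  Either way the caller must not unref it. */
static int
flow_mod_init(struct flow_mod *fm, struct rule *rule, int fail)
{
    if (fail) {
        rule_unref(rule);
        return -1;
    }
    fm->temp_rule = rule;
    return 0;
}

/* Fixed caller: on error, just return, since the rule is already released.
 * The buggy version also called rule_unref(rule) when the init failed. */
static int
flow_mod_init_for_learn(struct flow_mod *fm, struct rule *rule, int fail)
{
    return flow_mod_init(fm, rule, fail);
}
```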

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: William Tu <u9012063@gmail.com>

ovs-atomic: Fix typo in comment.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>

checkpatch.py: Add check for "xxx" in comments.

"xxx" is often used to indicate items that the developer wanted to look
at again before committing. Flag those as a warning.

Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>

Fix incorrect handling of return value.

The variable cookie_offset should be of type 'size_t'.

Signed-off-by: Lili Huang <huanglili.huang@huawei.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

openvswitch/types.h: Drop the member name in initializer macro

The MSVC++ compiler does not allow a struct to be initialized with an
explicitly named (designated) member.

Not allowed:
    static const struct eth_addr a = {{ .ea= { 0xff, 0xff, 0xff, 0xff,
                                        0xff, 0xff }}};

Allowed:
    static const struct eth_addr b  = {{{ 0xff, 0xff, 0xff, 0xff, 0xff,
                                          0xff }}};
*An extra level of curly braces is required for GCC when the struct
contains a union.

Signed-off-by: Shashank Ram <rams@vmware.com>
Tested-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

gre: strip gre-tso offload flags

If GRO is enabled, ipgre may receive a GRE TSO packet. After the GRE
tunnel is popped, the encapsulation and GSO_ENCAP flags should be
stripped; otherwise, the packet is treated as encapsulated again and is
dropped in ovs_iptunnel_handle_offloads.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>

ofproto-dpif-upcall: Fix typo in comment.

Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>

ovn: OVN Support QoS meter

This feature is used to limit the bandwidth of flows, such as floating IP.

ovn-northd changes:
1. Add a bandwidth column to the NB QoS table.
2. Add QOS_METER stages to the logical switch ingress/egress pipelines.
3. Add a set_meter() action to the SB Logical_Flow table.

ovn-controller changes:
Add meter_table so that the meter action can be processed using the
OpenFlow meter table.

Currently, this feature is only supported with DPDK.

Signed-off-by: Guoshuai Li <ligs@dtdream.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-controller: Add extend_table instead of group_table to expand meter.

The group table and the meter table have similar structure and function,
so this commit refactors the group table code so that it can be extended
to add the meter table. The following operations become library functions:
table init/destroy/clear/lookup/remove, assigning IDs to contents, and
moving the contents of the desired table to the existing table.
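A minimal sketch of two shared operations. It assumes a tiny table whose IDs fit in a 64-bit mask. The names are hypothetical, and this is not the ovn-controller extend_table code. Both groups and meters need to allocate an unused ID for new desired contents, and to move desired contents to existing once they are installed:

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_MAX_IDS 64

/* Hypothetical generic ID table usable for both groups and meters. */
struct id_table {
    uint64_t desired;   /* Bit i set: ID i is wanted in this computation. */
    uint64_t existing;  /* Bit i set: ID i is installed in the switch. */
};

/* Allocates the lowest ID (starting at 1) that is neither desired nor
 * existing and marks it desired.  Returns 0 if the table is full. */
static uint32_t
id_table_assign(struct id_table *t)
{
    uint64_t taken = t->desired | t->existing;
    for (uint32_t id = 1; id < TABLE_MAX_IDS; id++) {
        uint64_t bit = UINT64_C(1) << id;
        if (!(taken & bit)) {
            t->desired |= bit;
            return id;
        }
    }
    return 0;
}

/* After installation, the desired contents become the existing contents;
 * existing IDs that were not desired are released. */
static void
id_table_move_desired_to_existing(struct id_table *t)
{
    t->existing = t->desired;
    t->desired = 0;
}

static bool
id_table_is_existing(const struct id_table *t, uint32_t id)
{
    return id < TABLE_MAX_IDS && ((t->existing >> id) & 1);
}
```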

Signed-off-by: Guoshuai Li <ligs@dtdream.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Revert "compat:inet_frag.h: Check for frag_percpu_counter_batch"

This reverts commit 822afef74f5e65af0cdc3916249ce85a70ae7b83.

Requested-at: https://mail.openvswitch.org/pipermail/ovs-dev/2018-January/343674.html
Requested-by: Gregory Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

tc flower: reorder tunnel encap/decap actions

The tc_flower conversion struct does not consider the order of actions.
If an OvS rule matches on a tunnel (decap) and outputs to a new tunnel,
the netlink conversion to TC will add the set tunnel key action before the
unset, leading to an incorrect TC rule. This patch reorders the netlink
generation to ensure a decap is done before an encap if both exist.
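A minimal sketch of the reordering. It assumes a hypothetical flag-based summary of the rule (`struct tc_flower_summary`) and hypothetical action codes standing in for the netlink attributes. However the summary was filled in, the generator emits the tunnel-key unset (decap) before the tunnel-key set (encap), then the output:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what a flower rule needs.  The fields say nothing
 * about order, which is why the generator must impose one. */
struct tc_flower_summary {
    bool unset_tunnel;   /* Decap: the rule matched on a tunnel. */
    bool set_tunnel;     /* Encap: output to a new tunnel. */
    bool output;
};

enum tc_act { TC_ACT_TUNNEL_UNSET, TC_ACT_TUNNEL_SET, TC_ACT_OUTPUT };

/* Writes actions into 'acts' (room for 3) in execution order and returns
 * how many were written. */
static size_t
tc_flower_actions(const struct tc_flower_summary *f, enum tc_act acts[])
{
    size_t n = 0;

    if (f->unset_tunnel) {
        acts[n++] = TC_ACT_TUNNEL_UNSET;   /* Decap first... */
    }
    if (f->set_tunnel) {
        acts[n++] = TC_ACT_TUNNEL_SET;     /* ...then encap... */
    }
    if (f->output) {
        acts[n++] = TC_ACT_OUTPUT;         /* ...then send. */
    }
    return n;
}
```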

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>

docs: Fix formatting in fedora.rst

Fix the rst formatting in fedora.rst so that the commands display
correctly on the web.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

AUTHORS: Add Robert Mulik.

Signed-off-by: Ben Pfaff <blp@ovn.org>

LACP: Check active partner sys id

A reboot of one switch in an MC-LAG bond makes all bond links
go down, causing a total connectivity loss for 3 seconds.

Packet capture shows that spurious LACP PDUs are sent to OVS with
a different MAC address (partner system id) during the final
stages of the MC-LAG switch reboot. The current implementation
ignores the partner sys_id (MAC address).

The code change is based on the following:
- If an interface (lead interface) on a bond has an "attached"
  LACP connection, then any other slave on that bond is allowed
  to become active only when its partner's sys_id is the same as
  the partner's sys_id of the lead interface.
- So, when a slave interface of a bond becomes "current" (it gets
  valid LACP information), it first checks whether there is already
  an active interface on the bond.
- If there is a lead, the slave compares the partner sys_ids
  and becomes active only when they are the same; otherwise it
  remains in the "current" state, but "detached".
- If there is no lead, it follows the old behavior and accepts any
  partner sys_id.
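A minimal sketch of the attach decision listed above. It assumes a simplified slave record with a 6-byte partner system ID, and the struct and function are hypothetical, not the lacp.c code. A "current" slave may attach if there is no lead, or if its partner sys_id matches the lead's:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one bond member. */
struct slave {
    bool current;                 /* Has valid LACP info from its partner. */
    bool attached;                /* Allowed to carry traffic. */
    unsigned char partner_id[6];  /* Partner system ID (MAC address). */
};

/* Decides whether 'slave' may attach, given the bond's current 'lead'
 * (NULL if no slave is attached yet). */
static bool
slave_may_attach(const struct slave *slave, const struct slave *lead)
{
    if (!slave->current) {
        return false;
    }
    if (!lead) {
        return true;    /* No lead: accept any partner sys_id. */
    }
    /* With a lead: attach only if both see the same partner system. */
    return !memcmp(slave->partner_id, lead->partner_id,
                   sizeof slave->partner_id);
}
```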

Signed-off-by: Robert Mulik <robert.mulik@ericsson.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat:inet_frag.h: Check for frag_percpu_counter_batch

Fix up the compat layer to check for frag_percpu_counter_batch and
if not present then use atomic_sub and atomic_add as per the
backport in the 3.16.50 LTS kernel. Fixes compile errors on
3.16 series kernels from 3.16.50 on.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>

tests: Fix non-canonical MAC addresses in ovn.at.

Signed-off-by: Ben Pfaff <blp@ovn.org>

xlate: fix xport lookup for recirc

xlate_lookup() and xlate_lookup_ofproto_() provide in_port and ofproto
based on the xport determined from the flow, which is extracted from the
packet. The lookup can also happen due to recirculation. In that case,
packet_type may have been modified during xlate before recirculation was
triggered, so the lookup fails or returns the wrong xport.
This can be worked around by propagating the xport to ctx->xin after the
very first lookup and storing it in the frozen state of the recirculation.
Then, when a lookup is performed due to recirculation, the xport can be
retrieved from the frozen state.

The packet-type-aware unit tests are updated with a new one to verify
this behavior.

Signed-off-by: Zoltan Balogh <zoltan.balogh@ericsson.com>
CC: Jan Scheurich <jan.scheurich@ericsson.com>
Fixes: beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
Signed-off-by: Ben Pfaff <blp@ovn.org>

ofproto-dpif-xlate: add uuid to xports

This makes it possible to look up an xport by UUID, which will be used
by a later commit.

Signed-off-by: Zoltan Balogh <zoltan.balogh@ericsson.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ofproto-dpif-sflow: Recursively examine actions inside clone.

Until now, dpif_sflow_read_actions() has ignored actions inside clone.
This means that sflow missed tnl_push actions inside clone, which OVS
now uses to avoid tx recirculation. This commit fixes the problem
by making dpif_sflow_read_actions() recursively process actions inside
clone.

In addition, some sFlow data needs to be saved and restored in
ofproto-dpif-xlate when native_tunnel_output() is invoked. Otherwise, the
output action of the underlay bridge is also counted when sFlow is set
on the overlay bridge.

Both bugs are related to sFlow and were introduced by the commit in
the "Fixes:" tag below.
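A minimal sketch of the recursion. It assumes a hypothetical tree-shaped action list in place of the real netlink attributes. When the walker reaches a clone action, it descends into the nested actions instead of skipping them, so a tunnel push inside a clone is counted:

```c
#include <stddef.h>

enum action_type { ACT_OUTPUT, ACT_TNL_PUSH, ACT_CLONE };

/* Hypothetical action: a clone carries its own nested action list. */
struct action {
    enum action_type type;
    const struct action *nested;   /* Only for ACT_CLONE. */
    size_t n_nested;
};

/* Counts tunnel-push actions, including those nested inside clones. */
static size_t
count_tnl_push(const struct action *acts, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        switch (acts[i].type) {
        case ACT_TNL_PUSH:
            count++;
            break;
        case ACT_CLONE:
            count += count_tnl_push(acts[i].nested, acts[i].n_nested);
            break;
        case ACT_OUTPUT:
            break;
        }
    }
    return count;
}
```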

Signed-off-by: Zoltan Balogh <zoltan.balogh@ericsson.com>
CC: Sugesh Chandran <sugesh.chandran@intel.com>
Fixes: 7c12dfc527a5 ("tunneling: Avoid datapath-recirc by combining recirc actions at xlate.")
Signed-off-by: Ben Pfaff <blp@ovn.org>

bridge: Fix custom stats' counters leak.

The caller takes ownership of the allocated array of counters
and must free it.

CC: Michal Weglicki <michalx.weglicki@intel.com>
Fixes: 971f4b394c6e ("netdev: Custom statistics.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-controller: add new external_id 'ovn-cms-options' to Chassis table

This patch makes ovn-controller set the external_ids key
'ovn-cms-options' in its own Chassis table entry, copying its
contents from the same external_ids key in the local Open vSwitch
database.

The idea behind this patch is to allow setting general options
from the CMS Plugin to a particular chassis.

A good example of a use case is when we want to schedule a router
on a chassis from OpenStack. In this case, we may want to exclude
some nodes because they are more likely to be restarted for
maintenance operations or they simply won't have external connectivity.
To do so, the CMS or deployment tool would set external_ids as
follows:

ovs-vsctl set open . external_ids:ovn-cms-options="enable-chassis-as-gw"

Then ovn-controller will write the options to the Chassis table in the
southbound database. The CMS can later read this value in order
to decide which Chassis are eligible to schedule a router on.

Similarly, this new key allows additional options to be specified
for consumption by the CMS.

Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

bfd: Send BFD packets with DSCP CS6

Send BFD packets with TOS value equivalent to DSCP CS6 so that the network
can apply the right QoS for those packets. This can help avoid BFD flaps due
to network congestion.

For a reference on this being the right choice, here is a short
declaration:

http://www.ciscopress.com/articles/article.asp?p=357102&seqNum=4

A long dissertation:

https://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/WAN_and_MAN/QoS_SRND/QoS-SRND-Book/QoSIntro.html

But in a nutshell:

Network engineers create various queue/drop policies based upon precedence.
Routing protocols are considered high priority/high precedence. During link
saturation events, packets will get dropped. By creating an egress policy
where packets marked by CS6 are allowed front-of-the-queue status, one can be
sure that hellos from the various protocols arrive when they need to, without
delay and without loss. On the other hand, if the hellos are dropped as part
of normal traffic operations, then traffic routing will flap, leading to
further congestion and drops.

CS6 is a well-known marker to network engineers. In many vendors' gear, it
is automatically assigned to routing protocol packets.

Since OVS does not perform queuing, and leaves that to the kernel edge
operations, the queue policies can be used to ensure timely egress of the BFD
packets during high utilization events.
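A minimal sketch of the TOS value on an ordinary UDP socket, assuming Linux/POSIX sockets. OVS composes BFD packets itself rather than sending them through a socket like this, so the sketch only illustrates the value. CS6 is DSCP 48 (binary 110000), and DSCP occupies the upper six bits of the IPv4 TOS byte. The TOS value is therefore 48 << 2 = 192 (0xc0):

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define DSCP_CS6 48   /* Class Selector 6: binary 110000. */

/* Converts a 6-bit DSCP value to the IPv4 TOS byte (ECN bits left 0). */
static int
dscp_to_tos(int dscp)
{
    return (dscp & 0x3f) << 2;
}

/* Opens a UDP socket whose outgoing packets carry DSCP CS6.
 * Returns the socket, or -1 on error. */
static int
open_cs6_socket(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -1;
    }
    int tos = dscp_to_tos(DSCP_CS6);
    if (setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof tos) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```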

See also:
https://mail.openvswitch.org/pipermail/ovs-dev/2017-October/339784.html
https://mail.openvswitch.org/pipermail/ovs-dev/2017-October/339785.html

Thanks to Raymond Burkholder <ray@oneunified.net> for much of the above
information.

Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

datapath: add ct_clear action

Upstream commit:
    commit b8226962b1c49c784aeddb9d2fafbf53dfdc2190
    Author: Eric Garver <e@erig.me>
    Date:   Tue Oct 10 16:54:44 2017 -0400

    openvswitch: add ct_clear action

    This adds a ct_clear action for clearing conntrack state. ct_clear is
    currently implemented in OVS userspace, but is not backed by an action
    in the kernel datapath. This is useful for flows that may modify a
    packet tuple after a ct lookup has already occurred.

Signed-off-by: Eric Garver <e@erig.me>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Notes:
   - hunk from include/uapi/linux/openvswitch.h is missing because it
     was added with userspace support in 1fe178d251c8 ("dpif: Add support
     for OVS_ACTION_ATTR_CT_CLEAR")
   - if IP_CT_UNTRACKED is not available use 0 as other nf_ct_set()
     calls do. Since we're setting ct to NULL this is okay.

Signed-off-by: Eric Garver <e@erig.me>
Acked-by: Pravin B Shelar <pshelar@ovn.org>

acinclude: check for IP_CT_UNTRACKED

IP_CT_UNTRACKED is fairly new, but used by the kernel datapath ct_clear
action.

Signed-off-by: Eric Garver <e@erig.me>
Acked-by: Pravin B Shelar <pshelar@ovn.org>

ovsdb-client: Fix memory leaks

These two leaks were reported by valgrind (testing ovsdb-client
backup and restore):

890 (56 direct, 834 indirect) bytes in 1 blocks are definitely lost in loss record 71 of 73
   by 0x42DE22: xcalloc (util.c:103)
   by 0x40DD8C: ovsdb_schema_create (ovsdb.c:34)
   by 0x40E0B5: ovsdb_schema_from_json (ovsdb.c:196)
   by 0x406DA5: fetch_schema (ovsdb-client.c:415)
   by 0x408478: do_restore (ovsdb-client.c:1595)
   by 0x405BCD: main (ovsdb-client.c:170)

2,688 (88 direct, 2,600 indirect) bytes in 1 blocks are definitely lost in loss record 73 of 73
   by 0x42DE84: xmalloc (util.c:120)
   by 0x40E61F: ovsdb_create (ovsdb.c:329)
   by 0x40BA22: ovsdb_file_open__ (file.c:201)
   by 0x40845A: do_restore (ovsdb-client.c:1592)
   by 0x405BCD: main (ovsdb-client.c:170)

Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-northd: Fix memory leak

This leak was reported by valgrind (testing ovn -- IPv6 Neighbor
Solicitation for unknown MAC):

3,027 bytes in 49 blocks are definitely lost in loss record 210 of 218
    by 0x484C84: xrealloc (util.c:131)
    by 0x43CE41: ds_reserve (dynamic-string.c:63)
    by 0x43D29D: ds_put_format_valist (dynamic-string.c:161)
    by 0x43D3A3: ds_put_format (dynamic-string.c:142)
    by 0x412EEF: ovn_port_update_sbrec (ovn-northd.c:1948)
    by 0x4148B4: build_ports (ovn-northd.c:2109)
    by 0x4148B4: ovnnb_db_run.isra.37 (ovn-northd.c:6202)
    by 0x406FE0: main (ovn-northd.c:6854)

Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

pinctrl: Fix memory leak

This leak was reported by valgrind (testing ovn -- 3 HVs, 1 LS, 3 lports/HV):

51,680 (27,968 direct, 23,712 indirect) bytes in 76 blocks are definitely lost in loss record 72 of 72
   at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
   by 0x4A8992: xcalloc (util.c:103)
   by 0x493052: ovsdb_idl_index_init_row (ovsdb-idl.c:2343)
   by 0x413F69: send_ipv6_ras (pinctrl.c:1321)
   by 0x413F69: pinctrl_run (pinctrl.c:1093)
   by 0x407348: main (ovn-controller.c:703)

Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

NEWS: Move ct_clear support to 2.9.0 section.

This feature was backported to 2.9.0.

Signed-off-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Eric Garver <e@erig.me>

netdev-linux: do not send packets to down tap ifaces.

Currently, OVS pushes packets to a TAP interface regardless of its
state. This works because the kernel returns -EIO when the interface
is not UP, and OVS ignores that error because it is not an OVS
issue.

However, this has a large performance impact on broadcasts when
using the userspace datapath accelerated with DPDK (e.g., with action
NORMAL). This patch improves the situation by checking the TAP
interface's state before issuing any syscall.

However, some use cases move interfaces to other network
namespaces, and in that case OVS cannot retrieve the interface
state (and sets it to DOWN). Skipping sends would then stop the
traffic and break those use cases. This patch therefore relies on
netlink notifications to find out whether the device is local. When
it is local, the device state is checked; otherwise, OVS behaves as
before.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ofproto: Fix wrong datapath flow with same in_port and output port.

In my test, a new datapath flow whose in_port and output port were the
same was found using ovs-appctl dpctl/dump-flows. As a result, the MAC
address moved from one port to another and back again in the physical
switch, which made the VMs' traffic abnormal.

My test key steps:

    1) There are three VMs using OVS bridges with Intel 82599 NICs as
    uplink ports, deployed on different hosts connected to the same
    physical switch. They are named VM-A, VM-B and VM-C, running on
    Host-A, Host-B and Host-C respectively.

    2) VM-A sends many unicast packets to VM-B, and VM-B also sends
    unicast packets to VM-A.

    3) VM-C pings VM-A continuously, while OVS ports are added to and
    deleted from the Host-C OVS bridge.

    4) In some abnormal scenario, the physical switch clears all the MAC
    entries on each port. The uplink port of the Host-C OVS bridge then
    receives packets in both directions (VM-A to VM-B, and VM-B to VM-A).

The expected result is that these packets in both directions should be
dropped at the uplink port, because according to the MAC learning table
that the OVS bridge builds with NORMAL rules, the destination port of
these packets is the uplink port, which is also their source port. In
fact, however, some packets were sent back out of the uplink port to the
physical switch. VM-A's MAC then moved from Host-A's port to Host-C's
port on the physical switch, and as a result VM-C's ping to VM-A failed.
When this problem occurs, the abnormal datapath flow
"in_port(2) actions:2" can be found by executing the command
"ovs-appctl dpctl/dump-flows".

Currently, xlate_normal() compares xbundle pointers to check whether the
packet's destination port is the same as its input port. This check can
be wrong when xlate_txn_start()/xlate_txn_commit() run in type_run() at
the same time, because the xcfg/xbridge/xbundle objects are reallocated
and copied just before the destination mac_port and mac_xbundle are
looked up. In that case, mac_xbundle and in_xbundle both refer to the
uplink port but are different object pointers.

This bug can be fixed by also comparing the ofbundle, as this patch does.
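A minimal sketch of why comparing the underlying bundle helps. The types below are hypothetical stand-ins, not ofproto-dpif's real definitions; they only model the relationship in which two xbundle copies from different xcfg generations point to the same long-lived ofbundle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical long-lived bundle (one per port). */
struct ofbundle {
    int id;
};

/* Hypothetical per-xcfg copy; a new one is made on each reconfiguration. */
struct xbundle {
    struct ofbundle *ofbundle;
};

/* Pointer equality alone fails when a and b come from different xcfg
 * generations, so also compare the ofbundle they were copied from. */
static bool
same_bundle(const struct xbundle *a, const struct xbundle *b)
{
    return a == b || (a && b && a->ofbundle == b->ofbundle);
}
```

With only the `a == b` test, the two copies of the uplink bundle would be treated as different ports and the packet could be output back to where it came from.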

Signed-off-by: Lilijun <jerry.lilijun@huawei.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

dpif: geneve: supply dpif function to get ifindex

Geneve tunnels are not given a netdev_class function to determine their
ifindex. This means that when ofproto-dpif attempts to add a geneve netdev,
it fails in 'netdev_ports_insert' in netdev.c. Failure to add this means
that further operations like offloading a rule that egresses to a geneve
port will be rejected as the egress port cannot be found. This patch
applies the same ifindex function to geneve as is used in vxlan.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Acked-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>

system-traffic: Add conntrack floating IP test

This test case uses floating IP (FIP) addresses for each endpoint. If
the destination is a FIP, the packet will undergo a transformation of
the form (dst=FIP, src=non-FIP) --> (dst=non-FIP, src=FIP) before
egress. Otherwise the packet is untouched.

This exercises the ct_clear action in the datapath.

Signed-off-by: Eric Garver <e@erig.me>
Acked-by: William Tu <u9012063@gmail.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Justin Pettit <jpettit@ovn.org>

system-common-macros: Check for ct_clear action in datapath

Add a new macro, OVS_CHECK_CT_CLEAR(), to check whether the ct_clear
action is supported by the datapath.

Signed-off-by: Eric Garver <e@erig.me>
Tested-by: William Tu <u9012063@gmail.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Justin Pettit <jpettit@ovn.org>

dpif: Add support for OVS_ACTION_ATTR_CT_CLEAR

This supports using the ct_clear action in the kernel datapath. To
preserve compatibility with current ct_clear behavior on old kernels, we
only pass this action down to the datapath if a probe reveals the
datapath actually supports it.

Signed-off-by: Eric Garver <e@erig.me>
Acked-by: William Tu <u9012063@gmail.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Justin Pettit <jpettit@ovn.org>

Merge branch 'dpdk_merge' of https://github.com/istokes/ovs into HEAD

dpif-netlink-rtnl: Work around MTU bug in kernel GRE driver.

The kernel GRE driver ignores IFLA_MTU in RTM_NEWLINK requests and
overrides the MTU to 1472 bytes. This commit works around the problem by
following up a request to create a GRE device with a second request to set
the MTU.
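The workaround amounts to a two-step sequence. The sketch below uses a hypothetical in-memory mock instead of real RTM_NEWLINK/RTM_SETLINK messages; mock_newlink_gre() simply models the driver ignoring the requested MTU, and none of these functions exist in OVS.

```c
#include <assert.h>

/* Hypothetical mock of a kernel link; no netlink messages are sent. */
struct mock_link {
    int mtu;
};

/* Models the driver bug: the MTU in the "create" request is ignored. */
static int
mock_newlink_gre(struct mock_link *link, int requested_mtu)
{
    (void) requested_mtu;
    link->mtu = 1472;
    return 0;
}

/* Models a separate "set MTU" request, which the driver honors. */
static int
mock_setlink_mtu(struct mock_link *link, int mtu)
{
    link->mtu = mtu;
    return 0;
}

/* Workaround: always follow the create request with a set-MTU request. */
static int
create_gre_with_mtu(struct mock_link *link, int mtu)
{
    int error = mock_newlink_gre(link, mtu);
    if (!error) {
        error = mock_setlink_mtu(link, mtu);
    }
    return error;
}
```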

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1488484
Reported-by: Eric Garver <e@erig.me>
Reported-by: James Page <james.page@ubuntu.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Eric Garver <e@erig.me>
Tested-by: James Page <james.page@ubuntu.com>

dpif-netlink-rtnl: Use 65000 instead of 65535 as tunnel MTU.

Most of the existing tunnels accept 65535 for MTU and internally reduce it
to the maximum value actually supported. However, in RTM_SETLINK calls,
at least GRE tunnels reject an MTU larger than the one actually supported. This
commit changes the MTU used in RTM_NEWLINK calls to use a value that should
be acceptable to all tunnels and yet does not noticeably reduce
performance.

(This code doesn't actually use RTM_SETLINK to change MTU yet, but that's
coming up.)

Suggested-by: Eric Garver <e@erig.me>
Suggested-at: https://mail.openvswitch.org/pipermail/ovs-dev/2018-January/343304.html
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Eric Garver <e@erig.me>
Tested-by: James Page <james.page@ubuntu.com>

Documentation: Document optional RHEL7 repositories

On minimal install RHEL 7 servers (and perhaps other types of installs)
you need to enable a couple of optional repositories for the yum-builddep
utility to work correctly. This patch documents those two optional
repositories.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

odp-util: Fix compiler warning.

The result of a ternary operation will be promoted at least to int type.
As a result, the compiler may generate a warning such as: format specifies
type 'unsigned char' but the argument has type 'int'

Found with Apple LLVM version 8.1.0 (clang-802.0.42).

Squelch this by preferring the %d format specifier to print 1/0 values.
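A small illustration of the promotion rule, assuming a C11 compiler for _Static_assert. format_flag() is a hypothetical helper, not the odp-util code; it only shows that %d matches the type the ternary actually produces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Both operands are unsigned char, yet the usual arithmetic conversions
 * promote the result of ?: to int. */
_Static_assert(sizeof(1 ? (unsigned char) 1 : (unsigned char) 0) == sizeof(int),
               "?: result is promoted to int");

/* Hypothetical helper: prints a 1/0 flag with %d, which matches the
 * promoted int argument (an unsigned char specifier such as %hhu is the
 * kind of mismatch clang warns about). */
static int
format_flag(char *buf, size_t size, bool flag)
{
    return snprintf(buf, size, "flag=%d", flag ? 1 : 0);
}
```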

Fixes: 74c4530dca93 ("ofproto-dpif: Don't slow-path controller actions with pause.")
Cc: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Aaron Conole <aconole@bytheb.org>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Tested-by: Ian Stokes <ian.stokes@intel.com>
Signed-off-by: Justin Pettit <jpettit@ovn.org>

rhel: Ensure proper OVS kernel modules load - rhel6

Patch c49889cf3e "rhel: Ensure proper OVS kernel modules load after upgrade"
did not address the RHEL 6 kmod rpm spec file. This patch addresses
that error.

Fixes: c49889cf3e ("rhel: Ensure proper OVS kernel modules...")
CC: Ansis Atteka <ansisatteka@gmail.com>
CC: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Ansis Atteka <aatteka@ovn.org>

Prepare for post-2.9.0 (2.9.90).

Signed-off-by: Justin Pettit <jpettit@ovn.org>

Prepare for 2.9.0.

Signed-off-by: Justin Pettit <jpettit@ovn.org>

Documentation: Update Faucet tutorial.

Drop the use of minimum_ip_size_check from the Faucet tutorial. It is no
longer needed now that we have fixed a bug that caused packet length
checks to be calculated incorrectly.

Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-dpdk: add vhost-user get_status.

Expose relevant vhost-user information in status.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

NEWS: Add entry for new appctl dpif-netdev/pmd-rxq-rebalance.

This feature was added earlier, but we thought it better to
advertise it in NEWS after stats were available to help
the user decide whether to use it.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Add percentage of pmd/core used by each rxq.

The percentage is based on the length of history that is stored
for an rxq (currently 1 minute).

$ ovs-appctl dpif-netdev/pmd-rxq-show
pmd thread numa_id 0 core_id 4:
        isolated : false
        port: dpdkphy1         queue-id:  0    pmd usage: 70 %
        port: dpdkvhost0       queue-id:  0    pmd usage:  0 %
pmd thread numa_id 0 core_id 6:
        isolated : false
        port: dpdkphy0         queue-id:  0    pmd usage: 64 %
        port: dpdkvhost1       queue-id:  0    pmd usage:  0 %

These are the values that would be used for rxq-to-pmd assignment
after a reconfiguration event, e.g. adding pmds or adding rxqs,
or when running the command:

ovs-appctl dpif-netdev/pmd-rxq-rebalance
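One way to compute such a percentage, as a sketch: sum the rxq's stored interval cycle counts and divide by the pmd's total cycles over the same period. N_INTERVALS and the function names are hypothetical; the real interval layout in dpif-netdev may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_INTERVALS 6   /* hypothetical: e.g. six 10 s intervals, about 1 min */

/* Sum of the processing cycles recorded for an rxq over the stored history. */
static uint64_t
rxq_history_cycles(const uint64_t hist[N_INTERVALS])
{
    uint64_t sum = 0;
    for (size_t i = 0; i < N_INTERVALS; i++) {
        sum += hist[i];
    }
    return sum;
}

/* Integer percentage of the pmd's cycles consumed by one rxq.
 * Returns 0 when no pmd cycles were recorded. */
static unsigned int
rxq_pmd_usage_pct(const uint64_t hist[N_INTERVALS], uint64_t pmd_total_cycles)
{
    if (pmd_total_cycles == 0) {
        return 0;
    }
    return (unsigned int) (rxq_history_cycles(hist) * 100 / pmd_total_cycles);
}
```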

Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Co-authored-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Reset the rxq current cycle counter on reload.

An rxq may have processing cycles counted in the current
counter when a reload happens. That could temporarily create
a small skew on the stats for an rxq. Reset the counter after
reload.

Fixes: 4809891b2e01 ("dpif-netdev: Count the rxq processing cycles for an rxq.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

NEWS: Mark output packet batching support.

The new feature should be mentioned in NEWS, especially because it
has user-visible configuration options.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

docs: Describe output packet batching in DPDK guide.

Add information about output packet batching and how to
configure 'tx-flush-interval'.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Co-authored-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Time based output batching.

This allows packets from more than one RX burst to be collected
and sent together at a configurable interval.

'other_config:tx-flush-interval' can be used to configure the
maximum time that a packet can wait in an output batch before
being sent.

'tx-flush-interval' has microsecond resolution.
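A minimal sketch of a time-based flush policy, assuming the deadline is set when the first packet enters an empty batch. struct out_batch, HYP_MAX_BATCH and the helper names are hypothetical; times are in microseconds to match the tx-flush-interval resolution.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HYP_MAX_BATCH 32   /* hypothetical batch capacity */

/* Hypothetical output batch: only the bookkeeping, no packets. */
struct out_batch {
    int count;                /* packets currently queued */
    uint64_t flush_time_us;   /* deadline set when the first packet is queued */
};

/* Queues one packet; an empty batch starts its deadline now. */
static void
batch_add(struct out_batch *b, uint64_t now_us, uint64_t interval_us)
{
    if (b->count == 0) {
        b->flush_time_us = now_us + interval_us;
    }
    b->count++;
}

/* True once the oldest packet has waited interval_us or the batch is full. */
static bool
batch_should_flush(const struct out_batch *b, uint64_t now_us)
{
    return b->count > 0
           && (now_us >= b->flush_time_us || b->count >= HYP_MAX_BATCH);
}

/* "Sends" (here: just empties) the batch and returns how many were queued. */
static int
batch_flush(struct out_batch *b)
{
    int n = b->count;
    b->count = 0;
    return n;
}
```

In this sketch, an interval of 0 makes the deadline equal to the enqueue time, so the batch is flushed at the next check, which approximates sending without time-based batching.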

Tested-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Count cycles on per-rxq basis.

Upcoming time-based output batching will allow packets from different
RX queues to be collected in a single output batch. Keep the list of
RX queues for each output packet and collect cycles for them on send.
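A sketch of attributing send cycles back to the rxqs of the batched packets. It assumes an even split of the measured send cycles across packets, which is one simple policy; struct rxq_sketch and attribute_send_cycles() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rxq with only a processing-cycles counter. */
struct rxq_sketch {
    uint64_t cycles;
};

/* Splits the cycles measured for one batch send evenly across its packets
 * (integer division; any remainder is dropped) and credits each share to
 * the rxq that the packet was received on. */
static void
attribute_send_cycles(struct rxq_sketch *pkt_rxqs[], size_t n_pkts,
                      uint64_t send_cycles)
{
    if (n_pkts == 0) {
        return;
    }
    uint64_t per_pkt = send_cycles / n_pkts;
    for (size_t i = 0; i < n_pkts; i++) {
        if (pkt_rxqs[i]) {
            pkt_rxqs[i]->cycles += per_pkt;
        }
    }
}
```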

Tested-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Use microsecond granularity.

Upcoming time-based output batching will require microsecond
granularity for its flexible configuration.

Acked-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Refactor cycle counting

Simplify the historically grown TSC cycle counting in PMD threads.
Cycles are currently counted for the following purposes:

1. Measure PMD utilization

PMD utilization is defined as the ratio of cycles spent in busy iterations
(at least one packet received or sent) over the total number of cycles.

This is already done in pmd_perf_start_iteration() and
pmd_perf_end_iteration() based on a TSC timestamp saved in current
iteration at start_iteration() and the actual TSC at end_iteration().
No dependency on intermediate cycle accounting.

2. Measure the processing load per RX queue

This comprises cycles spent on polling and processing packets received
from the rx queue and the cycles spent on delayed sending of these packets
to tx queues (with time-based batching).

The previous scheme using cycles_count_start(), cycles_count_intermediate()
and cycles_count_end(), originally introduced to simplify cycle counting
and save calls to rte_get_tsc_cycles(), was rather obscuring things.

Replace it with a nestable cycle_timer with start and stop functions that
enclose a code segment to be timed. The timed code may contain arbitrary
nested cycle_timers. The duration of nested timers is excluded from the
outer timer.

The caller must ensure that each call to cycle_timer_start() is
followed by a call to cycle_timer_end(). Failure to do so will lead to
assertion failure or a memory leak.
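A minimal sketch of a nestable timer in which nested durations are excluded from the enclosing timer. The clock value is passed in explicitly so the example is deterministic (the real code reads the TSC), and all names here are hypothetical rather than the dpif-netdev-perf API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nestable timer. */
struct cycle_timer_sketch {
    uint64_t start;
    uint64_t suspended;                  /* cycles consumed by nested timers */
    struct cycle_timer_sketch *parent;   /* enclosing timer, if any */
};

/* Stack of running timers; top is the innermost one. */
struct timer_stack {
    struct cycle_timer_sketch *top;
};

static void
timer_start(struct timer_stack *s, struct cycle_timer_sketch *t, uint64_t now)
{
    t->start = now;
    t->suspended = 0;
    t->parent = s->top;
    s->top = t;
}

/* Returns the cycles spent in t itself, excluding nested timers, and
 * charges t's whole duration to its parent as "suspended" time. */
static uint64_t
timer_stop(struct timer_stack *s, struct cycle_timer_sketch *t, uint64_t now)
{
    assert(s->top == t);   /* start/stop calls must be properly nested */
    uint64_t total = now - t->start;
    s->top = t->parent;
    if (t->parent) {
        t->parent->suspended += total;
    }
    return total - t->suspended;
}
```

The assertion in timer_stop() mirrors the requirement above: a start without a matching stop leaves the stack unbalanced, and the next stop on an outer timer trips it.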

The new cycle_timer is used to measure the processing cycles per rx queue.
This is not yet strictly necessary but will be made use of in a subsequent
commit.

All cycle count functions and data are relocated to the module
dpif-netdev-perf.

Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Refactor PMD performance into dpif-netdev-perf

Add module dpif-netdev-perf to host all PMD performance-related
data structures and functions in dpif-netdev. Refactor the PMD
stats handling in dpif-netdev and delegate as much as possible to
the new module, using clean interfaces to shield dpif-netdev from
the implementation details. Accordingly, all PMD statistics
members are moved from the main struct dp_netdev_pmd_thread into
a dedicated member of type struct pmd_perf_stats.

Include Darrell's prior refactoring of PMD stats contained in
[PATCH v5,2/3] dpif-netdev: Refactor some pmd stats:

1. The cycles per packet counts are now based on packets
received rather than packet passes through the datapath.

2. Packet counters are now kept for packets received and
packets recirculated. These are kept as separate counters for
maintainability reasons. The cost of incrementing these counters
is negligible. These new counters are also displayed to the user.

3. A display statistic is added for the average number of
datapath passes per packet. This should be useful for user
debugging and understanding of packet processing.

4. The user visible 'miss' counter is used for successful upcalls,
rather than the sum of successful and unsuccessful upcalls. Hence,
this becomes what users have historically understood as an OVS 'miss upcall'.
The user display is annotated to make this clear as well.

5. The user visible 'lost' counter remains as failed upcalls, but
is annotated to make it clear what the meaning is.

6. The enum pmd_stat_type is annotated to make the usage of the
stats counters clear.

7. The subtable lookup stat is renamed to make it clear that it
relates to masked lookups.

8. The PMD stats test is updated to handle the new user stats of
packets received, packets recirculated and average number of datapath
passes per packet.
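Items 1 to 3 above boil down to simple ratios. A sketch, assuming that every received packet makes one datapath pass and every recirculation adds one more; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Average datapath passes per received packet: one pass per packet plus
 * one per recirculation. Returns 0 if nothing was received. */
static double
avg_passes_per_packet(uint64_t received, uint64_t recirculated)
{
    return received
           ? (double) (received + recirculated) / (double) received
           : 0.0;
}

/* Cycles per packet based on packets received, not on datapath passes. */
static double
cycles_per_packet(uint64_t cycles, uint64_t received)
{
    return received ? (double) cycles / (double) received : 0.0;
}
```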

On top of that introduce a "-pmd <core>" option to the PMD info
commands to filter the output for a single PMD.

Make the pmd-stats-show output a bit more readable by adding a space
between the colon and the value.

Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com>
Co-authored-by: Darrell Ball <dlu998@gmail.com>
Signed-off-by: Darrell Ball <dlu998@gmail.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Billy O'Mahony <billy.o.mahony@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

netdev-dpdk: fix ingress_policer leak on error path

Fix memory leak by freeing the policer if rte_meter_srtcm_config fails.
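The shape of the fix, as a sketch: release the allocation on the error path before returning. struct policer_sketch and meter_config_sketch() are hypothetical stand-ins for the policer struct and rte_meter_srtcm_config(), not the netdev-dpdk code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical policer; stands in for the DPDK ingress policer struct. */
struct policer_sketch {
    int rate;
};

/* Hypothetical stand-in for rte_meter_srtcm_config(): fails on bad input. */
static int
meter_config_sketch(struct policer_sketch *p, int rate)
{
    if (rate <= 0) {
        return -1;
    }
    p->rate = rate;
    return 0;
}

static struct policer_sketch *
policer_create(int rate)
{
    struct policer_sketch *p = malloc(sizeof *p);
    if (!p) {
        return NULL;
    }
    if (meter_config_sketch(p, rate) != 0) {
        free(p);   /* the fix: do not leak p when configuration fails */
        return NULL;
    }
    return p;
}
```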

Fixes: 9509913aa722 ("netdev-dpdk.c: Add ingress-policing functionality.")
Signed-off-by: zhangliping <zhangliping02@baidu.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

rhel: Add the new ovsdb manpages to %files list (also for RHEL)

Currently, "rpmbuild -bb rhel/openvswitch.spec" doesn't work correctly
since the new ovsdb manpages (ovsdb.5, ovsdb.7 and ovsdb-server.7) were
added.

This patch adds the new ovsdb manpages in the %files list in the spec
file.

CC: Ben Pfaff <blp@ovn.org>
Fixes: 12b84d50e032 ("ovsdb: Improve documentation.")
Signed-off-by: Ansis Atteka <aatteka@ovn.org>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>

rhel: add missing mandatory build dependencies

autoconf, automake and libtool are required for ./boot.sh.

python-sphinx is required to prevent an error where ovs-test.8 is
otherwise not generated.

Signed-off-by: Ansis Atteka <aatteka@ovn.org>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>

rhel: Ensure proper OVS kernel modules load after upgrade

Add post install and post un-install scripts to make sure that the
openvswitch kernel modules are correctly written with the weak-modules
utility. This ensures that after an upgrade to a newer kernel the
correct openvswitch kernel modules from a previous installation will
be found by the depmod search path.

Suggested-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Ansis Atteka <aatteka@ovn.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>

flake8: Ignore bare except violations

Newer versions of flake8 (3.5.0, mccabe: 0.6.1, pycodestyle: 2.3.1,
pyflakes: 1.6.0) add an error code (E722) for bare 'except' clauses. The
OvS codebase does use bare excepts in places, especially where the
specific exception isn't important (e.g., the program will be
terminating anyway).

Without this change, the following error messages appear:
   utilities/checkpatch.py:476:5: E722 do not use bare except'
   utilities/checkpatch.py:514:5: E722 do not use bare except'
   utilities/ovs-dev.py:189:5: E722 do not use bare except'
   utilities/ovs-dev.py:192:9: E722 do not use bare except'
   utilities/ovs-dev.py:197:5: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:360:13: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:434:5: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:470:13: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:609:9: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:679:5: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:712:13: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:744:9: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:751:9: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:825:5: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:1006:13: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:1041:13: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:1079:5: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:1202:5: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:1247:9: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:1257:13: E722 do not use bare except'
   utilities/bugtool/ovs-bugtool.in:1328:9: E722 do not use bare except'
   tests/test-daemon.py:60:5: E722 do not use bare except'
   tests/test-l7.py:23:1: E722 do not use bare except'
   tests/test-unixctl.py:96:5: E722 do not use bare except'
   xenserver/usr_share_openvswitch_scripts_ovs-xapi-sync:404:5: E722 do not use bare except'
   python/ovs/fcntl_win.py:39:9: E722 do not use bare except'
   python/ovs/poller.py:38:1: E722 do not use bare except'
   python/ovs/socket_util.py:151:13: E722 do not use bare except'
   python/ovs/stream.py:169:17: E722 do not use bare except'
   python/ovs/stream.py:578:17: E722 do not use bare except'
   python/ovs/timeval.py:51:1: E722 do not use bare except'
   python/ovstest/util.py:52:5: E722 do not use bare except'
   vtep/ovs-vtep:767:5: E722 do not use bare except'

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovs-tcpundump: fix a conversion issue

When I tried using ovs-tcpundump, I got the following error message:

    Traceback (most recent call last):
      File "./ovs-tcpundump", line 64, in <module>
        if m is None or int(m.group(1)) == 0:
    ValueError: invalid literal for int() with base 10: '00a0'
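The traceback shows a hexadecimal offset ('00a0') being parsed as decimal. Presumably the fix parses it as base 16 (in Python, int(m.group(1), 16)). The same idea in C, using strtol() with base 16; parse_hex_field() is a hypothetical helper, not part of ovs-tcpundump.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parses a hex field such as "00a0". Sets *ok to 1 only if the whole
 * string was a valid base-16 number that fits in a long. */
static long
parse_hex_field(const char *s, int *ok)
{
    char *end;
    errno = 0;
    long v = strtol(s, &end, 16);
    *ok = end != s && *end == '\0' && errno == 0;
    return v;
}
```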

Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-native-tnl: Add assertion in vxlan_pop_header.

During tunnel decapsulation the below steps are performed:
[1] Tunnel information is populated in packet metadata, i.e. packet->md->tunnel.
[2] Outer header gets popped.
[3] Packet is recirculated.

For [1] to work, the dp_packet L3 and L4 header offsets should be valid.
The offsets in the dp_packet are set as part of miniflow extraction.

If the offsets are accidentally reset, or the pop header operation is
performed before miniflow extraction, step [1] fails silently and creates
issues that are harder to debug. Add an assertion to check that the
offsets are valid.
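A sketch of the kind of check being added, assuming (as a hypothetical convention here) that an unset offset is marked with UINT16_MAX. struct pkt_offsets and the helpers are illustrative, not the dp_packet API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OFS_UNSET UINT16_MAX   /* hypothetical "offset not set" marker */

/* Hypothetical subset of packet header offsets (bytes from packet start). */
struct pkt_offsets {
    uint16_t l3_ofs;
    uint16_t l4_ofs;
};

static bool
offsets_valid(const struct pkt_offsets *p)
{
    return p->l3_ofs != OFS_UNSET && p->l4_ofs != OFS_UNSET;
}

/* Uses the offsets the way tunnel metadata extraction would, but asserts
 * first, so a missing miniflow extraction fails loudly, not silently. */
static unsigned int
l3_header_len_checked(const struct pkt_offsets *p)
{
    assert(offsets_valid(p));
    return (unsigned int) (p->l4_ofs - p->l3_ofs);
}
```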

Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

docs: Recommend newer version of "sparse".

The previously recommended version of sparse, version 0.4.4, does not
support -Wsparse-error properly, so configuring with --enable-Werror and
--enable-sparse will not have the desired effect of breaking the build
when sparse reports an error. Version 0.5.1 and later do implement this
properly.

This commit also updates the recommended URL for sparse because the
previous URL doesn't have the newer releases.

Reported-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>

configure: New --enable-sparse option to enable sparse checking by default.

Until now, "make" called sparse to do checking only if C=1 was passed on
the command line. It was easy for developers to forget to specify that.
This commit adds another option: specifying --enable-sparse on the
configure command line enables sparse checking by default. (It can still
be disabled with C=0.)

Requested-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>

packets: Prefetch the packet metadata in cacheline1.

pkt_metadata_prefetch_init() is used to prefetch the packet metadata
before initializing the metadata in pkt_metadata_init(). This is done
for every packet in userspace datapath and is performance critical.

Commit 99fc16c0 prefetches only cacheline0 and cacheline2 as the metadata
part of respective cachelines will be initialized by pkt_metadata_init().

However in VXLAN case when popping the vxlan header, netdev_vxlan_pop_header()
invokes pkt_metadata_init_tnl() which zeroes out metadata part of
cacheline1 that wasn't prefetched earlier and causes performance
degradation.

By prefetching cacheline1, a 9% performance improvement is observed in
the vxlan decapsulation test case with 118-byte packets. The amount of
improvement varies with CFLAGS:

       CFLAGS="-O2"                CFLAGS="-O2 -msse4.2"
  Master      4.667 Mpps         Master       4.710 Mpps
  With Patch  5.045 Mpps         With Patch   5.097 Mpps

  CFLAGS="-O2 -march=native"     CFLAGS="-Ofast -march=native"
  Master      5.072 Mpps         Master       5.349 Mpps
  With Patch  5.193 Mpps         With Patch   5.378 Mpps

Fixes: 99fc16c0 ("Reorganize the pkt_metadata structure.")
Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Acked-by: Sugesh Chandran <sugesh.chandran@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

doc: Update configure section with popcnt details.

The POPCNT instruction can be used to speed up hash computation on
processors that support it.
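For illustration, GCC and Clang expose population count through __builtin_popcountll(). When the compiler is allowed to target POPCNT (for example with -mpopcnt or a suitable -march), this can compile to a single instruction; otherwise the compiler uses a software fallback. popcount64() is a hypothetical wrapper, not an OVS function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper around the GCC/Clang builtin; whether it becomes a
 * single POPCNT instruction depends on the target flags used to compile. */
static unsigned int
popcount64(uint64_t x)
{
    return (unsigned int) __builtin_popcountll(x);
}
```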

Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

nsh: add dec_nsh_ttl action

NSH TTL is a 6-bit field ranging from 0 to 63. It should be
decremented by 1 at every hop. If it is 0, or becomes 0 after
being decremented, the packet should be dropped and a packet-in
message sent to the main controller.
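The TTL rule above can be sketched as follows, modeling the TTL as a plain 0 to 63 value. The verdict enum and dec_nsh_ttl_sketch() are hypothetical, not the OVS action implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NSH_TTL_MAX 63   /* 6-bit field */

enum nsh_ttl_verdict {
    NSH_TTL_FORWARD,         /* TTL decremented, keep processing */
    NSH_TTL_DROP_AND_PUNT    /* drop and send a packet-in to the controller */
};

static enum nsh_ttl_verdict
dec_nsh_ttl_sketch(uint8_t *ttl)
{
    assert(*ttl <= NSH_TTL_MAX);
    if (*ttl <= 1) {
        /* Already 0, or would become 0 after decrementing. */
        return NSH_TTL_DROP_AND_PUNT;
    }
    (*ttl)--;
    return NSH_TTL_FORWARD;
}
```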

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

nsh: fix nested mask for OVS_KEY_ATTR_NSH

The NSH kernel implementation uses a nested mask for OVS_KEY_ATTR_NSH,
so NSH userspace must adapt to it. OVS has not used a nested mask for
any key attribute so far; OVS_KEY_ATTR_NSH is the first use case.

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>