git.proxmox.com Git - ovs.git/log

datapath: Kbuild: Add kcompat.h header to front of NOSTDINC

Since this commit in the Linux upstream kernel:
'commit 9b9a3f20cbe0 ("kbuild: split final module linking out into Makefile.modfinal")'
The openvswitch kernel module fails to build against the upstream
Linux kernel. The cause of the build failure is that the include of the
KBUILD_EXTMOD variable was dropped in Makefile.modfinal when
it was split out from Makefile.modpost.  Our Kbuild was setting
the ccflags-y variable to include our kcompat.h header as the
first header file.  The Linux kernel maintainer has said that
it is incorrect to rely on the ccflags-y variable for the modfinal
phase of the build so that is why KBUILD_EXTMOD is not included.

We fix this by breaking a different Linux kernel make rule.  We
add '-include $(builddir)/kcompat.h' to the front of the NOSTDINC
variable setting in our Kbuild makefile.

As noted already in the comment for the NOSTDINC setting:
\# These include directories have to go before -I$(KSRC)/include.
\# NOSTDINC_FLAGS just happens to be a variable that goes in the
\# right place, even though it's conceptually incorrect.

So we continue the misuse of the NOSTDINC variable to fix this
issue as well.

The assumption of the Linux kernel maintainers is that any
local, out-of-tree build include files can be added to the end
of the command line. In our case that is wrong of course, but
there is nothing we can do about it that I know of other than using
some utility like unifdef to strip out offending chunks of our
compatibility layer code before invocation of Makefile.modfinal.
That is a big change that would take a lot of work to implement.

We could ask the Linux kernel maintainers to provide some
way for out-of-tree kernel modules to include their own header
files first in a proper manner. I consider that to be a very
low probability of success but something we could ask about.

For now we cheat and take the easy way out.

Reported-by: David Ahern <dsahern@gmail.com>
Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

datapath: Use sizeof_field macro

Upstream commit:
    commit c593642c8be046915ca3a4a300243a68077cd207
    Author: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
    Date:   Mon Dec 9 10:31:43 2019 -0800

    treewide: Use sizeof_field() macro

    Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
    at places where these are defined. Later patches will remove the unused
    definition of FIELD_SIZEOF().

    This patch is generated using following script:

    EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

    git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
    do

     if [[ "$file" =~ $EXCLUDE_FILES ]]; then
     continue
     fi
     sed -i  -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
    done

Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
Also added a compatibility layer macro for older kernels that still
use FIELD_SIZEOF

Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat: Remove flex_array code

Flex array support is removed since kernel 5.1. Convert to use
kvmalloc_array instead.

Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat: Move genl_ops policy to genl_family

Upstream commit:
    commit 3b0f31f2b8c9fb348e4530b88f6b64f9621f83d6
    Author: Johannes Berg <johannes.berg@intel.com>
    Date:   Thu Mar 21 22:51:02 2019 +0100

    genetlink: make policy common to family

    Since maxattr is common, the policy can't really differ sanely,
    so make it common as well.

    The only user that did in fact manage to make a non-common policy
    is taskstats, which has to be really careful about it (since it's
    still using a common maxattr!). This is no longer supported, but
    we can fake it using pre_doit.

    This reduces the size of e.g. nl80211.o (which has lots of commands):

       text    data     bss     dec     hex filename
     398745   14323    2240 415308   6564c net/wireless/nl80211.o (before)
     397913   14331    2240 414484   65314 net/wireless/nl80211.o (after)
    --------------------------------
       -832      +8       0    -824

    Which is obviously just 8 bytes for each command, and an added 8
    bytes for the new policy pointer. I'm not sure why the ops list is
    counted as .text though.

    Most of the code transformations were done using the following spatch:
        @ops@
        identifier OPS;
        expression POLICY;
        @@
        struct genl_ops OPS[] = {
        ...,
         {
        - .policy = POLICY,
         },
        ...
        };

        @@
        identifier ops.OPS;
        expression ops.POLICY;
        identifier fam;
        expression M;
        @@
        struct genl_family fam = {
                .ops = OPS,
                .maxattr = M,
        +       .policy = POLICY,
                ...
        };

    This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
    the cb->data as ops, which we want to change in a later genl patch.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 3b0f31f2b8c9f ("genetlink: make policy common to family")
the policy field of the genl_ops structure has been moved into the
genl_family structure.  Add necessary compat layer infrastructure
to still support older kernels.

Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat: Fix up changes to inet frags in 5.1+

Since Linux kernel release 5.1 the fragments field of the inet_frag_queue
structure is removed and now only the rb_fragments structure with an
rb_node pointer is used for both ipv4 and ipv6. In addition, the
atomic_sub and atomic_add functions are replaced with their
equivalent long counterparts.

Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

acinclude: Enable Linux kernel 5.5

Acked-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft: Unset leader when starting election.

During election, there shouldn't be any leader. This change makes sure
that a server in candidate role always report leader as "unknown".

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft: Fix the problem of stuck in candidate role forever.

Sometimes a server can stay in candidate role forever, even if the server
already see the new leader and handles append-requests normally. However,
because of the wrong role, it appears as disconnected from cluster and
so the clients are disconnected.

This problem happens when 2 servers become candidates in the same
term, and one of them is elected as leader in that term. It can be
reproduced by the test cases added in this patch.

The root cause is that the current implementation only changes role to
follower when a bigger term is observed (in raft_receive_term__()).
According to the RAFT paper, if another candidate becomes leader with
the same term, the candidate should change to follower.

This patch fixes it by changing the role to follower when leader
is being updated in raft_update_leader().

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft: Fix next_index in install_snapshot reply handling.

When a leader handles install_snapshot reply, the next_index for
the follower should be log_start instead of log_end, because there
can be new entries added in leader's log after initiating the
install_snapshot procedure. Also, it should send all the accumulated
entries to follower in the following append-request message, instead
of sending 0 entries, to speed up the converge.

Without this fix, there is no functional problem, but it takes
uncessary extra rounds of append-requests responsed with "inconsistency"
by follower, although finally will be converged.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft: Send all missing logs in one single append_request.

When a follower needs to "catch up", leader can send N entries in
a single append_request instead of only one entry by each message.

The function raft_send_append_request() already supports this, so
this patch just calculate the correct "n" and use it.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft: Avoid sending unnecessary heartbeat when becoming leader.

When a node becomes leader, it sends out heartbeat to all followers
and then sends out another append-request for a no-op command
execution to all followers again immediately. This causes 2
continously append-requests sent out to each followers, and the first
heartbeat append-request is unnecessary. This patch removes the
heartbeat.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft: Avoid busy loop during leader election.

When a server doesn't see a leader yet, e.g. during leader re-election,
if a transaction comes from a client, it will cause 100% CPU busy loop.
With debug log enabled it is like:

2020-02-28T04:04:35.631Z|00059|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00062|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00065|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00068|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00071|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00074|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00077|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
...

The problem is that in ovsdb_trigger_try(), all cluster errors are treated
as temporary error and retry immediately. This patch fixes it by introducing
'run_triggers_now', which tells if a retry is needed immediately. When the
cluster error is with detail 'not leader', we don't immediately retry, but
will wait for the next poll event to trigger the retry. When 'not leader'
status changes, there must be a event, i.e. raft RPC that changes the
status, so the trigger is guaranteed to be triggered, without busy loop.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft: Fix raft_is_connected() when there is no leader yet.

If there is never a leader known by the current server, it's status
should be "disconnected" to the cluster. Without this patch, when
a server in cluster is restarted, before it successfully connecting
back to the cluster it will appear as connected, which is wrong.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovsdb-server: Don't disconnect clients after raft install_snapshot.

When "schema" field is found in read_db(), there can be two cases:
1. There is a schema change in clustered DB and the "schema" is the new one.
2. There is a install_snapshot RPC happened, which caused log compaction on the
server and the next log is just the snapshot, which always constains "schema"
field, even though the schema hasn't been changed.

The current implementation doesn't handle case 2), and always assume the schema
is changed hence disconnect all clients of the server. It can cause stability
problem when there are big number of clients connected when this happens in
a large scale environment.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft-rpc: Fix message format.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ofproto: Add support to watch controller port liveness in fast-failover group

Currently fast-failover group does not support checking liveness of controller
port (OFPP_CONTROLLER). However this feature can be useful for selecting
alternate pipeline when controller connection itself is down for e.g.
by using local DHCP server to reply for any DHCP request originating from VMs.

This patch adds the support for watching controller port liveness in fast-
failover group. Controller port is considered live when atleast one
of-connection is alive.

Example usage:

ovs-ofctl add-group br-int 'group_id=1234,type=ff,
bucket=watch_port:CONTROLLER,actions:<A>,
bucket=watch_port:1,actions:<B>

Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

release-process: Fix indentation.

Signed-off-by: Ben Pfaff <blp@ovn.org>

release-process: Describe how to branch and how to make a release.

Acked-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-dpdk: Remove deprecated ring port type.

'dpdkr' ring ports was deprecated in 2.13 release and was not
actually used for a long time. Remove support now.

More details in
commit b4c5f00c339b ("netdev-dpdk: Deprecate ring ports.")

Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

dpdk: Remove deprecated pdump support.

DPDK pdump was deprecated in 2.13 release and didn't actually
work since 2.11. Removing it.

More details in commit 4ae8c4617fd3 ("dpdk: Deprecate pdump support.")

Acked-by: Aaron Conole <aconole@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

ovs-dpctl-top: python3 compatibility

During the transition to python3 support, some syntax errors weren't
adequately cleaned. This addresses the various errors, plus one
minor issue with string type conversion.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1809184
Tested-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

packets: Fix typo in comment.

Acked-by: Han Zhou <hzhou@ovn.org>
Reported-by: Toms Atteka <tatteka@vmware.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-linux: Enable TSO in the TAP device.

Use ioctl TUNSETOFFLOAD if kernel supports to enable TSO
offloading in the tap device.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Reported-by: "Yi Yang (杨�D)-云服务集团" <yangyi01@inspur.com>
Tested-by: William Tu <u9012063@gmail.com>
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: William Tu <u9012063@gmail.com>

ofproto-dpif: Only delete tunnel backer ports along with the dpif.

The admin can choose whether or not to delete flows from datapaths when
they stop ovs-vswitchd.  The goal of not deleting flows it to allow
existing traffic to continue being forwarded until ovs-vswitchd is
restarted.  Until now, regardless of this choice, ovs-vswitchd has
always deleted tunnel ports from the datapath.  When flows are not
deleted, this nevertheless prevents tunnel traffic from being forwarded.

With this patch, ovs-vswitchd no longer deletes tunnel ports in the
case where it does not delete flows, allowing tunnel traffic to continue
being forwarded.

Reported-by: Antonin Bas <abas@vmware.com>
Tested-by: Antonin Bas <abas@vmware.com>
Tested-by: txfh2007 <txfh2007@aliyun.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovs-vswitchd: Fix typo in manpage.

Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

dpif-netdev.at: VLAN id modification test for ARP partial HW offloading.

Follow up to commit eb540c0f5fc8 ("flow: Fix parsing l3_ofs with partial
offloading") that fixed the issue, add a unit-test for it.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

dpif-netdev.at: Fix partial offloading test cases failure.

Some partial offloading test cases are failing inconsistently. The root
cause is that dummy netdev is assigned with "linux_tc" offloading API.
dpif-netdev - partial hw offload - dummy
dpif-netdev - partial hw offload - dummy-pmd
dpif-netdev - partial hw offload with packet modifications - dummy
dpif-netdev - partial hw offload with packet modifications - dummy-pmd

This patch fixes this issue by changing 'options:ifindex=1' to some big
value. It is a workaround to make "linux_tc" init flow api failure. All
above cases can pass consistently after applying this patch.

Suggested-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Gavin Hu <Gavin.Hu@arm.com>
Reviewed-by: Lijian Zhang <Lijian.Zhang@arm.com>
Signed-off-by: Yanqin Wei <Yanqin.Wei@arm.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

userspace TSO: SCTP checksum offload optional.

Ideally SCTP checksum offload needs be advertised by the
NIC when userspace TSO is enabled. However, very few drivers
do that and it's not a widely used protocol. So, this patch
enables SCTP checksum offload if available, otherwise userspace
TSO can still be enabled but SCTP packets will be dropped on
NICs without support.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

userspace TSO: Include UDP checksum offload.

Virtio doesn't expose flags to control which protocols checksum
offload needs to be enabled or disabled. This patch checks if the
NIC supports UDP checksum offload and active it when TSO is enabled.

Reported-by: Ilya Maximets <i.maximets@ovn.org>
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-dpdk: vhost: disable unsupported offload features.

Disable ECN and UFO since this is not supported yet. Also, disable
all other features when userspace_tso is not enabled.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

docs: Update conntrack established state description

Patch a867c010ee91 ("conntrack: Fix conntrack new state") fixes
the userspace conntrack behavior. This patch updates the
corresponding conntrack state description.

Fixes: a867c010ee91 ("conntrack: Fix conntrack new state")
Reported-by: Roni Bar Yanai <roniba@mellanox.com>
Acked-by: Roni Bar Yanai <roniba@mellanox.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>

conntrack: Fix TCP conntrack state

If a TCP connection is in SYN_SENT state, receiving another SYN packet
would just renew the timeout of that conntrack entry rather than create
a new one. Thus, tcp_conn_update() should return CT_UPDATE_VALID_NEW.

This also fixes regressions of a couple of OVN system tests.

Fixes: a867c010ee91 ("conntrack: Fix conntrack new state")
Reported-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Tested-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: William Tu <u9012063@gmail.com>

docs: Update DPDK version table

Signed-off-by: Tomasz Konieczny <tomaszx.konieczny@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

Set release date for 2.13.0.

The "Valentine's Day" release.

Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Documentation: Fix literal blocks formating

Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

docs: handle multi line headers for nroff

Before the fix, headers split into multiple lines were producing bogus
quote characters in nroff output and failed to indent headers properly.

Specifically, it fixes a header and its indentation in
ovn-architecture(7).

Signed-off-by: Ihar Hrachyshka <ihrachys@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

dp-packet: prefetch the next packet when cloning a batch.

There is a cache miss when accessing mbuf->data_off while cloning
a batch and using prefetch improved the throughput by ~2.3%.

Before: 13709416.30 pps
After: 14031475.80 pps

Fixes: d48771848560 ("dp-packet: preserve headroom when cloning a pkt batch")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

datapath-windows: Append tunnel info to upcall for correct template

Formerly, there is no tunnel information appended in the upcall’s
packet data, which is expected by IPFIX in userspace to calculate
the template for exporting the sampled flow record of on egress
tunnel port.
To fix this, during performing OvsOutputUserspaceAction(), we
would check whether it is initiated by the sampling on egress
tunnel which would be indicated by the attribute as
OVS_USERSPACE_ATTR_EGRESS_TUN_PORT in the nested attribute
list. If so, we would append the tunKey in OvsForwardingContext
indexed by OVS_PACKET_ATTR_EGRESS_TUN_KEY to the upcall.
Besides, at this point, the source transport port and source ip
address are not available in the structure, so we have to fill it in the
way how the packet would be capsulated during performing
OvsEncapGeneve(), which is following the
OvsOutputUserspaceAction() unfortunately.
I have tested the IPFIX functionality with the change, we could see the
template is correct and the expected tunnel information could be
packed in the IPFIX packet finally. The traffic for test is generated by
PING utility.

>From d727d051c9a44a4a93e5ee5f3da3ca9b125aad29 Mon Sep 17 00:00:00 2001
From: Amber Hu <qhu@vmware.com>
Date: Thu, 30 Jan 2020 18:01:32 -0800
Subject: [PATCH v3] datapath-windows: Append tunnel info to upcall for correct
template

Signed-off-by: Amber Hu <qhu@vmware.com>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
Signed-off-by: Alin Gabriel Serdean <aserdean@ovn.org>

AUTHORS: Add Martin Varghese.

Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-dpdk: Don't enable offloading on HW device if not requested.

DPDK drivers has different implementations of transmit functions.
Enabled offloading may cause driver to choose slower variant
significantly affecting performance if userspace TSO wasn't requested.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Reported-by: David Marchand <david.marchand@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-linux: Prepend the std packet in the TSO packet

Usually TSO packets are close to 50k, 60k bytes long, so to
to copy less bytes when receiving a packet from the kernel
change the approach. Instead of extending the MTU sized
packet received and append with remaining TSO data from
the TSO buffer, allocate a TSO packet with enough headroom
to prepend the std packet data.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Suggested-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-linux-private: fix max length to be 16 bits

The dp_packet length is limited to 16 bits, so document that
and fix the length value accordingly.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-dpdk: Fix port init when lacking Tx offloads for TSO.

The check on TSO capability did not ensure ip checksum, tcp checksum and
TSO tx offloads were available which resulted in a port init failure
(example below with a ena device):

*2020-02-04T17:42:52.976Z|00084|dpdk|ERR|Ethdev port_id=0 requested Tx
offloads 0x2a doesn't match Tx offloads capabilities 0xe in
rte_eth_dev_configure()*

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Reported-by: Ravi Kerur <rkerur@gmail.com>
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

vswitchd: Add serial number configuration.

Signed-off-by: Kirill A. Kornilov <kornilov@zelax.ru>
Signed-off-by: Ben Pfaff <blp@ovn.org>

flow: Fix parsing l3_ofs with partial offloading

l3_ofs should be set all Ethernet packets, not just IPv4/IPv6 ones.
For example for ARP over VLAN tagged packets, it may cause wrong
processing like in changing the VLAN ID action. Fix it.

Fixes: aab96ec4d81e ("dpif-netdev: retrieve flow directly from the flow mark")
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat: Remove HAVE_BOOL_TYPE

OVS only supports Linux kernels since 3.10 and all kernels since
then have the bool type. This check is unnecessary so remove it.

Passes Travis:
https://travis-ci.org/gvrose8192/ovs-experimental/builds/644103253

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

docs: Add header install command for afxdp.

The 'XDP_RING_NEED_WAKEUP' and related flags are defined if_xdp.h, so if
users are building their own kernel, users have to update the kernel's
header files, by doing:

  $ make headers_install INSTALL_HDR_PATH=/usr

Otherwise the following error shows:
/usr/local/include/bpf/xsk.h: In function 'xsk_ring_prod__needs_wakeup':
/usr/local/include/bpf/xsk.h:82:21: error: 'XDP_RING_NEED_WAKEUP' undeclared \
  (first use in this function)
  return *r->flags & XDP_RING_NEED_WAKEUP;

Reported-by: Tomek Osinski <osinstom@gmail.com>
Reported-at: https://osinstom.github.io/en/tutorial/ovs-afxdp-installation/
Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: William Tu <u9012063@gmail.com>

conntrack: Fix conntrack new state

In connection tracking system, a connection is established if we
see packets from both directions. However, in userspace datapath's
conntrack, if we send a connection setup packet in one direction
twice, it will make the connection to be in established state.

This patch fixes the aforementioned issue, and adds a system traffic
test for UDP and TCP traffic to avoid regression.

Fixes: a489b16854b59 ("conntrack: New userspace connection tracker.")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>

dpif: Fix dp_extra_info leak by reworking the allocation scheme.

dpctl module leaks the 'dp_extra_info' in case the dumped flow doesn't
fit the dump filter while executing dpctl/dump-flows and also while
executing dpctl/get-flow.

This is already a 3rd attempt to fix all the leaks and incorrect usage
of this string that definitely indicates poor initial design of the
feature.

Flow dump/get documentation clearly states that the caller does not own
the data provided in dpif_flow. Datapath still owns all the data and
promises to not free/modify it until the next quiescent period, however
we're requesting the caller to free 'dp_extra_info' and this obviously
breaks the rules.

This patch fixes the issue by by storing 'dp_extra_info' within
'struct dp_netdev_flow' making datapath to own it. 'dp_netdev_flow'
is RCU-protected, so it will be valid until the next quiescent period.

Fixes: 0e8f5c6a38d0 ("dpif-netdev: Modified ovs-appctl dpctl/dump-flows command")
Tested-by: Emma Finn <emma.finn@intel.com>
Acked-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

ofproto: Do not delete datapath flows on exit by default.

Commit e96a5c24e853 ("upcall: Remove datapath flows when setting
n-threads.") caused OVS to delete datapath flows when it exits through
any graceful means. This is not necessarily desirable, especially when
OVS is being stopped as part of an upgrade. This commit changes OVS so
that it only removes datapath flows when requested, via "ovs-appctl
exit --cleanup".

Acked-by: Numan Siddique <numans@ovn.org>
Tested-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ofproto-dpif-upcall: Get rid of udpif_synchronize().

RCU provides the semantics we want from udpif_synchronize() and it
should be much more lightweight than killing and restarting all the
upcall threads.  It looks like udpif_synchronize() was written before
the OVS tree had RCU support, which is probably why we didn't use it
here from the beginning.  So we can just change udpif_synchronize()
to a single ovsrcu_synchronize() call.

However, udpif_synchronize() only has a single caller, which calls
ovsrcu_synchronize() anyway just beforehand, via xlate_txn_commit().
So we can get rid of udpif_synchronize() entirely, which this patch
does.

As a side effect, this eliminates one reason why terminating OVS cleanly
clears the datapath flow table.  An upcoming patch will eliminate
other reasons.

Acked-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

lib/stream-windows.c: Grant Access Privilege of Named Pipe to Creator

Current implementation of ovs on windows only allows LocalSystem and
Administrators to access the named pipe created with API of ovs.
Thus any service that needs to invoke the API to create named pipe
has to run as System account to interactive with ovs. It causes the
system more vulnerable if one of those services was break into.
The patch adds the creator owner account to allowed ACLs.

Signed-off-by: Ning Wu <nwu@vmware.com>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
Acked-by: Anand Kumar <kumaranand@vmware.com>
Signed-off-by: Alin Gabriel Serdean <aserdean@ovn.org>

dpif: Fix memory leak while dumping dpif flows.

Leak was detected by running test: "ofproto-dpif - balance-tcp bonding"

Fixes: 0e8f5c6a38d0 ("dpif-netdev: Modified ovs-appctl dpctl/dump-flows command")
Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

tc: handle packet mark of zero

Openstack may set an skb mark of 0 in tunnel rules. This is considered to
be an unused/unset value. However, it prevents the rule from being
offloaded.

Check if the key value of the skb mark is 0 when it is in use (mask is
set to all ones). If it is then ignore the field and continue with TC offload.

Only the exact-match case is covered by this patch as it addresses the
Openstack use-case and seems most robust against feature evolution: f.e. in
future there may exist hardware offload scenarios where an operation, such
as a BPF offload, sets the SKB mark before proceeding tho the in-HW OVS.
datapath.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Co-Authored-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Aaron Conole <aconole@redhat.com>

Prepare for post-2.13.0 (2.13.90).

Acked-by: Gurucharan Shetty <guru@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Prepare for 2.13.0.

Acked-by: Gurucharan Shetty <guru@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Documentation: add notes for TSO & i40e

When using TSO in OVS-DPDK with an i40e device, the following
patch is required for DPDK, which fixes an issue on the TSO path:
https://patches.dpdk.org/patch/64136/
Document this as a limitation until a DPDK release with the fix
included is supported by OVS.

Also, document best known methods for performance tuning when
testing TSO with the tool iperf.

Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support")
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif: Fix leak and usage of uninitialized dp_extra_info.

'dpif_probe_feature'/'revalidate' doesn't free the 'dp_extra_info'
string.  Also, all the implementations of dpif_flow_get() should
initialize the value to avoid printing/freeing of random memory.

30 bytes in 1 blocks are definitely lost in loss record 323 of 889
    at 0x483AD19: realloc (vg_replace_malloc.c:836)
    by 0xDDAD89: xrealloc (util.c:149)
    by 0xCE1609: ds_reserve (dynamic-string.c:63)
    by 0xCE1A90: ds_put_format_valist (dynamic-string.c:161)
    by 0xCE19B9: ds_put_format (dynamic-string.c:142)
    by 0xCCCEA9: dp_netdev_flow_to_dpif_flow (dpif-netdev.c:3170)
    by 0xCCD2DD: dpif_netdev_flow_get (dpif-netdev.c:3278)
    by 0xCCEA0A: dpif_netdev_operate (dpif-netdev.c:3868)
    by 0xCDF81B: dpif_operate (dpif.c:1361)
    by 0xCDEE93: dpif_flow_get (dpif.c:1002)
    by 0xCDECF9: dpif_probe_feature (dpif.c:962)
    by 0xC635D2: check_recirc (ofproto-dpif.c:896)
    by 0xC65C02: check_support (ofproto-dpif.c:1567)
    by 0xC63274: open_dpif_backer (ofproto-dpif.c:818)
    by 0xC65E3E: construct (ofproto-dpif.c:1605)
    by 0xC4D436: ofproto_create (ofproto.c:549)
    by 0xC3931A: bridge_reconfigure (bridge.c:877)
    by 0xC3FEAC: bridge_run (bridge.c:3324)
    by 0xC4551D: main (ovs-vswitchd.c:127)

CC: Emma Finn <emma.finn@intel.com>
Fixes: 0e8f5c6a38d0 ("dpif-netdev: Modified ovs-appctl dpctl/dump-flows command")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Roi Dayan <roid@mellanox.com>

netdev-afxdp: NUMA-aware memory allocation for XSK related memory.

Currently, the AF_XDP socket (XSK) related memory are allocated by main
thread in the main thread's NUMA domain. With the patch that detects
netdev-linux's NUMA node id, the PMD thread of AF_XDP port will be run on
the AF_XDP netdev's NUMA domain. If the net device's NUMA domain
is different from the main thread's NUMA domain, we will have two
cross-NUMA memory accesses (netdev <-> memory, memory <-> CPU).

This patch addresses the aforementioned issue by allocating
the memory in the net device's NUMA domain.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-linux: Detect numa node id.

The patch detects the numa node id from the name of the netdev,
by reading the '/sys/class/net/<devname>/device/numa_node'.
If not available, ex: virtual device, or any error happens,
return numa id 0. Currently only the afxdp netdev type uses it,
other linux netdev types are disabled due to no use case.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

userspace: Add TCP Segmentation Offload support

Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
the network stack to delegate the TCP segmentation to the NIC reducing
the per packet CPU overhead.

A guest using vhostuser interface with TSO enabled can send TCP packets
much bigger than the MTU, which saves CPU cycles normally used to break
the packets down to MTU size and to calculate checksums.

It also saves CPU cycles used to parse multiple packets/headers during
the packet processing inside virtual switch.

If the destination of the packet is another guest in the same host, then
the same big packet can be sent through a vhostuser interface skipping
the segmentation completely. However, if the destination is not local,
the NIC hardware is instructed to do the TCP segmentation and checksum
calculation.

It is recommended to check if NIC hardware supports TSO before enabling
the feature, which is off by default. For additional information please
check the tso.rst document.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Ciara Loftus <ciara.loftus.intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

Typo fix: vswtch -> vswitch.

Acked-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

vhost: Disable multi-segmented buffers

There is no support for multi-segmented buffers, so flag
that to vhost library.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Ciara Loftus <ciara.loftus.intel.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dp-packet: preserve headroom when cloning a pkt batch

The headroom is useful if the packet needs to insert additional
header, so preserve the original headroom when cloning the batch.

Signed-off-by: Flavio Leitner <fbl@sysclose.org>
Tested-by: Ciara Loftus <ciara.loftus.intel.com>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

AUTHORS: Add Emma Finn.

Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Modified ovs-appctl dpctl/dump-flows command

Modified ovs-appctl dpctl/dump-flows command to output
the miniflow bits for a given flow when -m option is passed.

$ ovs-appctl dpctl/dump-flows -m

Signed-off-by: Emma Finn <emma.finn@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

NEWS: Fix location of output actions to be under DPDK.

In the cited commit, the output actions NEWS was mistakenly under OVSDB
instead of DPDK. Fix it.

Fixes: 3c7330ebf036 ("netdev-offload-dpdk: Support offload of output action.")
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

Documentation: Fix building with Python 3.9

open(), io.open(), codecs.open() and fileinput.FileInput no longer accept 'U'
("universal newline") in the file mode.
This flag was deprecated since Python 3.3.
In Python 3, the "universal newline" is used by default when a file is open
in text mode.

Acked-by: Flavio Leitner <fbl@sysclose.org>
Reported-at: https://bugzilla.redhat.com/1791681
Reported-by: Miro Hrončok <mhroncok@redhat.com>
Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovsdb-server: Allow OVSDB clients to specify the UUID for inserted rows.

Acked-by: Han Zhou <hzhou@ovn.org>
Requested-by: Leonid Ryzhyk <lryzhyk@vmware.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-offload-dpdk: Support offload of set TCP/UDP ports actions.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Support offload of set IPv4 actions.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Support offload of set MAC actions.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Support offload of drop action.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Support offload of output action.

Support offload of output action, also configuring count action for
allowing query statistics of HW offloaded flows.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload: Introduce a function to validate same flow api handle.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-dpdk: Getter function for dpdk port id API.

Add a getter function for using the dpdk port id outside the scope of
netdev-dpdk.c to be used for HW offload.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

dpif-netdev: Populate dpif class field in offload struct.

Populate dpif class field in offload struct to be used in offloading
flow put.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

dpif-netdev: Update offloaded flows statistics.

In case a flow is HW offloaded, packets do not reach the SW, thus not
counted for statistics. Use netdev flow get API in order to update the
statistics of flows by the HW statistics.

Co-authored-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Ophir Munk <ophirmu@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

dpctl: Support dump-flows filters "dpdk" and "partially-offloaded".

Flows that are offloaded via DPDK can be partially offloaded (matches
only) or fully offloaded (matches and actions). Set partially offloaded
display to (offloaded=partial, dp:ovs), and fully offloaded to
(offloaded=yes, dp:dpdk). Also support filter types "dpdk" and
"partially-offloaded".

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Implement flow get method.

Implement the flow get method for DPDK, to get the statistics of the
provided ufid, towards reading statistics of fully offloaded flows.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Framework for actions offload.

Currently HW offload is accelerating only the rule matching sequence.
Introduce a framework for offloading rule actions as a pre-step for
processing the rule actions in HW. In case of a failure, fallback to the
legacy partial offload scheme.

Note: a flow will be fully offloaded only if it can process all its
actions in HW.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Return UFID-rte_flow entry in find method.

Change the find method to return the whole entry of UFID-rte_flow
association instead of only the rte_flow field in it, as a pre-step
towards adding and using more fields into that map entry.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-dpdk: Introduce rte flow query function.

Introduce a rte flow query function as a pre-step towards reading HW
statistics of fully offloaded flows.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Improve HW offload flow debuggability.

Add debug prints when creating and destroying rte flows, with all the
flow details (attributes, patterns, actions).

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Dynamically allocate pattern items.

Instead of statically allocated pattern items on the stack, dynamically
allocate only the required items while parsing the matches, to simplify
the parsing and make it self-contained, without need of external types,
making it easier to support more matches in the future.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Refactor action items freeing scheme.

Action item data structures are pointed by rte_flow_action items.
Refactor the code to free the data structures when freeing the
rte_flow_action items, allowing simpler future actions simpler to add to
the code.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

netdev-offload-dpdk: Refactor flow patterns.

Refactor the flow patterns code to a helper function for better
readability and towards supporting more matches.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

dpif-netdev: Make datapath port mutex recursive.

Upcoming HW offloading will request flow statistics from the dpdk
offloading module.  This operation requires holding datapath port
mutex.  However, there is a possible scenario in which flow deletion
happens during datapath reconfiguration process and the mutex already
acquired:

  0  raise () from /lib64/libc.so.6
  1  abort () from /lib64/libc.so.6
  2  ovs_abort_valist ()
  3  ovs_abort ()
  4  ovs_mutex_lock_at ()
  5  dpif_netdev_get_flow_offload_status ()
  6  get_dpif_flow_status ()
  7  flow_del_on_pmd ()
  8  dpif_netdev_flow_del ()
  9  dpif_netdev_operate ()
  10 dpif_operate ()
  11 push_dp_ops ()
  12 push_ukey_ops ()
  13 dp_purge_cb ()
  14 dp_netdev_del_pmd ()
  15 reconfigure_pmd_threads ()
  16 reconfigure_datapath ()
  17 do_del_port ()
  18 dpif_netdev_port_del ()
  19 dpif_port_del ()
  20 port_del ()
  21 ofproto_port_del ()
  22 bridge_delete_or_reconfigure_ports ()
  23 bridge_reconfigure ()
  24 bridge_run ()
  25 main ()

This happens while removing the last port of a particular PMD thread.
Reconfiguration process decides that we need to remove current PMD
thread and calls datapath purge callback in order to clean up resources
assigned to it.  This turns into flow removal and flow_del() tries to
request statistics.

Turning the dp->port_mutex into recursive version as a quick fix for
this issue.  Better solutions might be to avoid statistics request
somehow, or fully disassociate offloaded flows from the datapath flows.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Ian Stokes <ian.stokes@intel.com>

docs: Correct links in mailing-lists.rst

Adds `mailto:` to email address links in mailing-lists.rst. The
existing syntax resulted in broken links of the form
`http://docs.openvswitch.org/en/latest/internals/mailing-lists/<address>`,
which would result in a 404 error.

Signed-off-by: Lars Kellogg-Stedman <lars@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-dpdk: Add new DPDK RFC 4115 egress policer

This patch adds a new policer to the DPDK datapath based on RFC 4115's
Two-Rate, Three-Color marker. It's a two-level hierarchical policer
which first does a color-blind marking of the traffic at the queue
level, followed by a color-aware marking at the port level. At the end
traffic marked as Green or Yellow is forwarded, Red is dropped. For
details on how traffic is marked, see RFC 4115.

This egress policer can be used to limit traffic at different rated
based on the queues the traffic is in. In addition, it can also be used
to prioritize certain traffic over others at a port level.

For example, the following configuration will limit the traffic rate at a
port level to a maximum of 2000 packets a second (64 bytes IPv4 packets).
100pps as CIR (Committed Information Rate) and 1000pps as EIR (Excess
Information Rate). High priority traffic is routed to queue 10, which marks
all traffic as CIR, i.e. Green. All low priority traffic, queue 20, is
marked as EIR, i.e. Yellow.

ovs-vsctl --timeout=5 set port dpdk1 qos=@myqos -- \
  --id=@myqos create qos type=trtcm-policer \
  other-config:cir=52000 other-config:cbs=2048 \
  other-config:eir=52000 other-config:ebs=2048  \
  queues:10=@dpdk1Q10 queues:20=@dpdk1Q20 -- \
  --id=@dpdk1Q10 create queue \
    other-config:cir=41600000 other-config:cbs=2048 \
    other-config:eir=0 other-config:ebs=0 -- \
  --id=@dpdk1Q20 create queue \
    other-config:cir=0 other-config:cbs=0 \
    other-config:eir=41600000 other-config:ebs=2048 \

This configuration accomplishes that the high priority traffic has a
guaranteed bandwidth egressing the ports at CIR (1000pps), but it can also
use the EIR, so a total of 2000pps at max. These additional 1000pps is
shared with the low priority traffic. The low priority traffic can use at
maximum 1000pps.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

netdev-dpdk: Add support for multi-queue QoS to the DPDK datapath

This patch adds support for multi-queue QoS to the DPDK datapath. Most of
the code is based on an earlier patch from a patchset sent out by
zhaozhanxu. The patch was titled "[ovs-dev, v2, 1/4] netdev-dpdk.c: Support
the multi-queue QoS configuration for dpdk datapath"

Signed-off-by: zhaozhanxu <zhaozhanxu@163.com>
Co-authored-by: zhaozhanxu <zhaozhanxu@163.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

netdev-offload-tc: Fix crash if offloading is not configured on outdev.

If output device is not yet added to netdev-offload, netdev_ports_get()
will not find it leading to NULL pointer dereference inside
netdev_get_ifindex().

Fixes: 8f283af89298 ("netdev-tc-offloads: Implement netdev flow put using tc interface")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>

socket-util: Emulate recvmmsg() and sendmmsg() on Linux only.

These functions failed to build on OS X because MSG_WAITFORONE is not
defined there. There are pitfalls for trying to define our own MSG_*
constants, since it's hard to pick a constant that is not used by the
system already. Because OVS only uses recvmmsg() and sendmmsg() on
Linux, it seems easiest to just emulate them there.

Reported-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

odp-util: Fix passing uninitialized bytes in OVS_KEY_ATTR_CT_ORIG_TUPLE_IPV*.

Both ovs_key_ct_tuple_ipv* structures contains padding at the end
that mast be cleared before passing attributes to kernel:

Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
    at 0x566A607: sendmsg (sendmsg.c:28)
    by 0xFC95CE: nl_sock_transact_multiple__ (netlink-socket.c:858)
    by 0xFC8580: nl_sock_transact_multiple (netlink-socket.c:1079)
    by 0xFC83FF: nl_transact_multiple (netlink-socket.c:1839)
    by 0xFA8648: dpif_netlink_operate__ (dpif-netlink.c:1926)
    by 0xFA789F: dpif_netlink_operate_chunks (dpif-netlink.c:2219)
    by 0xFA25CB: dpif_netlink_operate (dpif-netlink.c:2278)
    by 0xE5BB4C: dpif_operate (dpif.c:1377)
    by 0xE5B7F6: dpif_flow_put (dpif.c:1048)
    by 0xE5B49A: dpif_probe_feature (dpif.c:965)
    by 0xDD6BF5: check_ct_orig_tuple (ofproto-dpif.c:1557)
    by 0xDD41EC: check_support (ofproto-dpif.c:1590)
    by 0xDD3BF3: open_dpif_backer (ofproto-dpif.c:818)
    by 0xDC8467: construct (ofproto-dpif.c:1605)
    by 0xDAD6BB: ofproto_create (ofproto.c:549)
    by 0xD96A19: bridge_reconfigure (bridge.c:877)
    by 0xD9625D: bridge_run (bridge.c:3324)
    by 0xDA5829: main (ovs-vswitchd.c:127)
  Address 0x1ffefe36a5 is on thread 1's stack
  in frame #4, created by dpif_netlink_operate__ (dpif-netlink.c:1839)
  Uninitialised value was created by a stack allocation
    at 0xEB87D0: odp_flow_key_from_flow__ (odp-util.c:5996)

Fixes: daf4d3c18da4 ("odp: Support conntrack orig tuple key.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Ben Pfaff <blp@ovn.org>

ovs-tcpdump: Fallback to read /proc/net/dev on Linux

Currently, ovs-tcpdump uses python3-netifaces in order to get the list of
the network interfaces, but python3-netifaces may not be installed nor
available on some distributions (for example on RHEL7).

This commit adds, only for Linux, an alternative way (that is only used
when netifaces is not available) to read the list of the network interfaces
by reading "/proc/net/dev".

Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-dpdk: Avoid undefined behavior processing devargs

In "Use of library functions" in the C standard, the following statement
is written to apply to all library functions:

  If an argument to a function has an invalid value (such as ... a
  null pointer ... the behavior is undefined.

Later, under the "String handling" section, "Comparison functions" no
exception is listed for strcmp, which means NULL is invalid.  It may
be possible for the smap_get to return NULL.

Given the above, we must check that new_devargs is not null.  The check
against NULL for new_devargs later in the function is still valid.

Fixes: 55e075e65ef9 ("netdev-dpdk: Arbitrary 'dpdk' port naming")
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

Use batch process recv for tap and raw socket in netdev datapath

Current netdev_linux_rxq_recv_tap and netdev_linux_rxq_recv_sock
just receive single packet, that is very inefficient, per my test
case which adds two tap ports or veth ports into OVS bridge
(datapath_type=netdev) and use iperf3 to do performance test
between two ports (they are set into different network name space).

The result is as below:

  tap:  295 Mbits/sec
  veth: 207 Mbits/sec

After I change netdev_linux_rxq_recv_tap and
netdev_linux_rxq_recv_sock to use batch process, the performance
is boosted by about 7 times, here is the result:

  tap:  1.96 Gbits/sec
  veth: 1.47 Gbits/sec

Undoubtedly this is a huge improvement although it can't match
OVS kernel datapath yet.

FYI: here is thr result for OVS kernel datapath:

  tap:  37.2 Gbits/sec
  veth: 36.3 Gbits/sec

Note: performance result is highly related with your test machine,
you shouldn't expect the same results on your test machine.

Signed-off-by: Yi Yang <yangyi01@inspur.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

socket-util: Introduce emulation and wrapper for recvmmsg().

Not every system will have recvmmsg(), so introduce compatibility code
that will allow it to be used blindly from the rest of the tree.

This assumes that recvmmsg() and sendmmsg() are either both present or
both absent in system libraries and headers.

CC: Yi Yang <yangyi01@inspur.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

acinclude: Use RTE_IBVERBS_LINK_DLOPEN

On DPDK 19.11 RTE_IBVERBS_LINK_DLOPEN is used by Mellanox PMDs (mlx4 and
mlx5) instead of RTE_LIBRTE_MLX{4,5}_DLOPEN_DEPS.

Without this commit is not possible to statically link OVS with DPDK when MLX4
or MLX5 PMDs are enabled.

Signed-off-by: Timothy Redaelli <tredaelli@redhat.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

AUTHORS: update email for Lance Richardson

Update email address for Lance Richardson.

Signed-off-by: Lance Richardson <lance.richardson@broadcom.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>