git.proxmox.com Git - mirror

upcall: Configure datapath min-revalidate-pps through ovs-vsctl.

This patch adds a new configuration option, "min-revalidate-pps" to the
Open_vSwitch "other-config" column. This sets minimum pps that flow must
have in order to be revalidated when revalidation duration exceeds half of
max-revalidator config variable.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

upcall: Change should_revalidate to use max-revalidator value

Revalidate if dump duration was longer than half of max-revalidator
timeout, instead of hardcoded 200msec value.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

upcall: Configure datapath max-revalidator through ovs-vsctl.

This patch adds a new configuration option, "max-revalidator" to the
Open_vSwitch "other-config" column. This sets maximum allowed ravalidator
timeout. Actual timeout value is determined at runtime as minimum of
"max-idle" and "max-revalidator".

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovsdb monitor: Fix crash when using non-zero last-id with standalone DB.

When a client uses monitor-cond-since with a non-zero last-id but the
server is not in cluster mode for the DB being monitored, it leads to
segmentation fault because the txn_history list is not initialized in
this case.

Program terminated with signal SIGSEGV, Segmentation fault.
1536 struct ovsdb_txn *txn = h_node->txn;
(gdb) bt
0 ovsdb_monitor_get_changes_after (txn_uuid=txn_uuid@entry=0x7ffe8605b7e0, dbmon=0x17c1b40, p_mcs=p_mcs@entry=0x17c4900) at ovsdb/monitor.c:1536
1 0x000000000040da2d in ovsdb_jsonrpc_monitor_create (request_id=0x1804630, version=<optimized out>, params=0x17ad330, db=0x18015b0, s=<optimized out>) at ovsdb/jsonrpc-server.c:1469
2 ovsdb_jsonrpc_session_got_request (request=0x17ad520, s=<optimized out>) at ovsdb/jsonrpc-server.c:1002
3 ovsdb_jsonrpc_session_run (s=<optimized out>) at ovsdb/jsonrpc-server.c:556
...

Although it doesn't happen in normal use cases, no one can prevent a
client to send this on purpose or in a corner case when a client firstly
connected to a clustered DB but later the server restarted with a
non-clustered DB.

This patch fixes it by always initialize the txn_history list to avoid
the undefined behavior in this case. It adds a test case to cover it, too.

Fixes: 695e815 ("ovsdb-server: Transaction history tracking.")
Reported-by: Aliasgar Ginwala <aginwala@ebay.com>
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovsdb raft: Support leader election time change online.

A new unixctl command cluster/change-election-timer is implemented to
change leader election timeout base value according to the scale needs.

The change takes effect upon consensus of the cluster, implemented through
the append-request RPC. A new field "election-timer" is added to raft log
entry for this purpose.

Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft.c: Set candidate_retrying if no leader elected since last election.

candiate_retrying is used to determine if the current node is disconnected
from the cluster when the node is in candiate role. However, a node
can flap between candidate and follower role before a leader is elected
when majority of the cluster is down, so is_connected() will flap, too, which
confuses clients.

This patch avoids the flapping with the help of a new member had_leader,
so that if no leader was elected since last election, we know we are
still retrying, and keep as disconnected from the cluster.

Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft.c: Stale leader should disconnect from cluster.

As mentioned in RAFT paper, section 6.2:

Leaders: A server might be in the leader state, but if it isn’t the current
leader, it could be needlessly delaying client requests. For example, suppose a
leader is partitioned from the rest of the cluster, but it can still
communicate with a particular client. Without additional mechanism, it could
delay a request from that client forever, being unable to replicate a log entry
to any other servers. Meanwhile, there might be another leader of a newer term
that is able to communicate with a majority of the cluster and would be able to
commit the client’s request. Thus, a leader in Raft steps down if an election
timeout elapses without a successful round of heartbeats to a majority of its
cluster; this allows clients to retry their requests with another server.

Reported-by: Aliasgar Ginwala <aginwala@ebay.com>
Tested-by: Aliasgar Ginwala <aginwala@ebay.com>
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovsdb-idl.c: Allows retry even when using a single remote.

When clustered mode is used, the client needs to retry connecting
to new servers when certain failures happen. Today it is allowed to
retry new connection only if multiple remotes are used, which prevents
using LB VIP with clustered nodes. This patch makes sure the retry
logic works when using LB VIP: although same IP is used for retrying,
the LB can actually redirect the connection to a new node.

Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

raft: Move raft_reset_ping_timer() out of the loop.

Fixes: commit 5a9b53a5 ("ovsdb raft: Fix duplicated transaction execution when leader failover.")
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

sat-math: Do not use __builtin_s*_overflow() with sparse.

Some versions of sparse do not understand __builtin_saddll_overflow() and
related GCC builtins for calculations with overflow detection. This patch
avoids using them when sparse is in use.

Reported-by: Justin Pettit <jpettit@ovn.org>
Tested-by: Justin Pettit <jpettit@ovn.org>
Acked-by: Justin Pettit <jpettit@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovs-lib: Fix standalone db migration to raft

Current code of create-cluster from standalone db takes backup of existing
standalone db and then generates a new clustered dbs from backup dbs. Hence,
during migration if nb and sb dbs are still present, create-cluster will fail
saying file exists and will not really convert dbs to clustered dbs. This
patch fixes the same.

e.g message that pops up while migration from standalone to raft cluster:
* Backing up database to /etc/openvswitch/ovnnb_db.db.backup5.13.0-1278623084
ovsdb-tool: I/O error: /etc/openvswitch/ovnnb_db.db: create failed (File exists)
* Creating cluster database /etc/openvswitch/ovnnb_db.db from existing one

Signed-off-by: aginwala <aginwala@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

datapath-windows: Fix updating ct label when mask is specified

When an existing label needs to be changed by specifing bits to be
updated using mask, instead of updating only the masked bits,
new label was getting overridden. This patch fixes this issue.

Signed-off-by: Anand Kumar <kumaranand@vmware.com>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
Signed-off-by: Alin Gabriel Serdean <aserdean@ovn.org>

travis: Drop OSX workarounds.

TravisCI currently uses xcode9.4 as a default image and it
it has good version of libtool out-of-the-box.
Removing these workarounds saves 4-6 minutes of OSX build.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Aaron Conole <aconole@redhat.com>

travis: Combine kernel builds.

Single kernel build job takes ~3 minutes in average. Most of
this time takes VM spawning and initial configuration.
Combining these 24 jobs in 4 allows us to better utilize workers
and not waste time on spawning VMs.

Before:

 24 jobs * 3 minutes = 72 minutes
 72 minutes / 5 workers = 14.4 minutes / worker

After:

 4 jobs * 10 minutes = 40 minutes
 40 minutes / 4 workers = 10 minutes / worker
 + 1 free worker that able to run other jobs at the same time.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Aaron Conole <aconole@redhat.com>

travis: Cache DPDK build.

This change enables cache for DPDK build directory, so we'll never
build same version of DPDK again. This speeds up each DPDK related
job by 4-6 minutes effectively saving 30-50 minutes of the total time.
Ex. Full TravisCI run on 'trusty' images:

 Without cache:
 Ran for 1 hr 9 min 29 sec
 Total time 4 hrs 55 min 13 sec

 With populated cache:
 Ran for 1 hr 2 min 18 sec
 Total time 4 hrs 20 min 9 sec

Saved:
 Real time: ~7 minutes.
 Total worker time: ~35 minutes.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Aaron Conole <aconole@redhat.com>

Switch from internal port to all ports defined

This patch changes the way we try to figure out if a port is defined on a given switch.

Instead of looking only in the internal ports defined switch to all ports defined.

This caused issues when trying to add a Hyper-V container port to a given OVS bridge.

Reported-by: Danting Liu <dantingl@vmware.com>
Signed-off-by: Alin Gabriel Serdean <aserdean@ovn.org>
Acked-by: Anand Kumar <kumaranand@vmware.com>

checkpatch: Fix regexp for if, while, etc inside macros.

This allows to use a one-character expression inside the 'if'
statement and multiple spaces before the line continuation character.

Fixes false positive in case like this:

 #define MACRO(ARG) \
 if (a) { \
 do_work(ARG); \
 }

Fixes: 16770c6d9179 ("checkpatch: support macro continuation")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Aaron Conole <aconole@redhat.com>

netdev-afxdp: fix corner case where umem entries were not released

If for some reason the last element in the batch was already pushed on
the stack, none of the elements where pushed. This was leading to
buffer starvation.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

ovs-tc: offload MPLS set actions to TC datapath

Recent modifications to TC allows the modifying of fields within the
outermost MPLS header of a packet. OvS datapath rules impliment an MPLS
set action by supplying a new MPLS header that should overwrite the
current one.

Convert the OvS datapath MPLS set action to a TC modify action and allow
such rules to be offloaded to a TC datapath.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>

ovs-tc: offload MPLS push actions to TC datapath

TC can now be used to push an MPLS header onto a packet. The MPLS label is
the only information that needs to be passed here with the rest reverting
to default values if none are supplied. OvS, however, gives the entire
MPLS header to be pushed along with the MPLS protocol to use. TC can
optionally accept these values so can be made replicate the OvS datapath
rule.

Convert OvS MPLS push datapath rules to TC format and offload to a TC
datapath.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>

ovs-tc: offload MPLS pop actions to TC datapath

TC now supports an action to pop the outer MPLS header from a packet. The
next protocol after the header is required alongside this. Currently, OvS
datapath rules also supply this information.

Offload OvS MPLS pop actions to TC along with the next protocol.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>

compat: add compatibility headers for tc mpls action

OvS includes compat code for several TC actions including vlan, mirred and
tunnel key. MPLS actions have recently been added to TC in the kernel. In
preparation for adding TC offload code for MPLS, add the MPLS compat code.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>

dpif-netlink: Allow offloading of flows with dl_type 0x1234.

'dpif_probe_feature()' always has DPIF_FP_PROBE flag set. Other probing
code uses dpif_execute() with DPIF_OP_EXECUTE, hence never calls
parse_flow_put().
Thus, this 'if' statement is wrong and should be removed as it only
forbids offloading of the real legitimate flows with dl_type 0x1234.
Dummy flows never reach this code.

CC: Paul Blakey <paulb@mellanox.com>
Fixes: 8b668ee3f0cc ("dpif-netlink: Use netdev flow put api to insert a flow")
Reported-by: Eli Britstein <elibr@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

netdev-afxdp: Error when no XDP program loaded.

netdev-afxdp requires XDP program to be loaded. When prog_id == 0,
it indicates no XDP program, so return error and free resources.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

ovsdb-data: Don't put strings with digits in quotes.

No need to use quotes for strings like "br0".

Keeping UUIDs always in quotes to avoid different treatment of those
that starts with digits and those that starts with letters.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>

netdev-afxdp: Convert AFXDP_DEBUG to custom stats.

These are valid statistics of a network interface and should be
exposed via custom stats.

The same MACRO trick as in vswitchd/bridge.c is used to reduce code
duplication and easily add new stats if necessary in the future.

Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

netdev-afxdp: Fix use of unconfigured device.

In case of failure of 'xsk_configure_all()', 'n_rxq' and 'xdpmode'
will remain in a new state. This will result in successful
reconfiguration (immediate return, because configuration is already
applied) if 'netdev_reconfigure()' will be called again.

Same issue was fixed previously for netdev-dpdk using 'dev->started'
flag in commit:
606f66507250 ("netdev-dpdk: Don't use PMD driver if not configured successfully")

Let's use similar approach with checking the 'dev->xsks' which only
exists if configuration was successful.

Additionally implemented 'netdev_afxdp_construct()' function to
explicitly initialize all the specific fields and request the
reconfiguration.

CC: William Tu <u9012063@gmail.com>
Fixes: 0de1b425962d ("netdev-afxdp: add new netdev type for AF_XDP.")
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

Prepare for post-2.12.0 (2.12.90).

Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Prepare for 2.12.0.

Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

test: Fix fragment-related tests that fail on 4.19+ due to small-sized packets

These fragment-related tests are failing on later kernels (4.19.x)
because kernel quietly drops any packet fragment that is not the last
but has a size smaller than IPV6_MIN_MTU. This patch fixes
them by increasing their sizes to IPV6_MIN_MTU.

Reviewed-by: Darrell Ball <dlu998@gmail.com>
Reivewed-at: https://github.com/openvswitch/ovs/pull/278
Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

tnl-neigh-cache: Purge learnt neighbors when port/bridge is deleted

Say an ARP entry is learnt on a OVS port and when such a port is deleted,
learnt entry should be removed from the port. It would have be aged out after
ARP ageout time. This code will clean up immediately.

Added test case(tunnel - neighbor entry add and deletion) in tunnel.at, to
verify neighbors are added and removed on deletion of a ports and bridges.

Discussion for this addition is at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048754.html

Signed-off-by: Vasu Dasari <vdasari@gmail.com>
Reviewed-by: Flavio Fernandes <flavio@flaviof.com>
Reviewed-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

system-traffic: Make nsh test more robust.

The patch adds '-n' to tcpdump to avoid address coverting. Add '-l' for rhel8
to avoid buffering. Since '-U' is used to output to stdout, simply use 'cat'
to search result. Use OVS_WAIT_UNTIL instead of sleep, and also remove/add
some newlines. Finally, move tcpdump captured interface into the namespace,
(capture p1 instead of ovs-p1), and tested using af_xdp.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

netdev-afxdp: add new netdev type for AF_XDP.

The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology. It is aims to have comparable
performance to DPDK but cooperate better with existing kernel's networking
stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, by-passing a couple of Linux kernel's subsystems
As a result, AF_XDP socket shows much better performance than AF_PACKET
For more details about AF_XDP, please see linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this feature is
not compiled in.

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

ovs-thread: Add pthread spin lock support.

The patch adds the basic spin lock functions:
ovs_spin_{lock, try_lock, unlock, init, destroy}.

Signed-off-by: William Tu <u9012063@gmail.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>

AUTHORS: Add Harry Van Haaren.

Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Add specialized generic scalar functions

This commit adds a number of specialized functions, that handle
common miniflow fingerprints. This enables compiler optimization,
resulting in higher performance. Below a quick description of
how this optimization actually works;

"Specialized functions" are "instances" of the generic implementation,
but the compiler is given extra context when compiling. In the case of
iterating miniflow datastructures, the most interesting value to enable
compile time optimizations is the loop trip count per unit.

In order to create a specialized function, there is a generic implementation,
which uses a for() loop without the compiler knowing the loop trip count at
compile time. The loop trip count is passed in as an argument to the function:

uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
{
 for(uint32_t i = 0; i < loop_count; i++)
 // do work
}

In order to "specialize" the function, we call the generic implementation
with hard-coded numbers - these are compile time constants!

uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
{
 // use hard coded constant for compile-time constant-propogation
 return miniflow_impl_generic(mf, 5);
}

Given the compiler is aware of the loop trip count at compile time,
it can perform an optimization known as "constant propogation". Combined
with inlining of the miniflow_impl_generic() function, the compiler is
now enabled to *compile time* unroll the loop 5x, and produce "flat" code.

The last step to using the specialized functions is to utilize a
function-pointer to choose the specialized (or generic) implementation.
The selection of the function pointer is performed at subtable creation
time, when miniflow fingerprint of the subtable is known. This technique
is known as "multiple dispatch" in some literature, as it uses multiple
items of information (miniflow bit counts) to select the dispatch function.

By pointing the function pointer at the optimized implementation, OvS
benefits from the compile time optimizations at runtime.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Refactor generic implementation

This commit refactors the generic implementation. The
goal of this refactor is to simplify the code to enable
"specialization" of the functions at compile time.

Given compile-time optimizations, the compiler is able
to unroll loops, and create optimized code sequences due
to compile time knowledge of loop-trip counts.

In order to enable these compiler optimizations, we must
refactor the code to pass the loop-trip counts to functions
as compile time constants.

This patch allows the number of miniflow-bits set per "unit"
in the miniflow to be passed around as a function argument.

Note that this patch does NOT yet take advantage of doing so,
this is only a refactor to enable it in the next patches.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Split out generic lookup function

This commit splits the generic hash-lookup-verify function to
its own file, for cleaner seperation between optimized versions.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Move dpcls lookup structures to .h

This commit moves some data-structures to be available
in the dpif-netdev-private.h header. This allows specific
implementations of the subtable lookup function to include
just that header file, and not require that the code exists
in dpif-netdev.c

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Implement function pointers/subtable

This allows plugging-in of different subtable hash-lookup-verify
routines, and allows special casing of those functions based on
known context (eg: # of bits set) of the specific subtable.

Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com>
Tested-by: Malvika Gupta <malvika.gupta@arm.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

ovsdb-idl: Mark row "parsed" in ovsdb_idl_txn_write__

Once a column is set in a row using ovsdb_idl_txn_write__ we also mark
the row "parsed". Otherwise the memory allocated by
sbrec_<table>_parse_<col>() will never be freed. After marking the row
"parsed", the ovsdb_idl_txn_disassemble function will properly free the
newly allocated memory for the column (ovsdb_idl_row_unparse).

The problem is present only for rows that are inserted by the
running application because rows that are loaded from the database
will always have row->parsed == true.

One way to detect the leak is to start northd with valgrind:

valgrind --tool=memcheck --leak-check=yes ./ovn-northd

Then create a logical switch and bind a logical port to it:

ovn-nbctl ls-add ls1
ovn-nbctl lsp-add ls1 ls1-vm1

The valgrind report:

==9270== Memcheck, a memory error detector
==9270== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9270== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright
info
==9270== Command: ./ovn-northd
==9270==

<snip>

==9270==
==9270== 8 bytes in 1 blocks are definitely lost in loss record 30 of
292
==9270== at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==9270== by 0x4D31EF: xmalloc (util.c:138)
==9270== by 0x45CB8E: sbrec_multicast_group_parse_ports (ovn-sb-idl.c:18141)
==9270== by 0x4BB12D: ovsdb_idl_txn_write__ (ovsdb-idl.c:4489)
==9270== by 0x4BB1B5: ovsdb_idl_txn_write (ovsdb-idl.c:4527)
==9270== by 0x45D167: sbrec_multicast_group_set_ports (ovn-sb-idl.c:18561)
==9270== by 0x40F416: ovn_multicast_update_sbrec (ovn-northd.c:2947)
==9270== by 0x41FC55: build_lflows (ovn-northd.c:7981)
==9270== by 0x421830: ovnnb_db_run (ovn-northd.c:8522)
==9270== by 0x422B2D: ovn_db_run (ovn-northd.c:9089)
==9270== by 0x423909: main (ovn-northd.c:9409)
==9270==
==9270== 157 (32 direct, 125 indirect) bytes in 1 blocks are definitely
lost in loss record 199 of 292
==9270== at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==9270== by 0x4D31EF: xmalloc (util.c:138)
==9270== by 0x471E3D: resize (hmap.c:100)
==9270== by 0x4720C8: hmap_expand_at (hmap.c:175)
==9270== by 0x4C74F1: hmap_insert_at (hmap.h:277)
==9270== by 0x4C825A: smap_add__ (smap.c:392)
==9270== by 0x4C7783: smap_add (smap.c:55)
==9270== by 0x451054: sbrec_datapath_binding_parse_external_ids (ovn-sb-idl.c:7181)
==9270== by 0x4BB12D: ovsdb_idl_txn_write__ (ovsdb-idl.c:4489)
==9270== by 0x4BB1B5: ovsdb_idl_txn_write (ovsdb-idl.c:4527)
==9270== by 0x451436: sbrec_datapath_binding_set_external_ids (ovn-sb-idl.c:7444)
==9270== by 0x4090F1: ovn_datapath_update_external_ids (ovn-northd.c:817)
==9270==

<snip>

==9270==
==9270== LEAK SUMMARY:
==9270== definitely lost: 1,322 bytes in 47 blocks
==9270== indirectly lost: 4,653 bytes in 240 blocks
==9270== possibly lost: 0 bytes in 0 blocks
==9270== still reachable: 254,004 bytes in 7,780 blocks
==9270== suppressed: 0 bytes in 0 blocks
==9270== Reachable blocks (those to which a pointer was found) are not
shown.
==9270== To see them, rerun with: --leak-check=full
--show-leak-kinds=all
==9270==
==9270== For counts of detected and suppressed errors, rerun with: -v
==9270== ERROR SUMMARY: 9 errors from 9 contexts (suppressed: 0 from 0)

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

doc: Remove experimental tag for SMC cache.

SMC cache was introduced in 2.10 with experimental tag.
SMC cache is a layer of software cache located after EMC
cache. The purpose is to improve the performance of use
cases that many flows missing the EMC cache.

One can enable SMC cache using smc-enable=true option.

Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

flow: Avoid unsafe comparison of minimasks.

The following, run inside the OVS sandbox, caused OVS to abort when
Address Sanitizer was used:

 ovs-vsctl add-br br-int
 ovs-ofctl add-flow br-int "table=0,cookie=0x1234,priority=10000,icmp,actions=drop"
 ovs-ofctl --strict del-flows br-int "table=0,cookie=0x1234/-1,priority=10000"

Sample report from Address Sanitizer:

==3029==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x603000043260 at pc 0x7f6b09c2459b bp 0x7ffcb67e7540 sp 0x7ffcb67e6cf0
READ of size 40 at 0x603000043260 thread T0
 #0 0x7f6b09c2459a (/lib/x86_64-linux-gnu/libasan.so.5+0xb859a)
 #1 0x565110a748a5 in minimask_equal ../lib/flow.c:3510
 #2 0x565110a9ea41 in minimatch_equal ../lib/match.c:1821
 #3 0x56511091e864 in collect_rules_strict ../ofproto/ofproto.c:4516
 #4 0x56511093d526 in delete_flow_start_strict ../ofproto/ofproto.c:5959
 #5 0x56511093d526 in ofproto_flow_mod_start ../ofproto/ofproto.c:7949
 #6 0x56511093d77b in handle_flow_mod__ ../ofproto/ofproto.c:6122
 #7 0x56511093db71 in handle_flow_mod ../ofproto/ofproto.c:6099
 #8 0x5651109407f6 in handle_single_part_openflow ../ofproto/ofproto.c:8406
 #9 0x5651109407f6 in handle_openflow ../ofproto/ofproto.c:8587
 #10 0x5651109e40da in ofconn_run ../ofproto/connmgr.c:1318
 #11 0x5651109e40da in connmgr_run ../ofproto/connmgr.c:355
 #12 0x56511092b129 in ofproto_run ../ofproto/ofproto.c:1826
 #13 0x5651108f23cd in bridge_run__ ../vswitchd/bridge.c:2965
 #14 0x565110904887 in bridge_run ../vswitchd/bridge.c:3023
 #15 0x5651108e659c in main ../vswitchd/ovs-vswitchd.c:127
 #16 0x7f6b093b709a in __libc_start_main ../csu/libc-start.c:308
 #17 0x5651108e9009 in _start (/home/blp/nicira/ovs/_build/vswitchd/ovs-vswitchd+0x11d009)

This fixes the problem, which although largely theoretical could crop up
with odd implementations of memcmp(), perhaps ones optimized in various
"clever" ways. All in all, it seems best to avoid the theoretical problem.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

AUTHORS: Add Michele Baldessari.

Signed-off-by: Ben Pfaff <blp@ovn.org>

OVN resource agent - make promotion synchronous

Currently inside the ovsdb_server_promote() function we call 'promote_ovnnb'
and 'promote_ovnsb' and then just record the new master state in the
CIB.

This creates a race because those two promote commands are asynchronous
so when we exit the ovsdb_server_promote() function the underlying DBs
are not guaranteed to be in master state. That means that clients might
connect to an instance that is in read-only mode.

We add a simple sleep loop where we wait for the underlying DB state to
confirm the master state. We do not need to add a timeout loop because
in case of an issue the resource timeout set within pacemaker will kick
in and the resource agent script will be killed by pacemaker.

Tested this within an openstack environment using ovn with roughly ~20
reboots and was unable to trigger the issue (before the patch we would
trigger the issue after a couple of reboots tops).

Acked-by: Numan Siddique <nusiddiq@redhat.com>
Acked-By: Daniel Alvarez <dalvarez@redhat.com>
Signed-off-by: Michele Baldessari <michele@acksyn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ofp-flow: Improve error message when cookie cannot be set.

The "cookie" value has two meanings in "ovs-ofctl add-flow", etc. With
a mask, it indicates a match; without a mask, it indicates that the
cookie should be set. In some cases, the cookie cannot be set, which may
mean that the user meant to indicate a match. The error message for this
case was poor; this improves it.

Suggested-by: "Yi Yang (杨燚)-云服务集团" <yangyi01@inspur.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

NEWS: Mention IGMP support for OVN

NEWS update was missed while adding the IGMP code.

Fixes: 605535f9adf2 ("OVN: Add ovn-northd IGMP support")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

odp-util: Fix NSH mask parsing.

'odp_flow_key_to_flow__' is used to parse both keys and masks.
However, 'odp_nsh_key_from_attr' expects only 'key' as an argument
and fails to parse masks with ODP_FIT_ERROR which causes userspace
system tests failures:

# make check-system-userspace TESTSUITEFLAGS='-k nsh'

|odp_util(revalidator)|WARN|
OVS_NSH_KEY_ATTR_MD1 present but declared mdtype 0 is not 1 (NSH_M_TYPE1)

|odp_util(revalidator)|WARN|
the flow mask in error is:
 <...> nsh(flags=0ttl=0,mdtype=0,np=255,spi=0x0,si=0),
for the following flow key:
 <...> nsh_ttl=8,nsh_mdtype=1,nsh_np=3,nsh_spi=0x64,nsh_si=3,
 nsh_c1=0x1020304,nsh_c2=0x5060708,nsh_c3=0x90a0b0c,nsh_c4=0xd0e0f10
 <...>

Fix that by passing the additional argument 'is_mask' to make it be
like all other parsing functions.

Additionally fixed missing comma in the 'format_nsh_key'.

CC: Yi Yang <yi.y.yang@intel.com>
Fixes: f59cb331c481 ("nsh: rework NSH netlink keys and actions")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: William Tu <u9012063@gmail.com>

flow: Wildcard UDP ports when using SYMMETRIC_L4 hash for select groups.

UDP source and destination ports are not used to derive the hash index
used for selecting the bucket in case of SYMMETRIC_L4 hash based select
groups. However, they are un-wildcarded in the megaflow entry match criteria.
This results in distinct megaflow entry being created for each pair of UDP
source and destination ports unnecessarily and causes significant performance
deterioration when the megaflow cache limit is reached.

This patch wildcards UDP ports when using select group with SYMMETRIC_L4
hash function.

Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>
CC: Jan Scheurich <jan.scheurich@ericsson.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

OVN: Add ovn-northd IGMP support

New IP Multicast Snooping Options are added to the Northbound DB
Logical_Switch:other_config column. These allow enabling IGMP snooping and
querier on the logical switch and get translated by ovn-northd to rows in
the IP_Multicast Southbound DB table.

ovn-northd monitors for changes done by ovn-controllers in the Southbound DB
IGMP_Group table. Based on the entries in IGMP_Group ovn-northd creates
Multicast_Group entries in the Southbound DB, one per IGMP_Group address X,
containing the list of logical switch ports (aggregated from all controllers)
that have IGMP_Group entries for that datapath and address X. ovn-northd
also creates a logical flow that matches on IP multicast traffic destined
to address X and outputs it on the tunnel key of the corresponding
Multicast_Group entry.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

OVN: Add IGMP SB definitions and ovn-controller support

A new IP_Multicast table is added to Southbound DB. This table stores the
multicast related configuration for each datapath. Each row will be
populated by ovn-northd and will control:
- if IGMP Snooping is enabled or not, the snooping table size and multicast
 group idle timeout.
- if IGMP Querier is enabled or not (only if snooping is enabled too), query
 interval, query source addresses (Ethernet and IP) and the max-response
 field to be stored in outgoing queries.
- an additional "seq_no" column is added such that ovn-sbctl or if needed a
 CMS can flush currently learned groups. This can be achieved by incrementing
 the "seq_no" value.

A new IGMP_Group table is added to Southbound DB. This table stores all the
multicast groups learned by ovn-controllers. The table is indexed by
datapath, group address and chassis. For a learned multicast group on a
specific datapath each ovn-controller will store its own row in this table.
Each row contains the list of chassis-local ports on which the group was
learned. Rows in the IGMP_Group table are updated or deleted only by the
ovn-controllers that created them.

A new action ("igmp") is added to punt IGMP packets on a specific logical
switch datapath to ovn-controller if IGMP snooping is enabled.

Per datapath IGMP multicast snooping support is added to pinctrl:
- incoming IGMP reports are processed and multicast groups are maintained
 (using the OVS mcast-snooping library).
- each OVN controller syncs its in-memory IGMP groups to the Southbound DB
 in the IGMP_Group table.
- pinctrl also sends periodic IGMPv3 general queries for all datapaths where
 querier is enabled.

Signed-off-by: Mark Michelson <mmichels@redhat.com>
Co-authored-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

packets: Add IGMPv3 query packet definitions

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-northd: Fix the ovn-northd continuous looping

ovn-northd wakes up continuously from poll_block(). This issue can be reproduced
in the sandbox with the below commands

ovn-nbctl lr-add lr0
ovn-nbctl ls-add public
ovn-nbctl lrp-add lr0 lr0-public 00:00:20:20:12:13 172.168.0.100/24
ovn-nbctl lsp-add public public-lr0
ovn-nbctl lsp-set-type public-lr0 router
ovn-nbctl lsp-set-addresses public-lr0 router
ovn-nbctl lsp-set-options public-lr0 router-port=lr0-public
ovn-nbctl lrp-set-gateway-chassis lr0-public chassis-1 20

This issue is seen after the commit [1], which makes use of the function -
sbrec_port_binding_update_nat_addresses_addvalue() to add a value to
Port_Binding.nat_addresses column.

Looks like the IDL client code is sending the transactions to the ovsdb-server repeatedly
to update the Port_Binding.nat_addresses even though the Southbound DB has updated
the column when this function is used. The actual bug seems to be in the IDL client code
and that needs to be fixed. This patch as a quick fix, fixes ovn-northd's continuous loop
by not using this function, instead making use of sbrec_port_binding_set_nat_addresses().

The below messages are seen continuously when the ovn-nortdh debug logs are enabled.

****

2019-07-12T17:26:13.837Z|74512|jsonrpc|DBG|unix:sb1.ovsdb: received reply,
result=[{},{"count":1},{"count":1}], id=18628
2019-07-12T17:26:13.837Z|74513|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74514|jsonrpc|DBG|unix:sb1.ovsdb: send request,
method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":
{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses",
"insert",["set",["00:00:20:20:12:13 172.168.0.100 is_chassis_resident(\"cr-lr0-public\")"]]]],
"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18629

2019-07-12T17:26:13.837Z|74516|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18629
2019-07-12T17:26:13.837Z|74517|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74518|jsonrpc|DBG|unix:sb1.ovsdb: send request,
method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],
"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},
{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13 172.168.0.100
is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid",
"56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18630
2019-07-12T17:26:13.837Z|74520|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18630
******

The OpenStack CI tests for networking-ovn is frequently failing few tests after this
commit. The failure seems to be related to timing issues as ovn-northd is hogging
the CPU continuously. We are also seeing travis CI test failures after this commit.

[1] - ed198fb3b92e

Fixes: ed198fb3b92e("ovn: Send GARP for the router ports with reside-on-redirect-chassis options set")
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovs-macros: An option to suspend test execution on error

Origins for this patch are captured at
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048923.html.

Summarizing here, when a test fails, it would be good to pause test execution
and let the developer poke around the system to see current status of system.

As part of this patch, made a small tweaks to ovs-macros.at, so that when test
suite fails, ovs_on_exit() function will be called. And in this function, a check
is made to see if an environment variable to OVS_PAUSE_TEST is set. If it is
set, then test suite is paused and will continue to wait for user input
Ctrl-D. Meanwhile user can poke around the system to see why test case has
failed. Once done with investigation, user can press ctrl-d to cleanup the
test suite.

For example, to re-run test case 139:

export OVS_PAUSE_TEST=1
cd tests/system-userspace-testsuite.dir/139
sudo -E ./run

When error occurs, above command would display something like this:
=====================================================
Set environment variable to use various ovs utilities
export OVS_RUNDIR=/opt/vdasari/Developer/ovs/_build-gcc/tests/system-userspace-testsuite.dir/139
Press ENTER to continue:

=====================================================
And from another window, one can execute ovs-xxx commands like:
export OVS_RUNDIR=/opt/vdasari/Developer/ovs/_build-gcc/tests/system-userspace-testsuite.dir/139
$ ovs-ofctl dump-ports br0
.
.

To be able to pause while performing `make check`, one can do:
$ OVS_PAUSE_TEST=1 make check TESTSUITEFLAGS='-v'

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Vasu Dasari <vdasari@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

OVN: use trigger_event action to report 'empty_lb_rule' events

Add northd logical flows in order to reports that the controller
received an IP packet for LB rule witn no backends.
This configuration is used by OpenShift to spin up a idle POD

Signed-off-by: Mark Michelson <mmichels@redhat.com>
Co-authored-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

OVN: introduce trigger_event() action

Add trigger_event() ovn action in order to allow ovs-vswitchd to report
CMS related events.
This commit introduces a new event, empty_lb_backends. This event is
raised if a received packet is destined for a load balancer VIP that has
no configured backend destinations. For this event, the event info
includes the load balancer VIP, the load balancer UUID, and the
transport protocol.
The use case for this particular event is for the CMS to supply backend
resources to handle this traffic. For example, in Openshift, this event
can be used to spin up new containers to handle the incoming traffic.

Signed-off-by: Mark Michelson <mmichels@redhat.com>
Co-authored-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

OVN: introduce Controller_Event table

Add Controller_Event table to OVN SBDB in order to
report CMS related event.
Introduce event_table hashmap array and controller_event related
structures to ovn-controller in order to track pending events
forwarded by ovs-vswitchd. Moreover integrate event_table hashmap
array with event_table ovn-sbdb table

Signed-off-by: Mark Michelson <mmichels@redhat.com>
Co-authored-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Shutdown SSL connection before closing socket

Without shutting down the SSL connection, log messages like:

stream_ssl|WARN|SSL_read: unexpected SSL connection close
jsonrpc|WARN|ssl:127.0.0.1:47052: receive error: Protocol error
reconnect|WARN|ssl:127.0.0.1:47052: connection dropped (Protocol error)

would occur whenever the socket is closed. This just adds an
SSLStream.close() that calls shutdown() and ignores SSL errors, the
same way that lib/stream-ssl.c does in ssl_close().

Signed-off-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

tests: Add ovsdb-cluster-testsuite to gitignore.

CC: Han Zhou <hzhou8@ebay.com>
Fixes: 2bcb3b7052c8 ("ovsdb raft: Move ovsdb cluster tests to separate testsuite.")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Han Zhou <hzhou8@ebay.com>

checkpatch: Check FOR_EACH loops with numbers.

OVS has defines for loops like 'BITMAP_FOR_EACH_1' or
'ULLONG_FOR_EACH_1', but the regexp in checkpatch doesn't match with
numbers and skips these loops while checking.

This patch adds numbers into regexp and adds some FER_EACH loops to
the unit tests.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Ben Pfaff <blp@ovn.org>
Acked-by: Aaron Conole <aconole@redhat.com>

ovn: Fix the test failures in travis CI.

After the commit [1], below test cases are failing repeatedly in travis CI.

2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597)
2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844)
2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580)
2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624)

With the commit [1], ovn-controller sends GARPs for the IPs of the distributed
router ports. The failing tests did not handle the situation if multiple GARPs
are sent. The failures are mostly timing related. This patch fixes these issues.

[1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")

Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
CC: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-vport: Make ip6gre netdev type to use TC rules

The offload api functions already assigned to every tunnel class.
For ip6gre tunnel class only need to also assign the get_ifindex
function, similarly as done in commit 5e63eaa969a3 ("netdev-vport: Make
gre netdev type to use TC rules").

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-performance.at: Fix syntax error in ACL.

This doesn't impact the effectiveness of the test but just fix an
obvious error in ACL syntax which was noticed when looking at test
logs.

Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-performance.at: Missing steps for connecting LS to LR.

The test creates 2 logical switches and connect them with a logical router.
However, it didn't set the option "router-port", so the 2 LS datapaths
were not connected. This results in missing test coverage for port-binding
incremental processing: assume I-P has a bug and port-binding change always
trigger recompute, since each HV monitors only its own datapath (i.e.
HV1 -> ls1, HV2 -> ls2) then it never got notification of the other
port-binding change, thus recompute is never triggered when port-binding
is updated on the other datapath. With this fix, each HV's local datapaths
will include both ls1 and ls2, so port-binding change notification will
be received properly and unexpected recompute would be captured.

Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovsdb-server: drop all connections on read/write status change

Prior to this patch, only db change aware connections were dropped
on a read/write status change. However, current schema in OVN does
not allow clients to monitor whether a particular DB changes this
status. In order to accomplish this, we'd need to change the schema
and adapting ovsdb-server and existing clients.

Before tackling that, this patch is changing ovsdb-server to drop
*all* the existing connections upon a read/write status change. This
will force clients to reconnect and honor the change.

Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-July/048981.html
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

datapath: fix csum updates for MPLS actions

Upstream commit:
 commit 0e3183cd2a64843a95b62f8bd4a83605a4cf0615
 Author: John Hurley <john.hurley@netronome.com>
 Date: Thu Jun 27 14:37:30 2019 +0100

 net: openvswitch: fix csum updates for MPLS actions

 Skbs may have their checksum value populated by HW. If this is a checksum
 calculated over the entire packet then the CHECKSUM_COMPLETE field is
 marked. Changes to the data pointer on the skb throughout the network
 stack still try to maintain this complete csum value if it is required
 through functions such as skb_postpush_rcsum.

 The MPLS actions in Open vSwitch modify a CHECKSUM_COMPLETE value when
 changes are made to packet data without a push or a pull. This occurs when
 the ethertype of the MAC header is changed or when MPLS lse fields are
 modified.

 The modification is carried out using the csum_partial function to get the
 csum of a buffer and add it into the larger checksum. The buffer is an
 inversion of the data to be removed followed by the new data. Because the
 csum is calculated over 16 bits and these values align with 16 bits, the
 effect is the removal of the old value from the CHECKSUM_COMPLETE and
 addition of the new value.

 However, the csum fed into the function and the outcome of the
 calculation are also inverted. This would only make sense if it was the
 new value rather than the old that was inverted in the input buffer.

 Fix the issue by removing the bit inverts in the csum_partial calculation.

 The bug was verified and the fix tested by comparing the folded value of
 the updated CHECKSUM_COMPLETE value with the folded value of a full
 software checksum calculation (reset skb->csum to 0 and run
 skb_checksum_complete(skb)). Prior to the fix the outcomes differed but
 after they produce the same result.

Fixes: 25cd9ba0abc0 ("openvswitch: Add basic MPLS support to kernel")
Fixes: bc7cc5999fd3 ("openvswitch: update checksum in {push,pop}_mpls")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: ccf4378615e9 ("datapath: Add basic MPLS support to kernel")
Fixes: b51367aad315 ("datapath: update checksum in {push,pop}_mpls")
Cc: John Hurley <john.hurley@netronome.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat: ip6_gre: fix possible use-after-free in ip6erspan_rcv

Upstream commit:
 commit 2a3cabae4536edbcb21d344e7aa8be7a584d2afb
 Author: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
 Date: Sat Apr 6 17:16:53 2019 +0200

 net: ip6_gre: fix possible use-after-free in ip6erspan_rcv

 erspan_v6 tunnels run __iptunnel_pull_header on received skbs to remove
 erspan header. This can determine a possible use-after-free accessing
 pkt_md pointer in ip6erspan_rcv since the packet will be 'uncloned'
 running pskb_expand_head if it is a cloned gso skb (e.g if the packet has
 been sent though a veth device). Fix it resetting pkt_md pointer after
 __iptunnel_pull_header

Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: c387d8177f20 ("compat: Add ipv6 GRE and IPV6 Tunneling")
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

dpif-netdev: Clarify PMD reloading scheme.

It became more complicated, hence needs to be documented.

Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Catch reloads faster.

Looking at the reload flag only every 1024 loops can be a long time
under load, since we might be handling 32 packets per rxq, per iteration,
which means up to poll_cnt * 32 * 1024 packets.
Look at the flag every loop, no major performance impact seen.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Only reload static tx qid when needed.

pmd->static_tx_qid is allocated under a mutex by the different pmd
threads.
Unconditionally reallocating it will make those pmd threads sleep
when contention occurs.
During "normal" reloads like for rebalancing queues between pmd threads,
this can make pmd threads waste time on this.
Reallocating the tx qid is only needed when removing other pmd threads
as it is the only situation when the qid pool can become uncontiguous.

Add a flag to instruct the pmd to reload tx qid for this case which is
Step 1 in current code.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Do not sleep when swapping queues.

When swapping queues from a pmd thread to another (q0 polled by pmd0/q1
polled by pmd1 -> q1 polled by pmd0/q0 polled by pmd1), the current
"Step 5" puts both pmds to sleep waiting for the control thread to wake
them up later.

Prefer to make them spin in such a case to avoid sleeping an
undeterministic amount of time.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Trigger parallel pmd reloads.

pmd reloads are currently serialised in each steps calling
reload_affected_pmds.
Any pmd processing packets, waiting on a mutex etc... will make other
pmd threads wait for a delay that can be undeterministic when syscalls
adds up.

Switch to a little busy loop on the control thread using the existing
per-pmd reload boolean.

The memory order on this atomic is rel-acq to have an explicit
synchronisation between the pmd threads and the control thread.

Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dpif-netdev: Convert exit latch to flag.

No need for a latch here since we don't have to wait.
A simple boolean flag is enough.

The memory order on the reload flag is changed to rel-acq ordering to
serve as a synchronisation point between the pmd threads and the control
thread that asks for termination.

Fixes: e4cfed38b159 ("dpif-netdev: Add poll-mode-device thread.")
Signed-off-by: David Marchand <david.marchand@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Ian Stokes <ian.stokes@intel.com>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

dist-docs: Fix bugs in text to HTML conversion.

This fixes two bugs. First, & has a special meaning in the replacement
text for a sed "s" command, so this escapes it. Second, this code
misprocessed bold or underlined &<>: >^H> would become >^H> which
would display as &gt> in most browers.

Finally, this improves the HTML output so that bold ABC becomes ABC
instead of ABC.

Reported-by: Nicolas Bouliane <nbouliane@digitalocean.com>
Reported-at: https://twitter.com/nicboul/status/1126959264772259842
Signed-off-by: Ben Pfaff <blp@ovn.org>

OVN: run local logical flows first in S_ROUTER_OUT_SNAT table

Run local logical flows first if the gw router port is scheduled
on the local chassis in order to properly manage snat traffic

Tested-by: Eran Kuris <ekuris@redhat.com>
Acked-by: Numan Siddique <nusiddiq@redhat.com>
Acked-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

rhel: Fixed a bug for checking the correct major version and revision.

Fixed a bug where checking for major version 3.10 and major revision not
equal to 327 or 693 or 957 should have gone to the default else at the end.
In the current code, the default else condition will not get executed
for kernel with major version 3.10 and major revision not equal
to 327/693/957 resulting in failure to load the kernel module.

Fixes: 402efbe4e176 ("rhel: Add 4.12 kernel support in ovs-kmod-manage.sh")
Signed-off-by: Ashish Varma <ashishvarma.ovs@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-controller: Update stale chassis entry at init

The first time ovn-controller initializes the Chassis entry (shortly
after start up) we first look if there is a stale Chassis record in the
OVN_Southbound DB by checking if any of the old Encap entries associated
to the Chassis record match the new tunnel configuration. If found it
means that ovn-controller didn't shutdown gracefully last time it was
run so it didn't cleanup the Chassis table. Potentially in the meantime
the OVS system-id was also changed. We then update the stale entry with
the new configuration and store the last configured chassis-id in memory
to avoid walking the Chassis table every time.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-controller: Refactor chassis.c to abstract the string parsing

Abstract out the chassis config string processing and use library data
structures (e.g., sset).
Rename the get_chassis_id function in ovn-controller.c to
get_ovs_chassis_id to avoid confusion with the newly added
chassis_get_id function from chassis.c which returns the last
successfully configured chassis-id.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-controller: Fix chassis ovn-sbdb record init

The chassis_run code didn't take into account the scenario when the
system-id was changed in the Open_vSwitch table. Due to this the code
was trying to insert a new Chassis record in the OVN_Southbound DB with
the same Encaps as the previous Chassis record. The transaction used
to insert the new records was aborting due to the ["type", "ip"]
index constraint violation as we were creating new Encap entries with
the same "type" and "ip" as the old ones.

In order to fix this issue the flow is now:
1. the first time ovn-controller initializes the Chassis (shortly after
start up) we store the chassis-id.
2. for subsequent chassis_run calls we use last configured
chassis-id stored at the previous step to lookup the old Chassis record.
3. when ovn-controller shuts down gracefully we lookup the Chassis
record based on the chassis-id stored in memory at steps 1 and 2 above.
This is to avoid failing to cleanup the Chassis record in OVN_Southbound
DB if the OVS system-id changes between the last call to chassis_run and
chassis_cleanup.

Reported-at: https://bugzilla.redhat.com/1708146
Reported-by: Haidong Li <haili@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

netdev-dpdk: Enable tx-retries-max config.

vhost tx retries can provide some mitigation against
dropped packets due to a temporarily slow guest/limited queue
size for an interface, but on the other hand when a system
is fully loaded those extra cycles retrying could mean
packets are dropped elsewhere.

Up to now max vhost tx retries have been hardcoded, which meant
no tuning and no way to disable for debugging to see if extra
cycles spent retrying resulted in rx drops on some other
interface.

Add an option to change the max retries, with a value of
0 effectively disabling vhost tx retries.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

netdev-dpdk: Add custom stat for vhost tx retries.

vhost tx retries may occur, and it can be a sign that
the guest is not optimally configured.

Add a custom stat so a user will know if vhost tx retries are
occurring and hence give a hint that guest config should be
examined.

Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

doc: Move vhost tx retry info to separate section.

vhost tx retry is applicable to vhost-user and vhost-user-client,
but was in the section that compares them. Also, moved further
down the doc as prefer to have more fundamental info about vhost
nearer the top.

Fixes: 6d6513bfc657 ("doc: Add info on vhost tx retries.")
Reported-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ian Stokes <ian.stokes@intel.com>

OVN: do not distribute traffic for local FIP

Do not send traffic for local FIP through the overlay tunnels but
manage it in the local hypervisor

Acked-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

Documentation: Clarify connection tracking tutorial

The current documentation states that "all packets entering OVS for
the first time are "untracked"". However there is a minor exception
to this in the case where a packet (re)enters the same datapath and
the namespace has not changed. In that case there is no need to
scrub the packet and in this case the connection may already be
in the "tracked" state.

Reported-by: Quan Tian <qtian@vmware.com>
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

rconn: Increase precision of timers.

Until now, the rconn timers have been precise only to the nearest second.
This increases them to millisecond precision, which seems cleaner these
days.

Signed-off-by: Ben Pfaff <blp@ovn.org>

rconn: Remove write-only struct members.

Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

sat-math: Add functions for saturating arithmetic on "long long int".

The first users will be added in an upcoming commit.

Also add tests.

Acked-by: Ilya Maximets <i.maximets@samsung.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ofproto-dpif-upcall: Remove unused macro MAX_QUEUE_LENGTH.

Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-controller: Omit tracking external_ids columns

Running the command "ovn-nbctl set logical_switch_port foo external_ids:foo=bar"
results in the incremetal processing engine to recompute the flows on the
chassis where the logical port 'foo' is claimed.

This patch avoids this unnecessary recomputation by omitting the tracking of
external_ids column of all the Southbound DB tables except DNS, Chassis
and Datapath_Binding tables. ovn-controller is refering to the external_ids
column of these tables.

Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

OVN: Enable E-W Traffic, Vlan backed DVR

Background:
[1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-October/353066.html
[2] https://docs.google.com/document/d/1uoQH478wM1OZ16HrxzbOUvk5LvFnfNEWbkPT6Zmm9OU/edit?usp=sharing

Key difference between an overlay logical switch and
vlan backed logical switch is that for vlan logical switches
packets are not encapsulated.

Hence, if a distributed router port is connected to vlan backed
logical switch, then router port mac as source mac could be
seen from multiple hypervisors. Same <mac,vlan> pairs coming
from multiple ports from a top of the rack switch (TOR) perspective
could be seen as a security threat and it could send alarms, drop
the packets or block the ports etc.

This patch addresses the same by introducing the concept of chassis mac.
A chassis mac is CMS provisioned unique mac per chassis. For any routed packet
(i.e source mac is router port mac) going on the wire on a vlan type
logical switch, we will replace its source mac with chassis mac.

This replacing of source mac with chassis mac will happen in table=65
of the logical switch datapath. A flow is added at priority 150, which
matches the source mac and replaces it with chassis mac if the value
is a router port mac.

Example flow:
cookie=0x0, duration=67765.830s, table=65, n_packets=0, n_bytes=0,
idle_age=65534, hard_age=65534, priority=150,reg15=0x1,metadata=0x4,
dl_src=00:00:01:01:02:03 actions=mod_dl_src:aa:bb:cc:dd:ee:ff,
mod_vlan_vid:1000,output:16

Here, 00:00:01:01:02:03 is router port mac and aa:bb:cc:dd:ee:ff
is chassis mac.

Acked-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ankur Sharma <ankur.sharma@nutanix.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-controller: Provide the option to configure inactivity probe interval for OpenFlow conn

If the ovn-controller main loop takes more than 5 seconds (if there are lots of logical
flows) before it calls poll_block(), it causes the poll_block to wake up immediately,
since rconn module has to send echo request. With the incremental processing, this is
not an issue as ovn-controller will not recompute again. But for older versions, this
is an issue as it causes flow recomputations and this would result in 100% cpu all the
time.

With this patch, CMS can configure a higher value depending the workload.

The main intention of this patch is to fix this recompuation issue for older versions
(there by requesting backport), it still would be beneficial with the
incremental processing engine.

Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Tested-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

db-ctl-base: fix memory leak in cmd-get() function

Memory leak occured in case specified key was not found in table
record.

Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn: Send GARP for router port IPs of a router port connected to bridged logical switch

This patch handles sending GARPs for

- router port IPs of a distributed router port

- router port IPs of a router port which belongs to gateway router
(with the option - redirect-chassis set in Logical_Router.options)

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn: Send GARP for the router ports with reside-on-redirect-chassis options set

With the commit [1], the routing for the provider logical switches
connected to a router is centralized on the master gateway chassis
(if the option - reside-on-redirect-chassis) is set. When the
failover happens and a standby gateway chassis becomes master,
it should send GARPs for the router port macs. Without this, the
physical switch doesn't learn the new location of the router port macs
immediately and this could result in traffic disruption.

This patch addresses this issue so that the ovn-controller which claims the
distributed gatweway router port sends out the GARPs.

ovn-controller sends the GARPs if the Port_Binding.nat_addresses column
is set. This patch makes use of this column, instead of adding a new column
even though the name - nat_addresses seems a bit misnomer. The documentation is
updated to highlight the usage of this column.

This patch doesn't handle sending the GARPs for the gateway router port IPs.
This will be handled in a separate patch.

[1] - 85706c34d53d ("ovn: Avoid tunneling for VLAN packets redirected to a gateway chassis")

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ovn-northd: Refactor the code which sets nat_addresses

The present code which sets the Port_Binding.nat_addresses
can be simplied. This patch does this. This would help in
upcoming commits to set the nat_addresses column with the
mac and IPs of distributed logical router ports and logical
router ports with 'reside-on-redirect-chassis' set.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

tunnel: Add layer 2 IPv6 GRE encapsulation support.

The patch adds ip6gre support. Tunnel type 'ip6gre' with packet_type=
legacy_l2 is a layer 2 GRE tunnel over IPv6, carrying inner ethernet packets
and encap with GRE header with outer IPv6 header. Encapsulation of layer 3
packet over IPv6 GRE, ip6gre, is not supported yet. I tested it by running:
 # make check-kernel TESTSUITEFLAGS='-k ip6gre'
under kernel 5.2 and for userspace:
 # make check TESTSUITEFLAGS='-k ip6gre'

Tested-by: Greg Rose <gvrose8192@gmail.com>
Tested-at: https://travis-ci.org/gvrose8192/ovs-experimental/builds/552977116
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat: Clean up tunnel_id_to_key

This function was just a duplicate of tunnel_id_to_key32 - I'm not sure
why it was ever needed but let's dump it now.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat: Clean up gre_calc_hlen

It's proliferated throughout three .c files so let's pull them all
together in gre.h where the inline function belongs. This requires
some adjustments to the compat layer so that the various iterations
of gre_calc_hlen and ip_gre_calc_hlen since the 3.10 kernel are
handled correctly.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

compat: Remove duplicate metadata destination code

ip_gre.c and ip6_gre.c both had duplicate code for handling the tunnel
metadata destinations. Move the duplicate code over into the right
header file, dst_metadata.h.

Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>

ossfuzz: Remove duplicate tcp flags parsing in flow extract target

During a code audit, the flow extraction fuzzer target was seen to be
parsing tcp flags from the fuzzer supplied input twice. This is
probably a typo since the second call to `parse_tcp_flags()` is
identical to the first.
Since a call to `parse_tcp_flags()` parses the Ethernet and IP headers
contained in the packet, the second (buggy) call to `parse_tcp_flags()`
creates an expectation that there is a second set of Ethernet and IP
headers beyond the first which is incorrect. This patch fixes this
problem by removing the duplicate code in question.

Signed-off-by: Bhargava Shastry <bshas3@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>