Han Zhou [Thu, 22 Aug 2019 21:08:22 +0000 (14:08 -0700)]
raft: Save and read new election timer in header snapshot.
This patch store the latest election timer in snapshot during log
compression, and when server restarts it reads the value from the log.
Without this, any previous changes to election timer will be lost
in the log, and if server restarts, it will use the default value
instead of the changed value.
Fixes: commit 8e35461 ("ovsdb raft: Support leader election time change online.") Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Han Zhou [Thu, 22 Aug 2019 21:08:21 +0000 (14:08 -0700)]
raft.c: Election timer initial reset with value from log.
After election timer is changed through cluster/change-election-timer
command, if a server restarts, it firstly initializes with the default
value and use it to reset the timer. Although it reads the latest
timer value later from the log, the first timeout may be much shorter
than expected by other servers that use latest timeout, and it would
start election before it receives the first heartbeat from the leader.
This patch fixes it by changing the order of reading log and resetting
timer so that the latest value is read from the log before the initial
resetting of the timer.
Fixes: commit 8e35461 ("ovsdb raft: Support leader election time change online.") Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
ofproto-dpif: Fix using uninitialised memory in user_action_cookie.
Designated initializers are not suitable for initializing non-packed
structures and unions which are subjects for comparison by memcmp().
Whole memory for 'struct user_action_cookie' must be explicitly cleared
before using because it will be copied with memcpy and later compared
by memcmp in ofpbuf_equal().
Few issues found be valgrind:
Thread 13 revalidator11:
Conditional jump or move depends on uninitialised value(s)
at 0x4C35D96: __memcmp_sse4_1 (in vgpreload_memcheck.so)
by 0x9D4404: ofpbuf_equal (ofpbuf.h:273)
by 0x9D4404: revalidate_ukey__ (ofproto-dpif-upcall.c:2219)
by 0x9D4404: revalidate_ukey (ofproto-dpif-upcall.c:2286)
by 0x9D62AC: revalidate (ofproto-dpif-upcall.c:2685)
by 0x9D62AC: udpif_revalidator (ofproto-dpif-upcall.c:942)
by 0xA9C732: ovsthread_wrapper (ovs-thread.c:383)
by 0x5FF86DA: start_thread (pthread_create.c:463)
by 0x6AF488E: clone (clone.S:95)
Uninitialised value was created by a stack allocation
at 0x9D4450: compose_slow_path (ofproto-dpif-upcall.c:1062)
Thread 11 revalidator16:
Conditional jump or move depends on uninitialised value(s)
at 0x4C35D96: __memcmp_sse4_1 (in vgpreload_memcheck.so)
by 0x9D4404: ofpbuf_equal (ofpbuf.h:273)
by 0x9D4404: revalidate_ukey__ (ofproto-dpif-upcall.c:2220)
by 0x9D4404: revalidate_ukey (ofproto-dpif-upcall.c:2287)
by 0x9D62BC: revalidate (ofproto-dpif-upcall.c:2686)
by 0x9D62BC: udpif_revalidator (ofproto-dpif-upcall.c:942)
by 0xA9C6D2: ovsthread_wrapper (ovs-thread.c:383)
by 0x5FF86DA: start_thread (pthread_create.c:463)
by 0x6AF488E: clone (clone.S:95)
Uninitialised value was created by a stack allocation
at 0x9DC4E0: compose_sflow_action (ofproto-dpif-xlate.c:3211)
The struct was never marked as 'packed', however it was manually
adjusted to be so in practice.
Old IPFIX related commit first made the structure non-contiguous.
Commit 8de6ff3ea864 ("ofproto-dpif: Use a fixed size userspace cookie.")
added uninitialized parts of the additional union space and the next
one introduced new holes between structure fields for all cases.
CC: Justin Pettit <jpettit@ovn.org> Fixes: 8b7ea2d48033 ("Extend OVS IPFIX exporter to export tunnel headers") Fixes: 8de6ff3ea864 ("ofproto-dpif: Use a fixed size userspace cookie.") Fixes: fcb9579be3c7 ("ofproto: Add 'ofproto_uuid' and 'ofp_in_port' to user action cookie.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Ben Pfaff <blp@ovn.org>
Lorenzo Bianconi [Thu, 22 Aug 2019 16:51:40 +0000 (18:51 +0200)]
Remove ageing check in run_put_mac_binding
Remove ageing check in run_put_mac_binding routine on mac-binding info
since if ovn-controller main thread is heavy loaded the info will be
discarded and the mac_binding table will not never be updated
Acked-by: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Lorenzo Bianconi [Mon, 29 Jul 2019 11:41:03 +0000 (13:41 +0200)]
OVN: fix DNAT/SNAT system-ovn unit tests
Fix conntrack checks in the following tests in tests/system-ovn.at:
- ovn -- DNAT and SNAT on distributed router - N/S
- ovn -- DNAT and SNAT on distributed router - E/W
Fixes: a6ee09882283 ("OVN: run local logical flows first in S_ROUTER_OUT_SNAT table") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Aliasgar Ginwala [Sat, 17 Aug 2019 07:21:45 +0000 (00:21 -0700)]
OVS: Containerize components
1. Start OVS components in containers so that building and shipping
of OVS components is easy.
2. Load OVS kernel modules on host from container to avoid installing ovs
on host.
3. Update documentation about how to build/run ovs in docker.
ofproto-dpif: Fix for recirc issue with mpls traffic with dp_hash
Fix infinite recirculation loop for MPLS packets sent to dp_hash-based
select group
Issue:
When a MPLS encapsulated packet is received, the MPLS header is removed,
a recirculation id assigned and then recirculated into the pipeline.
If the flow rules require the packet to be then sent over DP-HASH based
select group buckets, the packet has to be recirculated again. However,
the same recirculation id was used and this resulted in the packet being
repeatedly recirculated until it got dropped because the maximum recirculation
limit was hit.
Fix:
Include the “was_mpls” boolean which indicates whether the packet was MPLS
encapsulated when computing the hash. After popping the MPLS header this will
result in a different hash value than before and new recirculation id will
get generated.
DPCTL flows with and without the fix are shown below
Without Fix:
recirc_id(0x1),dp_hash(0x5194bf18/0xf),in_port(2),packet_type(ns=0,id=0),
eth_type(0x0800),ipv4(frag=no), packets:20, bytes:1960,
used:0.329s, actions:1
recirc_id(0x1),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),
ipv4(frag=no), packets:20, bytes:1960, used:0.329s,
actions:hash(sym_l4(0)),recirc(0x1)
recirc_id(0),in_port(2),packet_type(ns=0,id=0),eth_type(0x8847),
mpls(label=22/0xfffff,tc=0/0,ttl=64/0x0,bos=1/1), packets:20, bytes:2040,
used:0.329s, actions:pop_mpls(eth_type=0x800),recirc(0x1)
upcall: Configure datapath min-revalidate-pps through ovs-vsctl.
This patch adds a new configuration option, "min-revalidate-pps" to the
Open_vSwitch "other-config" column. This sets minimum pps that flow must
have in order to be revalidated when revalidation duration exceeds half of
max-revalidator config variable.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
upcall: Configure datapath max-revalidator through ovs-vsctl.
This patch adds a new configuration option, "max-revalidator" to the
Open_vSwitch "other-config" column. This sets maximum allowed ravalidator
timeout. Actual timeout value is determined at runtime as minimum of
"max-idle" and "max-revalidator".
Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Han Zhou [Mon, 19 Aug 2019 23:30:35 +0000 (16:30 -0700)]
ovsdb monitor: Fix crash when using non-zero last-id with standalone DB.
When a client uses monitor-cond-since with a non-zero last-id but the
server is not in cluster mode for the DB being monitored, it leads to
segmentation fault because the txn_history list is not initialized in
this case.
Program terminated with signal SIGSEGV, Segmentation fault.
1536 struct ovsdb_txn *txn = h_node->txn;
(gdb) bt
0 ovsdb_monitor_get_changes_after (txn_uuid=txn_uuid@entry=0x7ffe8605b7e0, dbmon=0x17c1b40, p_mcs=p_mcs@entry=0x17c4900) at ovsdb/monitor.c:1536
1 0x000000000040da2d in ovsdb_jsonrpc_monitor_create (request_id=0x1804630, version=<optimized out>, params=0x17ad330, db=0x18015b0, s=<optimized out>) at ovsdb/jsonrpc-server.c:1469
2 ovsdb_jsonrpc_session_got_request (request=0x17ad520, s=<optimized out>) at ovsdb/jsonrpc-server.c:1002
3 ovsdb_jsonrpc_session_run (s=<optimized out>) at ovsdb/jsonrpc-server.c:556
...
Although it doesn't happen in normal use cases, no one can prevent a
client to send this on purpose or in a corner case when a client firstly
connected to a clustered DB but later the server restarted with a
non-clustered DB.
This patch fixes it by always initialize the txn_history list to avoid
the undefined behavior in this case. It adds a test case to cover it, too.
Fixes: 695e815 ("ovsdb-server: Transaction history tracking.") Reported-by: Aliasgar Ginwala <aginwala@ebay.com> Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Han Zhou [Mon, 19 Aug 2019 16:30:00 +0000 (09:30 -0700)]
ovsdb raft: Support leader election time change online.
A new unixctl command cluster/change-election-timer is implemented to
change leader election timeout base value according to the scale needs.
The change takes effect upon consensus of the cluster, implemented through
the append-request RPC. A new field "election-timer" is added to raft log
entry for this purpose.
Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Han Zhou [Mon, 19 Aug 2019 16:29:59 +0000 (09:29 -0700)]
raft.c: Set candidate_retrying if no leader elected since last election.
candiate_retrying is used to determine if the current node is disconnected
from the cluster when the node is in candiate role. However, a node
can flap between candidate and follower role before a leader is elected
when majority of the cluster is down, so is_connected() will flap, too, which
confuses clients.
This patch avoids the flapping with the help of a new member had_leader,
so that if no leader was elected since last election, we know we are
still retrying, and keep as disconnected from the cluster.
Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Han Zhou [Mon, 19 Aug 2019 16:29:58 +0000 (09:29 -0700)]
raft.c: Stale leader should disconnect from cluster.
As mentioned in RAFT paper, section 6.2:
Leaders: A server might be in the leader state, but if it isn’t the current
leader, it could be needlessly delaying client requests. For example, suppose a
leader is partitioned from the rest of the cluster, but it can still
communicate with a particular client. Without additional mechanism, it could
delay a request from that client forever, being unable to replicate a log entry
to any other servers. Meanwhile, there might be another leader of a newer term
that is able to communicate with a majority of the cluster and would be able to
commit the client’s request. Thus, a leader in Raft steps down if an election
timeout elapses without a successful round of heartbeats to a majority of its
cluster; this allows clients to retry their requests with another server.
Reported-by: Aliasgar Ginwala <aginwala@ebay.com> Tested-by: Aliasgar Ginwala <aginwala@ebay.com> Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Han Zhou [Mon, 19 Aug 2019 16:29:57 +0000 (09:29 -0700)]
ovsdb-idl.c: Allows retry even when using a single remote.
When clustered mode is used, the client needs to retry connecting
to new servers when certain failures happen. Today it is allowed to
retry new connection only if multiple remotes are used, which prevents
using LB VIP with clustered nodes. This patch makes sure the retry
logic works when using LB VIP: although same IP is used for retrying,
the LB can actually redirect the connection to a new node.
Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Ben Pfaff [Wed, 21 Aug 2019 17:17:21 +0000 (10:17 -0700)]
sat-math: Do not use __builtin_s*_overflow() with sparse.
Some versions of sparse do not understand __builtin_saddll_overflow() and
related GCC builtins for calculations with overflow detection. This patch
avoids using them when sparse is in use.
Aliasgar Ginwala [Tue, 20 Aug 2019 01:36:13 +0000 (18:36 -0700)]
ovs-lib: Fix standalone db migration to raft
Current code of create-cluster from standalone db takes backup of existing
standalone db and then generates a new clustered dbs from backup dbs. Hence,
during migration if nb and sb dbs are still present, create-cluster will fail
saying file exists and will not really convert dbs to clustered dbs. This
patch fixes the same.
e.g message that pops up while migration from standalone to raft cluster:
* Backing up database to /etc/openvswitch/ovnnb_db.db.backup5.13.0-1278623084
ovsdb-tool: I/O error: /etc/openvswitch/ovnnb_db.db: create failed (File exists)
* Creating cluster database /etc/openvswitch/ovnnb_db.db from existing one
Signed-off-by: aginwala <aginwala@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Anand Kumar [Thu, 15 Aug 2019 16:39:06 +0000 (09:39 -0700)]
datapath-windows: Fix updating ct label when mask is specified
When an existing label needs to be changed by specifing bits to be
updated using mask, instead of updating only the masked bits,
new label was getting overridden. This patch fixes this issue.
Signed-off-by: Anand Kumar <kumaranand@vmware.com> Acked-by: Alin Gabriel Serdean <aserdean@ovn.org> Signed-off-by: Alin Gabriel Serdean <aserdean@ovn.org>
Ilya Maximets [Mon, 5 Aug 2019 15:14:46 +0000 (18:14 +0300)]
travis: Drop OSX workarounds.
TravisCI currently uses xcode9.4 as a default image and it
it has good version of libtool out-of-the-box.
Removing these workarounds saves 4-6 minutes of OSX build.
Ilya Maximets [Mon, 5 Aug 2019 12:26:02 +0000 (15:26 +0300)]
travis: Combine kernel builds.
Single kernel build job takes ~3 minutes in average. Most of
this time takes VM spawning and initial configuration.
Combining these 24 jobs in 4 allows us to better utilize workers
and not waste time on spawning VMs.
Ilya Maximets [Mon, 5 Aug 2019 07:58:18 +0000 (10:58 +0300)]
travis: Cache DPDK build.
This change enables cache for DPDK build directory, so we'll never
build same version of DPDK again. This speeds up each DPDK related
job by 4-6 minutes effectively saving 30-50 minutes of the total time.
Ex. Full TravisCI run on 'trusty' images:
Without cache:
Ran for 1 hr 9 min 29 sec
Total time 4 hrs 55 min 13 sec
With populated cache:
Ran for 1 hr 2 min 18 sec
Total time 4 hrs 20 min 9 sec
Saved:
Real time: ~7 minutes.
Total worker time: ~35 minutes.
Eelco Chaudron [Thu, 8 Aug 2019 13:27:05 +0000 (09:27 -0400)]
netdev-afxdp: fix corner case where umem entries were not released
If for some reason the last element in the batch was already pushed on
the stack, none of the elements where pushed. This was leading to
buffer starvation.
John Hurley [Tue, 30 Jul 2019 11:05:17 +0000 (12:05 +0100)]
ovs-tc: offload MPLS set actions to TC datapath
Recent modifications to TC allows the modifying of fields within the
outermost MPLS header of a packet. OvS datapath rules impliment an MPLS
set action by supplying a new MPLS header that should overwrite the
current one.
Convert the OvS datapath MPLS set action to a TC modify action and allow
such rules to be offloaded to a TC datapath.
Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
John Hurley [Tue, 30 Jul 2019 11:05:16 +0000 (12:05 +0100)]
ovs-tc: offload MPLS push actions to TC datapath
TC can now be used to push an MPLS header onto a packet. The MPLS label is
the only information that needs to be passed here with the rest reverting
to default values if none are supplied. OvS, however, gives the entire
MPLS header to be pushed along with the MPLS protocol to use. TC can
optionally accept these values so can be made replicate the OvS datapath
rule.
Convert OvS MPLS push datapath rules to TC format and offload to a TC
datapath.
Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
John Hurley [Tue, 30 Jul 2019 11:05:15 +0000 (12:05 +0100)]
ovs-tc: offload MPLS pop actions to TC datapath
TC now supports an action to pop the outer MPLS header from a packet. The
next protocol after the header is required alongside this. Currently, OvS
datapath rules also supply this information.
Offload OvS MPLS pop actions to TC along with the next protocol.
Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
John Hurley [Tue, 30 Jul 2019 11:05:14 +0000 (12:05 +0100)]
compat: add compatibility headers for tc mpls action
OvS includes compat code for several TC actions including vlan, mirred and
tunnel key. MPLS actions have recently been added to TC in the kernel. In
preparation for adding TC offload code for MPLS, add the MPLS compat code.
Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
dpif-netlink: Allow offloading of flows with dl_type 0x1234.
'dpif_probe_feature()' always has DPIF_FP_PROBE flag set. Other probing
code uses dpif_execute() with DPIF_OP_EXECUTE, hence never calls
parse_flow_put().
Thus, this 'if' statement is wrong and should be removed as it only
forbids offloading of the real legitimate flows with dl_type 0x1234.
Dummy flows never reach this code.
CC: Paul Blakey <paulb@mellanox.com> Fixes: 8b668ee3f0cc ("dpif-netlink: Use netdev flow put api to insert a flow") Reported-by: Eli Britstein <elibr@mellanox.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
In case of failure of 'xsk_configure_all()', 'n_rxq' and 'xdpmode'
will remain in a new state. This will result in successful
reconfiguration (immediate return, because configuration is already
applied) if 'netdev_reconfigure()' will be called again.
Same issue was fixed previously for netdev-dpdk using 'dev->started'
flag in commit: 606f66507250 ("netdev-dpdk: Don't use PMD driver if not configured successfully")
Let's use similar approach with checking the 'dev->xsks' which only
exists if configuration was successful.
Additionally implemented 'netdev_afxdp_construct()' function to
explicitly initialize all the specific fields and request the
reconfiguration.
CC: William Tu <u9012063@gmail.com> Fixes: 0de1b425962d ("netdev-afxdp: add new netdev type for AF_XDP.") Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Yifeng Sun [Thu, 18 Jul 2019 22:59:45 +0000 (15:59 -0700)]
test: Fix fragment-related tests that fail on 4.19+ due to small-sized packets
These fragment-related tests are failing on later kernels (4.19.x)
because kernel quietly drops any packet fragment that is not the last
but has a size smaller than IPV6_MIN_MTU. This patch fixes
them by increasing their sizes to IPV6_MIN_MTU.
Reviewed-by: Darrell Ball <dlu998@gmail.com>
Reivewed-at: https://github.com/openvswitch/ovs/pull/278 Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
tnl-neigh-cache: Purge learnt neighbors when port/bridge is deleted
Say an ARP entry is learnt on a OVS port and when such a port is deleted,
learnt entry should be removed from the port. It would have be aged out after
ARP ageout time. This code will clean up immediately.
Added test case(tunnel - neighbor entry add and deletion) in tunnel.at, to
verify neighbors are added and removed on deletion of a ports and bridges.
Discussion for this addition is at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048754.html
Signed-off-by: Vasu Dasari <vdasari@gmail.com> Reviewed-by: Flavio Fernandes <flavio@flaviof.com> Reviewed-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
William Tu [Fri, 19 Jul 2019 14:49:01 +0000 (07:49 -0700)]
system-traffic: Make nsh test more robust.
The patch adds '-n' to tcpdump to avoid address coverting. Add '-l' for rhel8
to avoid buffering. Since '-U' is used to output to stdout, simply use 'cat'
to search result. Use OVS_WAIT_UNTIL instead of sleep, and also remove/add
some newlines. Finally, move tcpdump captured interface into the namespace,
(capture p1 instead of ovs-p1), and tested using af_xdp.
Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
William Tu [Thu, 18 Jul 2019 20:11:14 +0000 (13:11 -0700)]
netdev-afxdp: add new netdev type for AF_XDP.
The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology. It is aims to have comparable
performance to DPDK but cooperate better with existing kernel's networking
stack. An AF_XDP socket receives and sends packets from an eBPF/XDP program
attached to the netdev, by-passing a couple of Linux kernel's subsystems
As a result, AF_XDP socket shows much better performance than AF_PACKET
For more details about AF_XDP, please see linux kernel's
Documentation/networking/af_xdp.rst. Note that by default, this feature is
not compiled in.
Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
This commit adds a number of specialized functions, that handle
common miniflow fingerprints. This enables compiler optimization,
resulting in higher performance. Below a quick description of
how this optimization actually works;
"Specialized functions" are "instances" of the generic implementation,
but the compiler is given extra context when compiling. In the case of
iterating miniflow datastructures, the most interesting value to enable
compile time optimizations is the loop trip count per unit.
In order to create a specialized function, there is a generic implementation,
which uses a for() loop without the compiler knowing the loop trip count at
compile time. The loop trip count is passed in as an argument to the function:
uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
{
for(uint32_t i = 0; i < loop_count; i++)
// do work
}
In order to "specialize" the function, we call the generic implementation
with hard-coded numbers - these are compile time constants!
uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
{
// use hard coded constant for compile-time constant-propogation
return miniflow_impl_generic(mf, 5);
}
Given the compiler is aware of the loop trip count at compile time,
it can perform an optimization known as "constant propogation". Combined
with inlining of the miniflow_impl_generic() function, the compiler is
now enabled to *compile time* unroll the loop 5x, and produce "flat" code.
The last step to using the specialized functions is to utilize a
function-pointer to choose the specialized (or generic) implementation.
The selection of the function pointer is performed at subtable creation
time, when miniflow fingerprint of the subtable is known. This technique
is known as "multiple dispatch" in some literature, as it uses multiple
items of information (miniflow bit counts) to select the dispatch function.
By pointing the function pointer at the optimized implementation, OvS
benefits from the compile time optimizations at runtime.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Harry van Haaren [Thu, 18 Jul 2019 13:03:05 +0000 (14:03 +0100)]
dpif-netdev: Refactor generic implementation
This commit refactors the generic implementation. The
goal of this refactor is to simplify the code to enable
"specialization" of the functions at compile time.
Given compile-time optimizations, the compiler is able
to unroll loops, and create optimized code sequences due
to compile time knowledge of loop-trip counts.
In order to enable these compiler optimizations, we must
refactor the code to pass the loop-trip counts to functions
as compile time constants.
This patch allows the number of miniflow-bits set per "unit"
in the miniflow to be passed around as a function argument.
Note that this patch does NOT yet take advantage of doing so,
this is only a refactor to enable it in the next patches.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Harry van Haaren [Thu, 18 Jul 2019 13:03:03 +0000 (14:03 +0100)]
dpif-netdev: Move dpcls lookup structures to .h
This commit moves some data-structures to be available
in the dpif-netdev-private.h header. This allows specific
implementations of the subtable lookup function to include
just that header file, and not require that the code exists
in dpif-netdev.c
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Harry van Haaren [Thu, 18 Jul 2019 13:03:02 +0000 (14:03 +0100)]
dpif-netdev: Implement function pointers/subtable
This allows plugging-in of different subtable hash-lookup-verify
routines, and allows special casing of those functions based on
known context (eg: # of bits set) of the specific subtable.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Tested-by: Malvika Gupta <malvika.gupta@arm.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Dumitru Ceara [Wed, 17 Jul 2019 19:05:04 +0000 (21:05 +0200)]
ovsdb-idl: Mark row "parsed" in ovsdb_idl_txn_write__
Once a column is set in a row using ovsdb_idl_txn_write__ we also mark
the row "parsed". Otherwise the memory allocated by
sbrec_<table>_parse_<col>() will never be freed. After marking the row
"parsed", the ovsdb_idl_txn_disassemble function will properly free the
newly allocated memory for the column (ovsdb_idl_row_unparse).
The problem is present only for rows that are inserted by the
running application because rows that are loaded from the database
will always have row->parsed == true.
One way to detect the leak is to start northd with valgrind:
==9270== Memcheck, a memory error detector
==9270== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==9270== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright
info
==9270== Command: ./ovn-northd
==9270==
<snip>
==9270==
==9270== 8 bytes in 1 blocks are definitely lost in loss record 30 of
292
==9270== at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==9270== by 0x4D31EF: xmalloc (util.c:138)
==9270== by 0x45CB8E: sbrec_multicast_group_parse_ports (ovn-sb-idl.c:18141)
==9270== by 0x4BB12D: ovsdb_idl_txn_write__ (ovsdb-idl.c:4489)
==9270== by 0x4BB1B5: ovsdb_idl_txn_write (ovsdb-idl.c:4527)
==9270== by 0x45D167: sbrec_multicast_group_set_ports (ovn-sb-idl.c:18561)
==9270== by 0x40F416: ovn_multicast_update_sbrec (ovn-northd.c:2947)
==9270== by 0x41FC55: build_lflows (ovn-northd.c:7981)
==9270== by 0x421830: ovnnb_db_run (ovn-northd.c:8522)
==9270== by 0x422B2D: ovn_db_run (ovn-northd.c:9089)
==9270== by 0x423909: main (ovn-northd.c:9409)
==9270==
==9270== 157 (32 direct, 125 indirect) bytes in 1 blocks are definitely
lost in loss record 199 of 292
==9270== at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==9270== by 0x4D31EF: xmalloc (util.c:138)
==9270== by 0x471E3D: resize (hmap.c:100)
==9270== by 0x4720C8: hmap_expand_at (hmap.c:175)
==9270== by 0x4C74F1: hmap_insert_at (hmap.h:277)
==9270== by 0x4C825A: smap_add__ (smap.c:392)
==9270== by 0x4C7783: smap_add (smap.c:55)
==9270== by 0x451054: sbrec_datapath_binding_parse_external_ids (ovn-sb-idl.c:7181)
==9270== by 0x4BB12D: ovsdb_idl_txn_write__ (ovsdb-idl.c:4489)
==9270== by 0x4BB1B5: ovsdb_idl_txn_write (ovsdb-idl.c:4527)
==9270== by 0x451436: sbrec_datapath_binding_set_external_ids (ovn-sb-idl.c:7444)
==9270== by 0x4090F1: ovn_datapath_update_external_ids (ovn-northd.c:817)
==9270==
<snip>
==9270==
==9270== LEAK SUMMARY:
==9270== definitely lost: 1,322 bytes in 47 blocks
==9270== indirectly lost: 4,653 bytes in 240 blocks
==9270== possibly lost: 0 bytes in 0 blocks
==9270== still reachable: 254,004 bytes in 7,780 blocks
==9270== suppressed: 0 bytes in 0 blocks
==9270== Reachable blocks (those to which a pointer was found) are not
shown.
==9270== To see them, rerun with: --leak-check=full
--show-leak-kinds=all
==9270==
==9270== For counts of detected and suppressed errors, rerun with: -v
==9270== ERROR SUMMARY: 9 errors from 9 contexts (suppressed: 0 from 0)
Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Yipeng Wang [Wed, 17 Jul 2019 11:38:57 +0000 (04:38 -0700)]
doc: Remove experimental tag for SMC cache.
SMC cache was introduced in 2.10 with experimental tag.
SMC cache is a layer of software cache located after EMC
cache. The purpose is to improve the performance of use
cases that many flows missing the EMC cache.
One can enable SMC cache using smc-enable=true option.
Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
==3029==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x603000043260 at pc 0x7f6b09c2459b bp 0x7ffcb67e7540 sp 0x7ffcb67e6cf0
READ of size 40 at 0x603000043260 thread T0
#0 0x7f6b09c2459a (/lib/x86_64-linux-gnu/libasan.so.5+0xb859a)
#1 0x565110a748a5 in minimask_equal ../lib/flow.c:3510
#2 0x565110a9ea41 in minimatch_equal ../lib/match.c:1821
#3 0x56511091e864 in collect_rules_strict ../ofproto/ofproto.c:4516
#4 0x56511093d526 in delete_flow_start_strict ../ofproto/ofproto.c:5959
#5 0x56511093d526 in ofproto_flow_mod_start ../ofproto/ofproto.c:7949
#6 0x56511093d77b in handle_flow_mod__ ../ofproto/ofproto.c:6122
#7 0x56511093db71 in handle_flow_mod ../ofproto/ofproto.c:6099
#8 0x5651109407f6 in handle_single_part_openflow ../ofproto/ofproto.c:8406
#9 0x5651109407f6 in handle_openflow ../ofproto/ofproto.c:8587
#10 0x5651109e40da in ofconn_run ../ofproto/connmgr.c:1318
#11 0x5651109e40da in connmgr_run ../ofproto/connmgr.c:355
#12 0x56511092b129 in ofproto_run ../ofproto/ofproto.c:1826
#13 0x5651108f23cd in bridge_run__ ../vswitchd/bridge.c:2965
#14 0x565110904887 in bridge_run ../vswitchd/bridge.c:3023
#15 0x5651108e659c in main ../vswitchd/ovs-vswitchd.c:127
#16 0x7f6b093b709a in __libc_start_main ../csu/libc-start.c:308
#17 0x5651108e9009 in _start (/home/blp/nicira/ovs/_build/vswitchd/ovs-vswitchd+0x11d009)
This fixes the problem, which although largely theoretical could crop up
with odd implementations of memcmp(), perhaps ones optimized in various
"clever" ways. All in all, it seems best to avoid the theoretical problem.
Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Currently inside the ovsdb_server_promote() function we call 'promote_ovnnb'
and 'promote_ovnsb' and then just record the new master state in the
CIB.
This creates a race because those two promote commands are asynchronous
so when we exit the ovsdb_server_promote() function the underlying DBs
are not guaranteed to be in master state. That means that clients might
connect to an instance that is in read-only mode.
We add a simple sleep loop where we wait for the underlying DB state to
confirm the master state. We do not need to add a timeout loop because
in case of an issue the resource timeout set within pacemaker will kick
in and the resource agent script will be killed by pacemaker.
Tested this within an openstack environment using ovn with roughly ~20
reboots and was unable to trigger the issue (before the patch we would
trigger the issue after a couple of reboots tops).
Acked-by: Numan Siddique <nusiddiq@redhat.com> Acked-By: Daniel Alvarez <dalvarez@redhat.com> Signed-off-by: Michele Baldessari <michele@acksyn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
Ben Pfaff [Tue, 16 Jul 2019 17:18:18 +0000 (10:18 -0700)]
ofp-flow: Improve error message when cookie cannot be set.
The "cookie" value has two meanings in "ovs-ofctl add-flow", etc. With
a mask, it indicates a match; without a mask, it indicates that the
cookie should be set. In some cases, the cookie cannot be set, which may
mean that the user meant to indicate a match. The error message for this
case was poor; this improves it.
Suggested-by: "Yi Yang (杨燚)-云服务集团" <yangyi01@inspur.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
'odp_flow_key_to_flow__' is used to parse both keys and masks.
However, 'odp_nsh_key_from_attr' expects only 'key' as an argument
and fails to parse masks with ODP_FIT_ERROR which causes userspace
system tests failures:
# make check-system-userspace TESTSUITEFLAGS='-k nsh'
|odp_util(revalidator)|WARN|
OVS_NSH_KEY_ATTR_MD1 present but declared mdtype 0 is not 1 (NSH_M_TYPE1)
|odp_util(revalidator)|WARN|
the flow mask in error is:
<...> nsh(flags=0ttl=0,mdtype=0,np=255,spi=0x0,si=0),
for the following flow key:
<...> nsh_ttl=8,nsh_mdtype=1,nsh_np=3,nsh_spi=0x64,nsh_si=3,
nsh_c1=0x1020304,nsh_c2=0x5060708,nsh_c3=0x90a0b0c,nsh_c4=0xd0e0f10
<...>
Fix that by passing the additional argument 'is_mask' to make it be
like all other parsing functions.
Additionally fixed missing comma in the 'format_nsh_key'.
CC: Yi Yang <yi.y.yang@intel.com> Fixes: f59cb331c481 ("nsh: rework NSH netlink keys and actions") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: William Tu <u9012063@gmail.com>
flow: Wildcard UDP ports when using SYMMETRIC_L4 hash for select groups.
UDP source and destination ports are not used to derive the hash index
used for selecting the bucket in case of SYMMETRIC_L4 hash based select
groups. However, they are un-wildcarded in the megaflow entry match criteria.
This results in distinct megaflow entry being created for each pair of UDP
source and destination ports unnecessarily and causes significant performance
deterioration when the megaflow cache limit is reached.
This patch wildcards UDP ports when using select group with SYMMETRIC_L4
hash function.
Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com> CC: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Dumitru Ceara [Mon, 15 Jul 2019 20:25:03 +0000 (22:25 +0200)]
OVN: Add ovn-northd IGMP support
New IP Multicast Snooping Options are added to the Northbound DB
Logical_Switch:other_config column. These allow enabling IGMP snooping and
querier on the logical switch and get translated by ovn-northd to rows in
the IP_Multicast Southbound DB table.
ovn-northd monitors for changes done by ovn-controllers in the Southbound DB
IGMP_Group table. Based on the entries in IGMP_Group ovn-northd creates
Multicast_Group entries in the Southbound DB, one per IGMP_Group address X,
containing the list of logical switch ports (aggregated from all controllers)
that have IGMP_Group entries for that datapath and address X. ovn-northd
also creates a logical flow that matches on IP multicast traffic destined
to address X and outputs it on the tunnel key of the corresponding
Multicast_Group entry.
Signed-off-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Dumitru Ceara [Mon, 15 Jul 2019 20:24:49 +0000 (22:24 +0200)]
OVN: Add IGMP SB definitions and ovn-controller support
A new IP_Multicast table is added to Southbound DB. This table stores the
multicast related configuration for each datapath. Each row will be
populated by ovn-northd and will control:
- if IGMP Snooping is enabled or not, the snooping table size and multicast
group idle timeout.
- if IGMP Querier is enabled or not (only if snooping is enabled too), query
interval, query source addresses (Ethernet and IP) and the max-response
field to be stored in outgoing queries.
- an additional "seq_no" column is added such that ovn-sbctl or if needed a
CMS can flush currently learned groups. This can be achieved by incrementing
the "seq_no" value.
A new IGMP_Group table is added to Southbound DB. This table stores all the
multicast groups learned by ovn-controllers. The table is indexed by
datapath, group address and chassis. For a learned multicast group on a
specific datapath each ovn-controller will store its own row in this table.
Each row contains the list of chassis-local ports on which the group was
learned. Rows in the IGMP_Group table are updated or deleted only by the
ovn-controllers that created them.
A new action ("igmp") is added to punt IGMP packets on a specific logical
switch datapath to ovn-controller if IGMP snooping is enabled.
Per datapath IGMP multicast snooping support is added to pinctrl:
- incoming IGMP reports are processed and multicast groups are maintained
(using the OVS mcast-snooping library).
- each OVN controller syncs its in-memory IGMP groups to the Southbound DB
in the IGMP_Group table.
- pinctrl also sends periodic IGMPv3 general queries for all datapaths where
querier is enabled.
Signed-off-by: Mark Michelson <mmichels@redhat.com> Co-authored-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
This issue is seen after the commit [1], which makes use of the function -
sbrec_port_binding_update_nat_addresses_addvalue() to add a value to
Port_Binding.nat_addresses column.
Looks like the IDL client code is sending the transactions to the ovsdb-server repeatedly
to update the Port_Binding.nat_addresses even though the Southbound DB has updated
the column when this function is used. The actual bug seems to be in the IDL client code
and that needs to be fixed. This patch as a quick fix, fixes ovn-northd's continuous loop
by not using this function, instead making use of sbrec_port_binding_set_nat_addresses().
The below messages are seen continuously when the ovn-nortdh debug logs are enabled.
****
2019-07-12T17:26:13.837Z|74512|jsonrpc|DBG|unix:sb1.ovsdb: received reply,
result=[{},{"count":1},{"count":1}], id=18628
2019-07-12T17:26:13.837Z|74513|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74514|jsonrpc|DBG|unix:sb1.ovsdb: send request,
method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"row":
{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},{"mutations":[["nat_addresses",
"insert",["set",["00:00:20:20:12:13 172.168.0.100 is_chassis_resident(\"cr-lr0-public\")"]]]],
"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18629
2019-07-12T17:26:13.837Z|74516|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18629
2019-07-12T17:26:13.837Z|74517|poll_loop|DBG|wakeup due to 0-ms timeout at ../lib/ovsdb-idl.c:5397 (75% CPU usage)
2019-07-12T17:26:13.837Z|74518|jsonrpc|DBG|unix:sb1.ovsdb: send request,
method="transact", params=["OVN_Southbound",{"lock":"ovn_northd","op":"assert"},
{"where":[["_uuid","==",["uuid","56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],
"row":{"nat_addresses":["set",[]]},"op":"update","table":"Port_Binding"},
{"mutations":[["nat_addresses","insert",["set",["00:00:20:20:12:13 172.168.0.100
is_chassis_resident(\"cr-lr0-public\")"]]]],"where":[["_uuid","==",["uuid",
"56a9eb75-8d3b-4144-b4e7-1bb749645011"]]],"op":"mutate","table":"Port_Binding"}], id=18630
2019-07-12T17:26:13.837Z|74520|jsonrpc|DBG|unix:sb1.ovsdb: received reply, result=[{},{"count":1},{"count":1}], id=18630
******
The OpenStack CI tests for networking-ovn is frequently failing few tests after this
commit. The failure seems to be related to timing issues as ovn-northd is hogging
the CPU continuously. We are also seeing travis CI test failures after this commit.
Fixes: ed198fb3b92e("ovn: Send GARP for the router ports with reside-on-redirect-chassis options set") Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
ovs-macros: An option to suspend test execution on error
Origins for this patch are captured at
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048923.html.
Summarizing here, when a test fails, it would be good to pause test execution
and let the developer poke around the system to see current status of system.
As part of this patch, made a small tweaks to ovs-macros.at, so that when test
suite fails, ovs_on_exit() function will be called. And in this function, a check
is made to see if an environment variable to OVS_PAUSE_TEST is set. If it is
set, then test suite is paused and will continue to wait for user input
Ctrl-D. Meanwhile user can poke around the system to see why test case has
failed. Once done with investigation, user can press ctrl-d to cleanup the
test suite.
For example, to re-run test case 139:
export OVS_PAUSE_TEST=1
cd tests/system-userspace-testsuite.dir/139
sudo -E ./run
When error occurs, above command would display something like this:
=====================================================
Set environment variable to use various ovs utilities
export OVS_RUNDIR=/opt/vdasari/Developer/ovs/_build-gcc/tests/system-userspace-testsuite.dir/139
Press ENTER to continue:
=====================================================
And from another window, one can execute ovs-xxx commands like:
export OVS_RUNDIR=/opt/vdasari/Developer/ovs/_build-gcc/tests/system-userspace-testsuite.dir/139
$ ovs-ofctl dump-ports br0
.
.
To be able to pause while performing `make check`, one can do:
$ OVS_PAUSE_TEST=1 make check TESTSUITEFLAGS='-v'
Lorenzo Bianconi [Thu, 11 Jul 2019 15:48:45 +0000 (17:48 +0200)]
OVN: use trigger_event action to report 'empty_lb_rule' events
Add northd logical flows in order to reports that the controller
received an IP packet for LB rule witn no backends.
This configuration is used by OpenShift to spin up a idle POD
Signed-off-by: Mark Michelson <mmichels@redhat.com> Co-authored-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Lorenzo Bianconi [Thu, 11 Jul 2019 15:48:44 +0000 (17:48 +0200)]
OVN: introduce trigger_event() action
Add trigger_event() ovn action in order to allow ovs-vswitchd to report
CMS related events.
This commit introduces a new event, empty_lb_backends. This event is
raised if a received packet is destined for a load balancer VIP that has
no configured backend destinations. For this event, the event info
includes the load balancer VIP, the load balancer UUID, and the
transport protocol.
The use case for this particular event is for the CMS to supply backend
resources to handle this traffic. For example, in Openshift, this event
can be used to spin up new containers to handle the incoming traffic.
Signed-off-by: Mark Michelson <mmichels@redhat.com> Co-authored-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Lorenzo Bianconi [Thu, 11 Jul 2019 15:48:43 +0000 (17:48 +0200)]
OVN: introduce Controller_Event table
Add Controller_Event table to OVN SBDB in order to
report CMS related event.
Introduce event_table hashmap array and controller_event related
structures to ovn-controller in order to track pending events
forwarded by ovs-vswitchd. Moreover integrate event_table hashmap
array with event_table ovn-sbdb table
Signed-off-by: Mark Michelson <mmichels@redhat.com> Co-authored-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
would occur whenever the socket is closed. This just adds an
SSLStream.close() that calls shutdown() and ignores SSL errors, the
same way that lib/stream-ssl.c does in ssl_close().
Signed-off-by: Terry Wilson <twilson@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
OVS has defines for loops like 'BITMAP_FOR_EACH_1' or
'ULLONG_FOR_EACH_1', but the regexp in checkpatch doesn't match with
numbers and skips these loops while checking.
This patch adds numbers into regexp and adds some FER_EACH loops to
the unit tests.
After the commit [1], below test cases are failing repeatedly in travis CI.
2663: ovn -- 4 HV, 1 LS, 1 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8597)
2664: ovn -- 4 HV, 3 LS, 2 LR, packet test with HA distributed router gateway port FAILED (ovn.at:8844)
2667: ovn -- vlan traffic for external network with distributed router gateway port FAILED (ovn.at:9580)
2691: ovn -- router - check packet length - icmp defrag FAILED (ovn.at:13624)
With the commit [1], ovn-controller sends GARPs for the IPs of the distributed
router ports. The failing tests did not handle the situation if multiple GARPs
are sent. The failures are mostly timing related. This patch fixes these issues.
[1] - d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch")
Fixes: d65586b6fa97 ("ovn: Send GARP for router port IPs of a router port connected to bridged logical switch") CC: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Eli Britstein [Thu, 4 Jul 2019 07:36:42 +0000 (07:36 +0000)]
netdev-vport: Make ip6gre netdev type to use TC rules
The offload api functions already assigned to every tunnel class.
For ip6gre tunnel class only need to also assign the get_ifindex
function, similarly as done in commit 5e63eaa969a3 ("netdev-vport: Make
gre netdev type to use TC rules").
Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Han Zhou [Wed, 10 Jul 2019 04:23:10 +0000 (21:23 -0700)]
ovn-performance.at: Missing steps for connecting LS to LR.
The test creates 2 logical switches and connect them with a logical router.
However, it didn't set the option "router-port", so the 2 LS datapaths
were not connected. This results in missing test coverage for port-binding
incremental processing: assume I-P has a bug and port-binding change always
trigger recompute, since each HV monitors only its own datapath (i.e.
HV1 -> ls1, HV2 -> ls2) then it never got notification of the other
port-binding change, thus recompute is never triggered when port-binding
is updated on the other datapath. With this fix, each HV's local datapaths
will include both ls1 and ls2, so port-binding change notification will
be received properly and unexpected recompute would be captured.
Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Daniel Alvarez [Tue, 9 Jul 2019 10:16:30 +0000 (12:16 +0200)]
ovsdb-server: drop all connections on read/write status change
Prior to this patch, only db change aware connections were dropped
on a read/write status change. However, current schema in OVN does
not allow clients to monitor whether a particular DB changes this
status. In order to accomplish this, we'd need to change the schema
and adapting ovsdb-server and existing clients.
Before tackling that, this patch is changing ovsdb-server to drop
*all* the existing connections upon a read/write status change. This
will force clients to reconnect and honor the change.
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-July/048981.html Signed-off-by: Daniel Alvarez <dalvarez@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
net: openvswitch: fix csum updates for MPLS actions
Skbs may have their checksum value populated by HW. If this is a checksum
calculated over the entire packet then the CHECKSUM_COMPLETE field is
marked. Changes to the data pointer on the skb throughout the network
stack still try to maintain this complete csum value if it is required
through functions such as skb_postpush_rcsum.
The MPLS actions in Open vSwitch modify a CHECKSUM_COMPLETE value when
changes are made to packet data without a push or a pull. This occurs when
the ethertype of the MAC header is changed or when MPLS lse fields are
modified.
The modification is carried out using the csum_partial function to get the
csum of a buffer and add it into the larger checksum. The buffer is an
inversion of the data to be removed followed by the new data. Because the
csum is calculated over 16 bits and these values align with 16 bits, the
effect is the removal of the old value from the CHECKSUM_COMPLETE and
addition of the new value.
However, the csum fed into the function and the outcome of the
calculation are also inverted. This would only make sense if it was the
new value rather than the old that was inverted in the input buffer.
Fix the issue by removing the bit inverts in the csum_partial calculation.
The bug was verified and the fix tested by comparing the folded value of
the updated CHECKSUM_COMPLETE value with the folded value of a full
software checksum calculation (reset skb->csum to 0 and run
skb_checksum_complete(skb)). Prior to the fix the outcomes differed but
after they produce the same result.
Fixes: 25cd9ba0abc0 ("openvswitch: Add basic MPLS support to kernel") Fixes: bc7cc5999fd3 ("openvswitch: update checksum in {push,pop}_mpls") Signed-off-by: John Hurley <john.hurley@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Fixes: ccf4378615e9 ("datapath: Add basic MPLS support to kernel") Fixes: b51367aad315 ("datapath: update checksum in {push,pop}_mpls") Cc: John Hurley <john.hurley@netronome.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
net: ip6_gre: fix possible use-after-free in ip6erspan_rcv
erspan_v6 tunnels run __iptunnel_pull_header on received skbs to remove
erspan header. This can determine a possible use-after-free accessing
pkt_md pointer in ip6erspan_rcv since the packet will be 'uncloned'
running pskb_expand_head if it is a cloned gso skb (e.g if the packet has
been sent though a veth device). Fix it resetting pkt_md pointer after
__iptunnel_pull_header
Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Fixes: c387d8177f20 ("compat: Add ipv6 GRE and IPV6 Tunneling") Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
David Marchand [Tue, 9 Jul 2019 16:19:58 +0000 (18:19 +0200)]
dpif-netdev: Catch reloads faster.
Looking at the reload flag only every 1024 loops can be a long time
under load, since we might be handling 32 packets per rxq, per iteration,
which means up to poll_cnt * 32 * 1024 packets.
Look at the flag every loop, no major performance impact seen.
Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
David Marchand [Tue, 9 Jul 2019 16:19:57 +0000 (18:19 +0200)]
dpif-netdev: Only reload static tx qid when needed.
pmd->static_tx_qid is allocated under a mutex by the different pmd
threads.
Unconditionally reallocating it will make those pmd threads sleep
when contention occurs.
During "normal" reloads like for rebalancing queues between pmd threads,
this can make pmd threads waste time on this.
Reallocating the tx qid is only needed when removing other pmd threads
as it is the only situation when the qid pool can become uncontiguous.
Add a flag to instruct the pmd to reload tx qid for this case which is
Step 1 in current code.
Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
David Marchand [Tue, 9 Jul 2019 16:19:56 +0000 (18:19 +0200)]
dpif-netdev: Do not sleep when swapping queues.
When swapping queues from a pmd thread to another (q0 polled by pmd0/q1
polled by pmd1 -> q1 polled by pmd0/q0 polled by pmd1), the current
"Step 5" puts both pmds to sleep waiting for the control thread to wake
them up later.
Prefer to make them spin in such a case to avoid sleeping an
undeterministic amount of time.
Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
David Marchand [Tue, 9 Jul 2019 16:19:55 +0000 (18:19 +0200)]
dpif-netdev: Trigger parallel pmd reloads.
pmd reloads are currently serialised in each steps calling
reload_affected_pmds.
Any pmd processing packets, waiting on a mutex etc... will make other
pmd threads wait for a delay that can be undeterministic when syscalls
adds up.
Switch to a little busy loop on the control thread using the existing
per-pmd reload boolean.
The memory order on this atomic is rel-acq to have an explicit
synchronisation between the pmd threads and the control thread.
Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
David Marchand [Tue, 9 Jul 2019 16:19:54 +0000 (18:19 +0200)]
dpif-netdev: Convert exit latch to flag.
No need for a latch here since we don't have to wait.
A simple boolean flag is enough.
The memory order on the reload flag is changed to rel-acq ordering to
serve as a synchronisation point between the pmd threads and the control
thread that asks for termination.
Fixes: e4cfed38b159 ("dpif-netdev: Add poll-mode-device thread.") Signed-off-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Ben Pfaff [Fri, 10 May 2019 22:02:43 +0000 (15:02 -0700)]
dist-docs: Fix bugs in text to HTML conversion.
This fixes two bugs. First, & has a special meaning in the replacement
text for a sed "s" command, so this escapes it. Second, this code
misprocessed bold or underlined &<>: >^H> would become >^H> which
would display as >> in most browers.
Finally, this improves the HTML output so that bold ABC becomes <b>ABC</b>
instead of <b>A</b><b>B</b><b>C</b>.
Reported-by: Nicolas Bouliane <nbouliane@digitalocean.com>
Reported-at: https://twitter.com/nicboul/status/1126959264772259842 Signed-off-by: Ben Pfaff <blp@ovn.org>
OVN: run local logical flows first in S_ROUTER_OUT_SNAT table
Run local logical flows first if the gw router port is scheduled
on the local chassis in order to properly manage snat traffic
Tested-by: Eran Kuris <ekuris@redhat.com> Acked-by: Numan Siddique <nusiddiq@redhat.com> Acked-by: Mark Michelson <mmichels@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
rhel: Fixed a bug for checking the correct major version and revision.
Fixed a bug where checking for major version 3.10 and major revision not
equal to 327 or 693 or 957 should have gone to the default else at the end.
In the current code, the default else condition will not get executed
for kernel with major version 3.10 and major revision not equal
to 327/693/957 resulting in failure to load the kernel module.
Fixes: 402efbe4e176 ("rhel: Add 4.12 kernel support in ovs-kmod-manage.sh") Signed-off-by: Ashish Varma <ashishvarma.ovs@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Dumitru Ceara [Mon, 8 Jul 2019 10:07:26 +0000 (12:07 +0200)]
ovn-controller: Update stale chassis entry at init
The first time ovn-controller initializes the Chassis entry (shortly
after start up) we first look if there is a stale Chassis record in the
OVN_Southbound DB by checking if any of the old Encap entries associated
to the Chassis record match the new tunnel configuration. If found it
means that ovn-controller didn't shutdown gracefully last time it was
run so it didn't cleanup the Chassis table. Potentially in the meantime
the OVS system-id was also changed. We then update the stale entry with
the new configuration and store the last configured chassis-id in memory
to avoid walking the Chassis table every time.
Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Dumitru Ceara [Mon, 8 Jul 2019 10:07:12 +0000 (12:07 +0200)]
ovn-controller: Refactor chassis.c to abstract the string parsing
Abstract out the chassis config string processing and use library data
structures (e.g., sset).
Rename the get_chassis_id function in ovn-controller.c to
get_ovs_chassis_id to avoid confusion with the newly added
chassis_get_id function from chassis.c which returns the last
successfully configured chassis-id.
Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Dumitru Ceara [Mon, 8 Jul 2019 10:07:00 +0000 (12:07 +0200)]
ovn-controller: Fix chassis ovn-sbdb record init
The chassis_run code didn't take into account the scenario when the
system-id was changed in the Open_vSwitch table. Due to this the code
was trying to insert a new Chassis record in the OVN_Southbound DB with
the same Encaps as the previous Chassis record. The transaction used
to insert the new records was aborting due to the ["type", "ip"]
index constraint violation as we were creating new Encap entries with
the same "type" and "ip" as the old ones.
In order to fix this issue the flow is now:
1. the first time ovn-controller initializes the Chassis (shortly after
start up) we store the chassis-id.
2. for subsequent chassis_run calls we use last configured
chassis-id stored at the previous step to lookup the old Chassis record.
3. when ovn-controller shuts down gracefully we lookup the Chassis
record based on the chassis-id stored in memory at steps 1 and 2 above.
This is to avoid failing to cleanup the Chassis record in OVN_Southbound
DB if the OVS system-id changes between the last call to chassis_run and
chassis_cleanup.
Reported-at: https://bugzilla.redhat.com/1708146 Reported-by: Haidong Li <haili@redhat.com> Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Kevin Traynor [Tue, 2 Jul 2019 00:32:30 +0000 (01:32 +0100)]
netdev-dpdk: Enable tx-retries-max config.
vhost tx retries can provide some mitigation against
dropped packets due to a temporarily slow guest/limited queue
size for an interface, but on the other hand when a system
is fully loaded those extra cycles retrying could mean
packets are dropped elsewhere.
Up to now max vhost tx retries have been hardcoded, which meant
no tuning and no way to disable for debugging to see if extra
cycles spent retrying resulted in rx drops on some other
interface.
Add an option to change the max retries, with a value of
0 effectively disabling vhost tx retries.
Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Kevin Traynor [Tue, 2 Jul 2019 00:32:28 +0000 (01:32 +0100)]
doc: Move vhost tx retry info to separate section.
vhost tx retry is applicable to vhost-user and vhost-user-client,
but was in the section that compares them. Also, moved further
down the doc as prefer to have more fundamental info about vhost
nearer the top.
Fixes: 6d6513bfc657 ("doc: Add info on vhost tx retries.") Reported-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
The current documentation states that "all packets entering OVS for
the first time are "untracked"". However there is a minor exception
to this in the case where a packet (re)enters the same datapath and
the namespace has not changed. In that case there is no need to
scrub the packet and in this case the connection may already be
in the "tracked" state.
Reported-by: Quan Tian <qtian@vmware.com> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Ben Pfaff [Tue, 11 Jun 2019 16:55:16 +0000 (09:55 -0700)]
rconn: Increase precision of timers.
Until now, the rconn timers have been precise only to the nearest second.
This increases them to millisecond precision, which seems cleaner these
days.
Running the command "ovn-nbctl set logical_switch_port foo external_ids:foo=bar"
results in the incremetal processing engine to recompute the flows on the
chassis where the logical port 'foo' is claimed.
This patch avoids this unnecessary recomputation by omitting the tracking of
external_ids column of all the Southbound DB tables except DNS, Chassis
and Datapath_Binding tables. ovn-controller is refering to the external_ids
column of these tables.
Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Key difference between an overlay logical switch and
vlan backed logical switch is that for vlan logical switches
packets are not encapsulated.
Hence, if a distributed router port is connected to vlan backed
logical switch, then router port mac as source mac could be
seen from multiple hypervisors. Same <mac,vlan> pairs coming
from multiple ports from a top of the rack switch (TOR) perspective
could be seen as a security threat and it could send alarms, drop
the packets or block the ports etc.
This patch addresses the same by introducing the concept of chassis mac.
A chassis mac is CMS provisioned unique mac per chassis. For any routed packet
(i.e source mac is router port mac) going on the wire on a vlan type
logical switch, we will replace its source mac with chassis mac.
This replacing of source mac with chassis mac will happen in table=65
of the logical switch datapath. A flow is added at priority 150, which
matches the source mac and replaces it with chassis mac if the value
is a router port mac.
ovn-controller: Provide the option to configure inactivity probe interval for OpenFlow conn
If the ovn-controller main loop takes more than 5 seconds (if there are lots of logical
flows) before it calls poll_block(), it causes the poll_block to wake up immediately,
since rconn module has to send echo request. With the incremental processing, this is
not an issue as ovn-controller will not recompute again. But for older versions, this
is an issue as it causes flow recomputations and this would result in 100% cpu all the
time.
With this patch, CMS can configure a higher value depending the workload.
The main intention of this patch is to fix this recompuation issue for older versions
(there by requesting backport), it still would be beneficial with the
incremental processing engine.
Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Tested-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>