David Marchand [Mon, 13 Jul 2020 08:06:21 +0000 (10:06 +0200)]
dpdk: Add commands to configure log levels.
Enabling debug logs in dpdk can be a challenge to be sure of what is
actually enabled, add commands to list and change those log levels.
However, these commands do not help when tracking issues in dpdk init
itself: dump log levels right after init.
Example:
$ ovs-appctl dpdk/log-list
global log level is debug
id 0: lib.eal, level is info
id 1: lib.malloc, level is info
id 2: lib.ring, level is info
id 3: lib.mempool, level is info
id 4: lib.timer, level is info
id 5: pmd, level is info
[...]
id 37: pmd.net.bnxt.driver, level is notice
id 38: pmd.net.e1000.init, level is notice
id 39: pmd.net.e1000.driver, level is notice
id 40: pmd.net.enic, level is info
[...]
$ ovs-appctl dpdk/log-set debug pmd.*:notice
$ ovs-appctl dpdk/log-list
global log level is debug
id 0: lib.eal, level is debug
id 1: lib.malloc, level is debug
id 2: lib.ring, level is debug
id 3: lib.mempool, level is debug
id 4: lib.timer, level is debug
id 5: pmd, level is debug
[...]
id 37: pmd.net.bnxt.driver, level is notice
id 38: pmd.net.e1000.init, level is notice
id 39: pmd.net.e1000.driver, level is notice
id 40: pmd.net.enic, level is notice
[...]
Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Roi Dayan [Wed, 15 Jul 2020 12:40:49 +0000 (15:40 +0300)]
rhel: openvswitch-fedora.spec.in: Fix installed but not packaged.
With the cited commit, we get an error from rpmbuild about installed
but not packaged /usr/lib64/libopenvswitchavx512.a.
Fix it by treating it as the other la files.
Fixes: 352b6c7116cd ("dpif-lookup: add avx512 gather implementation.") Signed-off-by: Roi Dayan <roid@mellanox.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Jeff Squyres [Thu, 9 Jul 2020 23:57:47 +0000 (16:57 -0700)]
bond: Add 'primary' interface concept for active-backup mode.
In AB bonding, if the current active slave becomes disabled, a
replacement slave is arbitrarily picked from the remaining set of
enabled slaves. This commit adds the concept of a "primary" slave: an
interface that will always be (or become) the current active slave if
it is enabled.
The rationale for this functionality is to allow the designation of a
preferred interface for a given bond. For example:
1. Bond is created with interfaces p1 (primary) and p2, both enabled.
2. p1 becomes the current active slave (because it was designated as
the primary).
3. Later, p1 fails/becomes disabled.
4. p2 is chosen to become the current active slave.
5. Later, p1 becomes re-enabled.
6. p1 is chosen to become the current active slave (because it was
designated as the primary)
Note that p1 becomes the active slave once it becomes re-enabled, even
if nothing has happened to p2.
This "primary" concept exists in Linux kernel network interface
bonding, but did not previously exist in OVS bonding.
Only one primary slave interface is supported per bond, and is only
supported for active/backup bonding.
The primary slave interface is designated via
"other_config:bond-primary" when creating a bond.
Also, while adding tests for the "primary" concept, make a few small
improvements to the non-primary AB bonding test.
Signed-off-by: Jeff Squyres <jsquyres@cisco.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Tested-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Greg Rose <gvrose8192@gmail.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
dpif-netdev: Avoid deadlock with offloading during PMD thread deletion.
Main thread will try to pause/stop all revalidators during datapath
reconfiguration via datapath purge callback (dp_purge_cb) while
holding 'dp->port_mutex'. And deadlock happens in case any of
revalidator threads is already waiting on 'dp->port_mutex' while
dumping offloaded flows:
main thread revalidator
--------------------------------- ----------------------------------
reconfigure_datapath()
-> reconfigure_pmd_threads()
-> dp_netdev_del_pmd()
-> dp_purge_cb()
-> udpif_pause_revalidators()
-> ovs_barrier_block(&udpif->pause_barrier)
<waiting for revalidators to reach barrier>
<DEADLOCK>
We're not allowed to call offloading API without holding global
port mutex from the userspace datapath due to thread safety
restrictions on netdev-offload-dpdk module. And it's also not easy
to rework datapath reconfiguration process in order to move actual
PMD removal and datapath purge out of the port mutex.
So, for now, not sleeping on a mutex if it's not immediately available
seem like an easiest workaround. This will have impact on flow
statistics update rate and on ability to get the latest statistics
before removing the flow (latest stats will be lost in case we were
not able to take the mutex). However, this will allow us to operate
normally avoiding the deadlock.
The last bit is that to avoid flapping of flow attributes and
statistics we're not failing the operation, but returning last
statistics and attributes returned by offload provider. Since those
might be updated in different threads, stores and reads are atomic.
Reported-by: Frank Wang (王培辉) <wangpeihui@inspur.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2020-June/371753.html Fixes: a309e4f52660 ("dpif-netdev: Update offloaded flows statistics.") Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Tested-by: Eli Britstein <elibr@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
William Tu [Mon, 13 Jul 2020 20:34:32 +0000 (13:34 -0700)]
ovs-bugtool: Fix Python3 bytes str issue.
The patch fixes two errors due to type mismatched, when converting
between str and bytes:
File "/usr/local/sbin/ovs-bugtool", line 649, in main
cmd_output(CAP_NETWORK_STATUS, [OVS_DPCTL, 'dump-flows', '-m', d])
File "/usr/local/sbin/ovs-bugtool", line 278, in cmd_output
label = ' '.join(a)
TypeError: sequence item 3: expected str instance, bytes found
And
File "/usr/sbin/ovs-bugtool", line 721, in main
collect_data()
File "/usr/sbin/ovs-bugtool", line 366, in collect_data
run_procs(process_lists.values())
File "/usr/sbin/ovs-bugtool", line 1354, in run_procs
p.inst.write("\n** timeout **\n")
File "/usr/sbin/ovs-bugtool", line 1403, in write
BytesIO.write(self, s)
TypeError: a bytes-like object is required, not 'str'
VMware-BZ: #2602135
Fixed: 9e6c00bca9af ("bugtool: Fix for Python3.") Acked-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
ofproto: report coverage on hitting datapath flow limit
Whenever the number of flows in the datapath crosses above
the flow limit set/autoconfigured, it is helpful to report
this event through coverage counter for an operator/devops
engineer to know and take proactive corrections in the
switch configuration.
Today, these events are reported in ovs vswitch log when
a new flow can not be inserted in upcall processing in which
case ovs writes a warning, otherwise an auto correction
made by ovs to flush old flows without any intimation at all.
Signed-off-by: Gowrishankar Muthukrishnan <gmuthukr@redhat.com> Signed-off-by: William Tu <u9012063@gmail.com>
Ian Stokes [Thu, 2 Jul 2020 15:09:27 +0000 (16:09 +0100)]
dpdk: Use DPDK 19.11.2 release.
Modify travis linux build script to use DPDK 19.11.2 stable release and
update docs to reference 19.11.2 stable release. Update release faq to
reflect latest validated DPDK versions for all branches.
Signed-off-by: Ian Stokes <ian.stokes@intel.com> Acked-by: Kevin Traynor <ktraynor@redhat.com>
This commit adds a section to the dpdk/bridge.rst netdev documentation,
detailing the added DPCLS functionality. The newly added commands are
documented, and sample output is provided.
Running the DPCLS autovalidator with unit tests by default is possible
through re-compiling the autovalidator to have the highest priority at
startup time. This avoids making changes to all tests, and enables
debug and CI builds to validate every lookup implementation with all
unit tests.
Add NEWS updates for CPU ISA, dynamic subtables, and AVX512 lookup.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Harry van Haaren [Mon, 13 Jul 2020 12:42:14 +0000 (13:42 +0100)]
dpif-lookup: add avx512 gather implementation.
This commit adds an AVX-512 dpcls lookup implementation.
It uses the AVX-512 SIMD ISA to perform multiple miniflow
operations in parallel.
To run this implementation, the "avx512f" and "bmi2" ISAs are
required. These ISA checks are performed at runtime while
probing the subtable implementation. If a CPU does not provide
both "avx512f" and "bmi2", then this code does not execute.
The avx512 code is built as a separate static library, with added
CFLAGS to enable the required ISA features. By building only this
static library with avx512 enabled, it is ensured that the main OVS
core library is *not* using avx512, and that OVS continues to run
as before on CPUs that do not support avx512.
The approach taken in this implementation is to use the
gather instruction to access the packet miniflow, allowing
any miniflow blocks to be loaded into an AVX-512 register.
This maximizes the usefulness of the register, and hence this
implementation handles any subtable with up to miniflow 8 bits.
Note that specialization of these avx512 lookup routines
still provides performance value, as the hashing of the
resulting data is performed in scalar code, and compile-time
loop unrolling occurs when specialized to miniflow bits.
This commit checks at configure time if the assembling in use
has a known bug in assembling AVX512 code. If this bug is present,
all AVX512 code is disabled. Checking the version string of the binutils
or assembler is not a good method to detect the issue, as back ported
fixes would not be reflected.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Harry van Haaren [Mon, 13 Jul 2020 12:42:13 +0000 (13:42 +0100)]
dpdk: enable CPU feature detection.
This commit implements a method to retrieve the CPU ISA capabilities.
These ISA capabilities can be used in OVS to at runtime select a function
implementation to make the best use of the available ISA on the CPU.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit adds a new command, "dpif-netdev/subtable-lookup-prio-get"
which prints the available subtable lookup functions in this OVS binary.
Example output from the command:
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Harry van Haaren [Mon, 13 Jul 2020 12:42:11 +0000 (13:42 +0100)]
dpif-netdev: add subtable lookup prio set command.
This commit adds a command for the dpif-netdev to set a specific
lookup function to a particular priority level. The command enables
runtime switching of the dpcls subtable lookup implementation.
Selection is performed based on a priority. Higher priorities take
precedence, e.g. priority 5 will be selected instead of a priority 3.
If lookup functions have the same priority, the first one in the list
is selected.
The two options available are 'autovalidator' and 'generic'.
The below command will set a new priority for the given function:
$ ovs-appctl dpif-netdev/subtable-lookup-prio-set generic 2
The autovalidator implementation can be selected at runtime now:
$ ovs-appctl dpif-netdev/subtable-lookup-prio-set autovalidator 5
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
This commit refactors the existing dpif subtable function pointer
infrastructure, and implements an autovalidator component.
The refactoring of the existing dpcls subtable lookup function
handling, making it more generic, and cleaning up how to enable
more implementations in future.
In order to ensure all implementations provide identical results,
the autovalidator is added. The autovalidator itself implements
the subtable lookup function prototype, but internally iterates
over all other available implementations. The end result is that
testing of each implementation becomes automatic, when the auto-
validator implementation is selected.
Signed-off-by: Harry van Haaren <harry.van.haaren@intel.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: Ian Stokes <ian.stokes@intel.com>
Tonghao Zhang [Tue, 9 Jun 2020 00:53:41 +0000 (08:53 +0800)]
dpif-netdev: Return error code when no mark available.
The max number of mark is (UINT32_MAX - 1), that is
enough to be used. But theoretically, if there are no
mark available, the later different flows will shared
the mark INVALID_FLOW_MARK, that may break the function.
If there are no available mark to be used, return error
code.
Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread") Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Tonghao Zhang [Tue, 9 Jun 2020 00:53:40 +0000 (08:53 +0800)]
dpif-netdev: Add check mark to avoid ovs-vswitchd crash.
When changing the pmd interfaces attribute, ovs-vswitchd will
reload pmd and flush offload flows. reload_affected_pmds may
be invoked twice or more. In that case, the flows may been
queued to "dp_netdev_flow_offload" thread again.
For example:
$ ovs-vsctl -- set interface <Interface> options:dpdk-lsc-interrupt=true
ovs-vswitchd main flow-offload thread
append F to queue ...
...
append F to queue
... del F
... del F (crash [1])
Patch 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") uses
__virtio16 which is defined in kernel 3.19. Ubuntu 14.04 is using 3.13
kernel that lacks the virtio_types definition. This patch fixes that.
Fixes: 29cf9c1b3b9c ("userspace: Add TCP Segmentation Offload support") Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
Eli Britstein [Wed, 8 Jul 2020 06:38:31 +0000 (06:38 +0000)]
netdev-offload-dpdk: Support tnl/push using vxlan encap attribute.
For DPDK, there is the RAW_ENCAP attribute which gets raw buffer of the
encapsulation header. For specific protocol, such as vxlan, there is a
more specific attribute, VXLAN_ENCAP, which gets the parsed fields of
the outer header. In case tunnel type is vxlan, parse the header
and use the specific attribute, with fallback to RAW_ENCAP.
Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roni Bar Yanai <roniba@mellanox.com> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Eli Britstein [Wed, 8 Jul 2020 06:38:30 +0000 (06:38 +0000)]
netdev-offload-dpdk: Support offload of clone tnl_push/output actions.
Tunnel encapsulation is done by tnl_push and output actions nested in a
clone action. Support offloading of such flows with
RTE_FLOW_ACTION_TYPE_RAW_ENCAP attribute.
Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There is no real difference between the 'class' and 'type' in the
context of common lookup operations inside netdev-offload module
because it only checks the value of pointers without using the
value itself. However, 'type' has some meaning and can be used by
offload provides on the initialization phase to check if this type
of Flow API in pair with the netdev type could be used in particular
datapath type. For example, this is needed to check if Linux flow
API could be used for current tunneling vport because it could be
used only if tunneling vport belongs to system datapath, i.e. has
backing linux interface.
This is needed to unblock tunneling offloads in userspace datapath
with DPDK flow API.
Acked-by: Eli Britstein <elibr@mellanox.com> Acked-by: Roni Bar Yanai <roniba@mellanox.com> Acked-by: Ophir Munk <ophirmu@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
netdev: Allow storing dpif type into netdev structure.
Storing of the dpif type of the owning datapath interface will allow
us to easily distinguish, for example, userspace tunneling ports from
the system ones. This is required in terms of HW offloading to avoid
offloading of userspace flows to kernel interfaces that doesn't belong
to userspace datapath, but have same dpif_port names.
Acked-by: Eli Britstein <elibr@mellanox.com> Acked-by: Roni Bar Yanai <roniba@mellanox.com> Acked-by: Ophir Munk <ophirmu@mellanox.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Eli Britstein [Wed, 8 Jul 2020 06:38:25 +0000 (06:38 +0000)]
netdev-offload-dpdk: Remove pre-validate of patterns function.
The function of adding patterns by requested matches checks that it
consumed all the required matches, and err if not. For functional
purpose there is no need for pre-validation. For performance such
validation may decrease the time spent for failing flows, but at the
expense of increasing the time spent for the good flows, and code
complexity. Remove the pre-validation function.
Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roni Bar Yanai <roniba@mellanox.com> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Eli Britstein [Wed, 8 Jul 2020 06:38:22 +0000 (06:38 +0000)]
dpif-netdev: Don't use zero flow mark.
Zero flow mark is used to indicate the HW to remove the mark. A packet
marked with zero mark is received in SW without a mark at all, so it
cannot be used as a valid mark. Change the pool range to fix it.
Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id") Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roni Bar Yanai <roniba@mellanox.com> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
William Tu [Tue, 7 Jul 2020 03:11:05 +0000 (20:11 -0700)]
netdev-offload-tc: Add drop action support.
Currently drop action is not offloaded when using userspace datapath
with tc offload. The patch programs tc gact (generic action) chain
ID 0 to drop the packet by setting it to TC_ACT_SHOT.
Example:
$ ovs-appctl dpctl/add-flow netdev@ovs-netdev \
'recirc_id(0),in_port(2),eth(),eth_type(0x0806),\
arp(op=2,tha=00:50:56:e1:4b:ab,tip=10.255.1.116)' drop
Or no action also infers drop
$ ovs-appctl dpctl/add-flow netdev@ovs-netdev \
'recirc_id(0),in_port(2),eth(),eth_type(0x0806),\
arp(op=2,tha=00:50:56:e1:4b:ab,tip=10.255.1.116)' ''
Signed-off-by: William Tu <u9012063@gmail.com> Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
Ben Pfaff [Sat, 21 Mar 2020 22:17:27 +0000 (15:17 -0700)]
python: Fix plural forms of OVSDB types.
Fixes two problems. First, the plural of chassis is also chassis.
Second, for linguistic analysis we need to consider plain words, not
words that have (e.g.) \fB and \fR pasted into them for nroff output.
This makes the OVN manpage for ovn-sb(5) talk about "set of Chassis"
not "set of Chassiss".
Acked-by: Numan Siddique <numans@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
Nitin Katiyar [Thu, 22 Aug 2019 16:53:30 +0000 (22:23 +0530)]
dpif-netdev: Do RCU synchronization at fixed interval in PMD main loop.
Each PMD updates the global sequence number for RCU synchronization
purpose with other OVS threads. This is done at every 1025th iteration
in PMD main loop.
If the PMD thread is responsible for polling large number of queues
that are carrying traffic, it spends a lot of time processing packets
and this results in significant delay in performing the housekeeping
activities.
If the OVS main thread is waiting to synchronize with the PMD threads
and if those threads delay performing housekeeping activities for
more than 3 sec then LACP processing will be impacted and it will lead
to LACP flaps. Similarly, other controls protocols run by OVS main
thread are impacted.
For e.g. a PMD thread polling 200 ports/queues with average of 1600
processing cycles per packet with batch size of 32 may take 10240000
(200 * 1600 * 32) cycles per iteration. In system with 2.0 GHz CPU
it means more than 5 ms per iteration. So, for 1024 iterations to
complete it would be more than 5 seconds.
This gets worse when there are PMD threads which are less loaded.
It reduces possibility of getting mutex lock in ovsrcu_try_quiesce()
by heavily loaded PMD and next attempt to quiesce would be after 1024
iterations.
With this patch, PMD RCU synchronization will be performed after fixed
interval instead after a fixed number of iterations. This will ensure
that even if the packet processing load is high the RCU synchronization
will not be delayed long.
Co-authored-by: Anju Thomas <anju.thomas@ericsson.com> Signed-off-by: Anju Thomas <anju.thomas@ericsson.com> Signed-off-by: Nitin Katiyar <nitin.katiyar@ericsson.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Dumitru Ceara [Thu, 2 Jul 2020 14:20:57 +0000 (16:20 +0200)]
ovsdb-idl: Force IDL retry when missing updates encountered.
Adds a generic recovery mechanism which triggers an IDL retry with fast
resync disabled in case the IDL has detected that it ended up in an
inconsistent state due to other bugs in the ovsdb-server/ovsdb-idl
implementation.
Additionally, this commit also:
- bumps IDL semantic error logs to level ERR to make them more
visible.
- triggers an IDL retry in cases when the IDL client used to try to
recover (i.e., trying to add an existing row, trying to remove a non
existent row).
Fixes: db2b5757328c ("lib: add monitor2 support in ovsdb-idl.") Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Tonghao Zhang [Sun, 15 Mar 2020 21:56:03 +0000 (05:56 +0800)]
dpif-netdev: Delete the artificial flow limit.
The MAX_FLOWS constant was there from the introduction of dpif-netdev,
however, later new flow-limit mechanism was implemented that
controls number of datapath flows in a dynamic way on ofproto level.
So, we can just remove the limit and fully rely on ofproto to decide
what flow limit we need. There are no limitations for flow table size
in dpif-netdev beside the artificial one.
'other_config:flow-limit' seems suitable to control this.
Peng He [Sun, 5 Jul 2020 13:07:23 +0000 (21:07 +0800)]
conntrack-tp: fix lock order in conn_update_expiration
*conn_update_expiration* violates the lock order of conn->lock and
ct->lock. In the comments of conntrack, the conn->lock should be
held after ct->lock when ct->lock needs to be taken.
Fixes: 2078901a4c142 ("userspace: Add conntrack timeout policy support.") Signed-off-by: Peng He <hepeng.0320@bytedance.com> Signed-off-by: William Tu <u9012063@gmail.com>
Flavio Leitner [Wed, 10 Jun 2020 19:49:45 +0000 (16:49 -0300)]
ctags: Include new annotations to ctags ignore list.
The annotation OVS_NO_THREAD_SAFETY_ANALYSIS and OVS_LOCKABLE are
not part of the list, so ctags can't find functions using them.
The annotation list comes from a regex and to include more items
make the regex more difficult to read and maintain. Convert to a
static list because it isn't supposed to change much and there
is no standard names.
Also add a comment to remind to keep the list up-to-date.
Signed-off-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: William Tu <u9012063@gmail.com>
wenxu [Mon, 29 Jun 2020 09:31:18 +0000 (17:31 +0800)]
lib/tc: only update the stats for non-empty counter
A packet with first frag and execute act_ct action.
The packet will stole by defrag. So the stats counter
for "gact action goto chain" will always 0. The openvswitch
update each action in order. So the flower stats finally
alway be zero. The rule will be delete adter max-idle time
even there are packet executing the action.
Ben Pfaff [Fri, 26 Jun 2020 19:46:10 +0000 (12:46 -0700)]
jsonrpc: Don't assert for 0 remotes in jsonrpc_session_open_multiple().
It's pretty easy to get 0 remotes here from ovn-northd if you specify
--ovnnb-db='' or --ovnnb-db=' ' on the command line. The internals
of jsonrpc_session aren't equipped to cope with that, so just add a
dummy remote instead.
Acked-by: Numan Siddique <numans@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
Rui Cao [Tue, 23 Jun 2020 06:46:22 +0000 (06:46 +0000)]
datapath-windows, conntrack: Fix conntrack new state
On windows, if we send a connection setup packet in one direction
twice, it will make the connection to be in established state. The
same issue happened in Linux userspace conntrack module and has
been fixed.
Yi-Hung Wei [Fri, 26 Jun 2020 18:21:06 +0000 (11:21 -0700)]
bridge: Fix null dereference on ct_timeout_policy record
Accoridng to vswitch.ovsschema, each CT_Zone record may have
zero or one associcated CT_Timeout_policy. Thus, this patch
checks if ovsrec_ct_timeout_policy exist before accesses the
record.
VMWare-BZ: 2585825 Fixes: 45339539f69d ("ovs-vsctl: Add conntrack zone commands.") Fixes: 993cae678bca ("ofproto-dpif: Consume CT_Zone, and CT_Timeout_Policy tables") Reported-by: Yang Song <yangsong@vmware.com> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
Matteo Croce [Mon, 25 May 2020 18:05:15 +0000 (20:05 +0200)]
ofproto-dpif.at: Add unit test for lb_output action.
Extend the balance-tcp one so it tests lb-output action too.
The test checks that that the option is shown in bond/show,
and that the lb_output action is programmed in the datapath.
userspace: Avoid dp_hash recirculation for balance-tcp bond mode.
Problem:
In OVS, flows with output over a bond interface of type “balance-tcp”
gets translated by the ofproto layer into "HASH" and "RECIRC" datapath
actions. After recirculation, the packet is forwarded to the bond
member port based on 8-bits of the datapath hash value computed through
dp_hash. This causes performance degradation in the following ways:
1. The recirculation of the packet implies another lookup of the
packet’s flow key in the exact match cache (EMC) and potentially
Megaflow classifier (DPCLS). This is the biggest cost factor.
2. The recirculated packets have a new “RSS” hash and compete with the
original packets for the scarce number of EMC slots. This implies more
EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
3. The 256 extra megaflow entries per bond for dp_hash bond selection
put additional load on the revalidation threads.
Owing to this performance degradation, deployments stick to “balance-slb”
bond mode even though it does not do active-active load balancing for
VXLAN- and GRE-tunnelled traffic because all tunnel packet have the
same source MAC address.
Proposed optimization:
This proposal introduces a new load-balancing output action instead of
recirculation.
Maintain one table per-bond (could just be an array of uint16's) and
program it the same way internal flows are created today for each
possible hash value (256 entries) from ofproto layer. Use this table to
load-balance flows as part of output action processing.
Currently xlate_normal() -> output_normal() ->
bond_update_post_recirc_rules() -> bond_may_recirc() and
compose_output_action__() generate 'dp_hash(hash_l4(0))' and
'recirc(<RecircID>)' actions. In this case the RecircID identifies the
bond. For the recirculated packets the ofproto layer installs megaflow
entries that match on RecircID and masked dp_hash and send them to the
corresponding output port.
Instead, we will now generate action as
'lb_output(<bond id>)'
This combines hash computation (only if needed, else re-use RSS hash)
and inline load-balancing over the bond. This action is used *only* for
balance-tcp bonds in userspace datapath (the OVS kernel datapath
remains unchanged).
Example:
Current scheme:
With 8 UDP flows (with random UDP src port):
flow-dump from pmd on cpu core: 2
recirc_id(0),in_port(7),<...> actions:hash(hash_l4(0)),recirc(0x1)
Roi Dayan [Tue, 16 Jun 2020 13:03:57 +0000 (16:03 +0300)]
netdev-offload-tc: Revert tunnel src/dst port masks handling
The cited commit intended to add tc support for masking tunnel src/dst
ips and ports. It's not possible to do tunnel ports masking with
openflow rules and the default mask for tunnel ports set to 0 in
tnl_wc_init(), unlike tunnel ports default mask which is full mask.
So instead of never passing tunnel ports to tc, revert the changes
to tunnel ports to always pass the tunnel port.
In sw classification is done by the kernel, but for hw we must match
the tunnel dst port.
Fixes: 5f568d049130 ("netdev-offload-tc: Allow to match the IP and port mask of tunnel") Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Eli Britstein <elibr@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
Dumitru Ceara [Fri, 10 Jan 2020 09:34:43 +0000 (10:34 +0100)]
ofproto-dpif-trace: Improve NAT tracing.
When ofproto/trace detects a recirc action it resumes execution at the
specified next table. However, if the ct action performs SNAT/DNAT,
e.g., ct(commit,nat(src=1.1.1.1:4000),table=42), the src/dst IPs and
ports in the oftrace_recirc_node->flow field are not updated. This leads
to misleading outputs from ofproto/trace as real packets would actually
first get NATed and might match different flows when recirculated.
Assume the first IP/port from the NAT src/dst action will be used by
conntrack for the translation and update the oftrace_recirc_node->flow
accordingly. This is not entirely correct as conntrack might choose a
different IP/port but the result is more realistic than before.
This fix covers new connections. However, for reply traffic that executes
actions of the form ct(nat, table=42) we still don't update the flow as
we don't have any information about conntrack state when tracing.
Also move the oftrace_recirc_node processing out of ofproto_trace()
and to its own function, ofproto_trace_recirc_node() for better
readability/
Signed-off-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Han Zhou [Fri, 15 May 2020 07:17:47 +0000 (00:17 -0700)]
odp-util.c: Fix dp_hash execution with slowpath actions.
When dp_hash is executed with slowpath actions, it results in endless
recirc loop in kernel datapath, and finally drops the packet, with
kernel logs:
openvswitch: ovs-system: deferred action limit reached, drop recirc action
The root cause is that the dp_hash value calculated by slowpath is not
passed to datapath when executing the recirc action, thus when the recirced
packet miss upcall comes to userspace again, it generates the dp_hash
and recirc action again, with same recirc_id, which in turn generates
a megaflow with recirc action with the recird_id same as the recirc_id in
its match condition, which causes a loop in datapath.
For example, this can be reproduced with below setup of OVN environment:
Assume there is a route from the VIF to R3: R0 -> R1 -> R3, and there are two
routes (ECMP) from R3 to the VIF:
R3 -> R1 -> R0
R3 -> R2 -> R0
Now if we ping from the VIF to R3, the OVS flow execution on the HV of the VIF
will hit the R3's datapath which has flows that responds to the ICMP packet
by setting ICMP fields, which requires slowpath actions, and in later flow
tables it will hit the "group" action that selects between the ECMP routes.
By default OVN uses "dp_hash" method for the "group" action.
For the first miss upcall packet, dp_hash value is empty, so the group action
will be translated to "dp_hash" and "recirc".
During action execution, because of the previous actions that sets ICMP fields,
the whole execution requires slowpath, so it tries to execute all actions in
userspace in odp_execute_actions(), including dp_hash action, except the
recirc action, which can only be executed in datapath. So the dp_hash value
is calculated in userspace, and then the packet is injected to datapath for
recirc action execution.
However, the dp_hash calculated by the userspace is not passed to datapath.
Because of this, the packet recirc in datapath doesn't have dp_hash value,
and the miss upcall for the recirced packet hits the same flow tables and
triggers same "dp_hash" and "recirc" action again, with exactly same recirc_id!
This time, the new upcall doesn't require any slowpath execution, so both
the dp_hash and recirc actions are executed in datapath, after creating a
datapath megaflow like:
Numan Siddique [Fri, 5 Jun 2020 08:30:29 +0000 (14:00 +0530)]
ovsdb idl: Try committing the pending txn in ovsdb_idl_loop_run.
The function ovsdb_idl_loop_run(), after calling ovsdb_idl_run(),
returns a transaction object (of type 'struct ovsdb_idl_txn').
The returned transaction object can be NULL if there is a pending
transaction (loop->committing_txn) in the idl loop object.
Normally the clients of idl library, first call ovsdb_idl_loop_run(),
then do their own processing and create any idl transactions during
this processing and then finally call ovsdb_idl_loop_commit_and_wait().
If ovsdb_idl_loop_run() returns NULL transaction object, then much
of the processing done by the client gets wasted as in the case
of ovn-controller.
The client (in this case ovn-controller), can skip the processing
and instead call ovsdb_idl_loop_commit_and_wait() if the transaction
oject is NULL. But ovn-controller uses IDL tracking and it may
loose the tracked changes in that run.
This patch tries to improve this scenario, by checking if the
pending transaction can be committed in the ovsdb_idl_loop_run()
itself and if the pending transaction is cleared (because of the
response messages from ovsdb-server due to a transaction message
in the previous run), ovsdb_idl_loop_run() can return a valid
transaction object.
CC: Han Zhou <hzhou@ovn.org> Signed-off-by: Numan Siddique <numans@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
Rui Cao [Mon, 15 Jun 2020 06:05:13 +0000 (14:05 +0800)]
dpif-netlink: Fix Windows incompatibility when setting new feature
OVS_DP_ATTR_NAME field is required when sending OVS_DP_CMD_SET to
windows kernel driver. The function "dpif_netlink_set_features"
dose not set the OVS_DP_ATTR_NAME field which will cause set feature
failure and ovs-vswitchd will exist.
This patch fixes the issue by setting "request.name" in request.
Reported-at: https://github.com/openvswitch/ovs-issues/issues/187
Submitted-at: https://github.com/openvswitch/ovs/pull/319 Signed-off-by: Rui Cao <rcao@vmware.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Dumitru Ceara [Thu, 28 May 2020 12:32:31 +0000 (14:32 +0200)]
ovsdb-idl: Avoid inconsistent IDL state with OVSDB_MONITOR_V3.
Assuming an ovsdb client connected to a database using OVSDB_MONITOR_V3
(i.e., "monitor_cond_since" method) with the initial monitor condition
MC1.
Assuming the following two transactions are executed on the
ovsdb-server:
TXN1: "insert record R1 in table T1"
TXN2: "insert record R2 in table T2"
If the client's monitor condition MC1 for table T2 matches R2 then the
client will receive the following update3 message:
method="update3", "insert record R2 in table T2", last-txn-id=TXN2
At this point, if the presence of the new record R2 in the IDL triggers
the client to update its monitor condition to MC2 and add a clause for
table T1 which matches R1, a monitor_cond_change message is sent to the
server:
method="monitor_cond_change", "clauses from MC2"
In normal operation the ovsdb-server will reply with a new update3
message of the form:
method="update3", "insert record R1 in table T1", last-txn-id=TXN2
However, if the connection drops in the meantime, this last update might
get lost.
It might happen that during the reconnect a new transaction happens
that modifies the original record R1:
TXN3: "modify record R1 in table T1"
When the client reconnects, it will try to perform a fast resync by
sending:
method="monitor_cond_since", "clauses from MC2", last-txn-id=TXN2
Because TXN2 is still in the ovsdb-server transaction history, the
server replies with the changes from the most recent transactions only,
i.e., TXN3:
result="true", last-txbb-id=TXN3, "modify record R1 in table T1"
This causes the IDL on the client in to end up in an inconsistent
state because it has never seen the update that created R1.
Such a scenario is described in:
https://bugzilla.redhat.com/show_bug.cgi?id=1808580#c22
To avoid this issue, the IDL will now maintain (up to) 3 different
types of conditions for each DB table:
- new_cond: condition that has been set by the IDL client but has
not yet been sent to the server through monitor_cond_change.
- req_cond: condition that has been sent to the server but the reply
acknowledging the change hasn't been received yet.
- ack_cond: condition that has been acknowledged by the server.
Whenever the IDL FSM is restarted (e.g., voluntary or involuntary
disconnect):
- if there is a known last_id txn-id the code ensures that new_cond
will contain the most recent condition set by the IDL client
(either req_cond if there was a request in flight, or new_cond
if the IDL client set a condition while the IDL was disconnected)
- if there is no known last_id txn-id the code ensures that ack_cond will
contain the most recent conditions set by the IDL client regardless
whether they were acked by the server or not.
When monitor_cond_since/monitor_cond requests are sent they will
always include ack_cond and if new_cond is not NULL a follow up
monitor_cond_change will be generated afterwards.
On the other hand ovsdb_idl_db_set_condition() will always modify new_cond.
This ensures that updates of type "insert" that happened before the last
transaction known by the IDL but didn't match old monitor conditions are
sent upon reconnect if the monitor condition has changed to include them
in the meantime.
Fixes: 403a6a0cb003 ("ovsdb-idl: Fast resync from server when connection reset.") Signed-off-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Dumitru Ceara [Thu, 28 May 2020 12:32:17 +0000 (14:32 +0200)]
ovsdb-server.7: Mention update3 as replies to monitor_cond_change.
Monitor_cond_change might trigger updates to be sent to clients as results
to condition changes. These updates can be either update2 (for monitor_cond
monitors) or update3 (for monitor_cond_since monitors). The documentation
used to mention only update2.
Signed-off-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Wed, 3 Jun 2020 14:58:16 +0000 (16:58 +0200)]
ovs-rcu: Avoid flushing callbacks during postponing.
ovsrcu_flush_cbset() call during ovsrcu_postpone() could cause
use after free in case the caller sets new pointer only after
postponing free for the old one:
Vlad Buslov [Thu, 4 Jun 2020 10:47:01 +0000 (13:47 +0300)]
tc: Support new terse dump kernel API
When dumping flows in terse mode set TCA_DUMP_FLAGS attribute to
TCA_DUMP_FLAGS_TERSE flag to prevent unnecessary copying of data between
kernel and user spaces. Only expect kernel to provide cookie, stats and
flags when dumping filters in terse mode.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
Vlad Buslov [Thu, 4 Jun 2020 10:47:00 +0000 (13:47 +0300)]
netdev-offload: Implement terse dump support
In order to improve revalidator performance by minimizing unnecessary
copying of data, extend netdev-offloads to support terse dump mode. Extend
netdev_flow_api->flow_dump_create() with 'terse' bool argument. Implement
support for terse dump in functions that convert netlink to flower and
flower to match. Set flow stats "used" value based on difference in number
of flow packets because lastuse timestamp is not included in TC terse dump.
Kernel API support is implemented in following patch.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
Tonghao Zhang [Tue, 2 Jun 2020 13:50:25 +0000 (21:50 +0800)]
netdev-offload-tc: Expand tunnel source IPs masked match
To support more use case, for example, DDOS, which
packets should be dropped in hardware, this patch
allows users to match only the tunnel source IPs with
masked value.
Tonghao Zhang [Tue, 2 Jun 2020 13:50:24 +0000 (21:50 +0800)]
netdev-offload-tc: Allow to match the IP and port mask of tunnel
This patch allows users to offload the TC flower rules with
tunnel mask. This patch allows masked match of the following,
where previously supported an exact match was supported:
* Remote (dst) tunnel endpoint address
* Local (src) tunnel endpoint address
* Remote (dst) tunnel endpoint UDP port
And also allows masked match of the following, where previously
no match was supported:
* Local (src) tunnel endpoint UDP port
In some case, mask is useful as wildcards. For example, DDOS,
in that case, we don’t want to allow specified hosts IPs or
only source Ports to access the targeted host. For example:
Tonghao Zhang [Tue, 2 Jun 2020 13:50:22 +0000 (21:50 +0800)]
dpif-netlink: Generate ufids for installing TC flowers
To support installing the TC flowers to HW, via "ovs-appctl dpctl/add-flow"
command, there should be an ufid. This patch will check whether ufid exists,
if not, generate an ufid. Should to know that when processing upcall packets,
ufid is generated in parse_odp_packet for kernel datapath.
Configuring the max-idle/max-revalidator, may help testing this patch.
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Simon Horman <simon.horman@netronome.com>
Ilya Maximets [Mon, 25 May 2020 16:21:39 +0000 (18:21 +0200)]
ovsdb: Fix timeout type for wait operation.
According to RFC 7047, 'timeout' is an integer field:
5.2.6. Wait
The "wait" object contains the following members:
"op": "wait" required
"timeout": <integer> optional
...
For some reason initial implementation treated it as a real number.
This causes a build issue with clang that complains that LLONG_MAX
could not be represented as double:
ovsdb/execution.c:733:32: error: implicit conversion from 'long long'
to 'double' changes value from 9223372036854775807 to 9223372036854775808
timeout_msec = MIN(LLONG_MAX, json_real(timeout));
~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/sys/limits.h:69:19: note: expanded from macro 'LLONG_MAX'
#define LLONG_MAX __LLONG_MAX /* max for a long long */
^~~~~~~~~~~
/usr/include/x86/_limits.h:74:21: note: expanded from macro '__LLONG_MAX'
#define __LLONG_MAX 0x7fffffffffffffffLL /* max value for a long long */
^~~~~~~~~~~~~~~~~~~~
./lib/util.h:90:21: note: expanded from macro 'MIN'
#define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
^ ~
Fix that by changing parser to treat 'timeout' as integer.
Fixes clang build on FreeBSD 12.1 in CirrusCI.
Fixes: f85f8ebbfac9 ("Initial implementation of OVSDB.") Acked-by: Han Zhou <hzhou@ovn.org> Acked-by: Numan Siddique <numans@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Mark Michelson [Fri, 1 May 2020 19:13:08 +0000 (15:13 -0400)]
ovsdb-idl: Add function to reset min_index.
If an administrator removes all of the databases in a cluster from
disk, then ovsdb IDL clients will have a problem. The databases will all
reset their stored indexes to 0, so The IDL client's min_index will be
higher than the indexes of all databases in the cluster. This results in
the client constantly connecting to databases, detecting the data as
"stale", and then attempting to connect to another.
This function provides a way to reset the IDL to an initial state with
min_index of 0. This way, the client will not wrongly detect the
database data as stale and will recover properly.
Notice that this function is not actually used anywhere in this patch.
This will be used by OVN, though, since OVN is the primary user of
clustered OVSDB.
Signed-off-by: Mark Michelson <mmichels@redhat.com> Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
openvswitch: add missing attribute validation for hash
Add missing attribute validation for OVS_PACKET_ATTR_HASH
to the netlink policy.
Fixes: bd1903b7c459 ("net: openvswitch: add hash info to upcall") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
commit bd1903b7c4596ba6f7677d0dfefd05ba5876707d
Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Date: Wed Nov 13 23:04:49 2019 +0800
net: openvswitch: add hash info to upcall
When using the kernel datapath, the upcall don't
include skb hash info relatived. That will introduce
some problem, because the hash of skb is important
in kernel stack. For example, VXLAN module uses
it to select UDP src port. The tx queue selection
may also use the hash in stack.
Hash is computed in different ways. Hash is random
for a TCP socket, and hash may be computed in hardware,
or software stack. Recalculation hash is not easy.
Hash of TCP socket is computed:
tcp_v4_connect
-> sk_set_txhash (is random)
__tcp_transmit_skb
-> skb_set_hash_from_sk
There will be one upcall, without information of skb
hash, to ovs-vswitchd, for the first packet of a TCP
session. The rest packets will be processed in Open vSwitch
modules, hash kept. If this tcp session is forward to
VXLAN module, then the UDP src port of first tcp packet
is different from rest packets.
TCP packets may come from the host or dockers, to Open vSwitch.
To fix it, we store the hash info to upcall, and restore hash
when packets sent back.
+---------------+ +-------------------------+
| Docker/VMs | | ovs-vswitchd |
+----+----------+ +-+--------------------+--+
| ^ |
| | |
| | upcall v restore packet hash
(not recalculate)
| +-+--------------------+--+
| tap netdev | | vxlan module
+---------------> +--> Open vSwitch ko +-->
or internal type | |
+-------------------------+
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/364062.html Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Aliasgar Ginwala <aginwala@ebay.com> Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Eiichi Tsukata [Wed, 27 May 2020 02:13:34 +0000 (11:13 +0900)]
classifier: Prevent tries vs n_tries race leading to NULL dereference.
Currently classifier tries and n_tries can be updated not atomically,
there is a race condition which can lead to NULL dereference.
The race can happen when main thread updates a classifier tries and
n_tries in classifier_set_prefix_fields() and at the same time revalidator
or handler thread try to lookup them in classifier_lookup__(). Such race
can be triggered when user changes prefixes of flow_table.
[main thread] [revalidator/handler thread]
===========================================================
/* cls->n_tries == 2 */
for (int i = 0; i < cls->n_tries; i++) {
trie_init(cls, i, NULL);
/* n_tries == 0 */
cls->n_tries = n_tries;
/* cls->tries[i]->feild is NULL */
trie_ctx_init(&trie_ctx[i],&cls->tries[i]);
/* trie->field is NULL */
ctx->be32ofs = trie->field->flow_be32ofs;
To prevent the race, instead of re-introducing internal mutex
implemented in the commit fccd7c092e09 ("classifier: Remove internal
mutex."), this patch makes trie field RCU protected and checks it after
read.
Ilya Maximets [Fri, 22 May 2020 20:36:27 +0000 (22:36 +0200)]
raft: Avoid sending equal snapshots.
Snapshots are huge. In some cases we could receive several outdated
append replies from the remote server. This could happen in high
scale cases if the remote server is overloaded and not able to process
all the raft requests in time. As an action to each outdated append
reply we're sending full database snapshot. While remote server is
already overloaded those snapshots will stuck in jsonrpc backlog for
a long time making it grow up to few GB. Since remote server wasn't
able to timely process incoming messages it will likely not able to
process snapshots leading to the same situation with low chances to
recover. Remote server will likely stuck in 'candidate' state, other
servers will grow their memory consumption due to growing jsonrpc
backlogs:
jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.16.0.3:6644,
num of msgs: 3795, backlog: 8838994624.
This patch is trying to avoid that situation by avoiding sending of
equal snapshot install requests. This helps maintain reasonable memory
consumption and allows the cluster to recover on a larger scale.
Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Thu, 14 May 2020 20:10:45 +0000 (22:10 +0200)]
ovsdb-server: Fix schema leak while reading db.
parse_txn() function doesn't always take ownership of the 'schema'
passed. So, if the schema of the clustered db has same version as the
one that already in use, parse_txn() will not use it, resulting with a
memory leak:
7,827 (56 direct, 7,771 indirect) bytes in 1 blocks are definitely lost
at 0x483BB1A: calloc (vg_replace_malloc.c:762)
by 0x44AD02: xcalloc (util.c:121)
by 0x40E70E: ovsdb_schema_create (ovsdb.c:41)
by 0x40EA6D: ovsdb_schema_from_json (ovsdb.c:217)
by 0x415EDD: ovsdb_storage_read (storage.c:280)
by 0x408968: read_db (ovsdb-server.c:607)
by 0x40733D: main_loop (ovsdb-server.c:227)
by 0x40733D: main (ovsdb-server.c:469)
While we could put ovsdb_schema_destroy() in a few places inside
'parse_txn()', from the users' point of view it seems better to have a
constant argument and just clone the 'schema' if needed. The caller
will be responsible for destroying the 'schema' it owns.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.") Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Fri, 22 May 2020 16:31:19 +0000 (18:31 +0200)]
ovsdb: Add raft memory usage to memory report.
Memory reports could be found in logs or by calling 'memory/show'
appctl command. For ovsdb-server it includes information about db
cells, monitor connections with their backlog size, etc. But it
doesn't contain any information about memory consumed by raft.
Backlogs of raft connections could be insanely large because of
snapshot installation requests that simply contains the whole database.
In not that healthy clusters where one of ovsdb servers is not able to
timely handle all the incoming raft traffic, backlog on a sender's side
could cause significant memory consumption issues.
Adding new 'raft-connections' and 'raft-backlog' counters to the
memory report to better track such conditions.
Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
And with the following calls:
ovsdb_idl_txn_write_clone
ovsdb_idl_txn_write__
6. deref_parm_in_call: Function ovsdb_datum_destroy dereferences
datum->values
ovsdb_datum_destroy
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
William Tu [Thu, 14 May 2020 14:02:43 +0000 (07:02 -0700)]
ovs-bugtool: Add -m option to dump-flows.
This patch adds 'ovs-appctl dpctl/dump-flows -m' to bugtool,
the output will include wildcarded fields and the miniflow bits,
such as 'dp-extra-info:miniflow_bits(4,1)'.
Cc: Emma Finn <emma.finn@intel.com> Acked-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
Greg Rose [Tue, 19 May 2020 22:01:46 +0000 (15:01 -0700)]
Documentation: Fix kernel support matrix
The documentation matrix for OVS branches and which kernels they support
is out of date. Update it to show that since 2.10 the lowest kernel
that we test and support is Linux 3.16.
RHEL and CentOS kernels based upon the original 3.10 kernel are still
supported.
Reported-by: Han Zhou <hzhou@ovn.org>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2020-May/370742.html Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Greg Rose <gvrose8192@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
Aaron Conole [Fri, 15 May 2020 20:36:18 +0000 (16:36 -0400)]
netdev-linux: Update LAG in all cases.
In some cases, when processing a netlink change event, it's possible for
an alternate part of OvS (like the IPv6 endpoint processing) to hold an
active netdev interface. This creates a race-condition, where sometimes
the OvS change processing will take the normal path. This doesn't work
because the netdev device object won't actually be enslaved to the
ovs-system (for instance, a linux bond) and ingress qdisc entries will
be missing.
To address this, we update the LAG information in ALL cases where
LAG information could come in.
Fixes: d22f8927c3c9 ("netdev-linux: monitor and offload LAG slaves to TC") Cc: Marcelo Leitner <mleitner@redhat.com> Cc: John Hurley <john.hurley@netronome.com> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ansis Atteka [Fri, 15 May 2020 19:08:13 +0000 (12:08 -0700)]
debian: Add python3-sphinx to ovs build dependencies
python3-sphinx has become mandatory build dependency since patch 39b5e46 ("Documentation: Convert multiple manpages to ReST."), because,
otherwise, without this dependency installed, packaging of OVS debian
packages fails with an error that generated man pages can't be found.
Fixes: 39b5e46312 ("Documentation: Convert multiple manpages to ReST.") CC: Ben Pfaff <blp@ovn.org> Signed-off-by: Ansis Atteka <aatteka@ovn.org> Reported-by: Artem Teleshev <artem.teleshev@gmail.com> Acked-by: Greg Rose <gvrose8192@gmail.com>
Roi Dayan [Thu, 14 May 2020 13:25:10 +0000 (16:25 +0300)]
debian: Fix package dependencies
In python2 package was python-twisted-conch but it looks like
for python3 it's just python3-twisted.
For zope interface the python3 package name is python3-zope.interface.
Fixes: 1ca0323e7c29 ("Require Python 3 and remove support for Python 2.") Signed-off-by: Roi Dayan <roid@mellanox.com> Acked-by: Ansis Atteka <aatteka@ovn.org>
From man ovs-fields(7), the conntrack origin tuple fields
ct_nw_src/dst, ct_ipv6_src/dst, and ct_tp_src/dst are supposed
to be bitwise maskable, but they are not. This patch enables
those fields to be maskable, and adds a regression test.
Fixes: daf4d3c18da4 ("odp: Support conntrack orig tuple key.") Reported-by: Wenying Dong <wenyingd@vmware.com> Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com> Signed-off-by: William Tu <u9012063@gmail.com>
William Tu [Tue, 24 Mar 2020 22:10:50 +0000 (15:10 -0700)]
userspace: Enable TSO support for non-DPDK.
This patch enables TSO support for non-DPDK use cases, and
also add check-system-tso testsuite. Before TSO, we have to
disable checksum offload, allowing the kernel to calculate the
TCP/UDP packet checsum. With TSO, we can skip the checksum
validation by enabling checksum offload, and with large packet
size, we see better performance.
Consider container to container use cases:
iperf3 -c (ns0) -> veth peer -> OVS -> veth peer -> iperf3 -s (ns1)
And I got around 6Gbps, similar to TSO with DPDK-enabled.
Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: William Tu <u9012063@gmail.com>