Jarno Rajahalme [Thu, 18 Aug 2016 21:44:49 +0000 (14:44 -0700)]
NEWS: NAT and bundles fixes for 2.6 release.
Mention Conntrack NAT support for Linux datapath. Resolve conflicting
NEWS about bundles support. Move Linux datapath truncate action
support NEWS to be under the "Linux" heading.
Ilya Maximets [Thu, 18 Aug 2016 13:13:58 +0000 (16:13 +0300)]
netdev-dpdk: Fix vHost stats.
This patch introduces function 'netdev_dpdk_filter_packet_len()' which is
intended to find and remove all packets with 'pkt_len > max_packet_len'
from the Tx batch.
It fixes inaccurate counting of 'tx_bytes' in the vHost case when packets
were dropped, and allows the send function to be simplified.
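The filtering idea described above can be sketched in a few lines. This is only an illustrative model, not the C code in netdev-dpdk: packets are represented by their lengths alone, and the names 'filter_packet_len' and 'send_batch' are hypothetical.

```python
def filter_packet_len(batch, max_packet_len):
    """Split a Tx batch (a list of packet lengths in bytes) into kept and dropped."""
    kept = [pkt_len for pkt_len in batch if pkt_len <= max_packet_len]
    return kept, len(batch) - len(kept)


def send_batch(batch, max_packet_len):
    """Account tx stats only over the packets that survive the filter."""
    kept, dropped = filter_packet_len(batch, max_packet_len)
    return {"tx_packets": len(kept), "tx_bytes": sum(kept), "tx_dropped": dropped}
```

Because oversized packets are removed before the byte count is taken, 'tx_bytes' can no longer include packets that were never sent.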
Fixes: 0072e931b207 ("netdev-dpdk: add support for jumbo frames") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Ian Stokes <ian.stokes@intel.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Ben Pfaff [Wed, 17 Aug 2016 21:49:38 +0000 (14:49 -0700)]
ovn-northd: Remove unnecessary 'if' test from build_acls().
This 'if' statement checked for two conditions, but neither one was
necessary. First, od->nbs is always nonnull, because the caller already
checked. Second, it doesn't matter whether od->nbs->n_ports is nonzero
because it doesn't affect the behavior of the code protected by the 'if'
statement.
This change is best viewed ignoring white space only changes.
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Numan Siddique <nusiddiq@redhat.com>
Jesse Gross [Tue, 16 Aug 2016 20:21:17 +0000 (13:21 -0700)]
ovn: Set critical bit in Geneve option.
Currently the Geneve option type that OVN uses is 0, which in
Geneve marks this as non-critical. Non-critical means that if a
receiver does not recognize this option, it is free to ignore it
and continue processing the packet.
OVN uses its option to transmit things like input and output port
which are used to enforce security policies and direct packets to
their correct location. If the recipient of a packet ignored this
information then it would likely be a security hole. This would seem
to qualify the option as critical.
There's no issue in an instance of OVN as currently written - the
receiver will always match on the option data. However, if a
theoretical future version that did not use this option was connected
or a third-party component was introduced then it's possible that this
might be accidentally ignored.
This patch changes the option type used by OVN to include the
critical bit to properly mark the intention. Obviously, this will
cause interoperability issues with any existing deployments but
it should be fine while OVN is still labeled as experimental.
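For illustration, the critical bit is the high-order bit of the 8-bit Geneve option type, so marking type 0 as critical yields 0x80. The sketch below shows that arithmetic and the receiver behavior described above; the helper names are hypothetical and this is not OVN code.

```python
GENEVE_CRIT_OPT_TYPE = 0x80  # high-order bit of the 8-bit option type


def is_critical(opt_type):
    """True if the option type has the critical bit set."""
    return bool(opt_type & GENEVE_CRIT_OPT_TYPE)


def mark_critical(opt_type):
    """Return the option type with the critical bit set."""
    return opt_type | GENEVE_CRIT_OPT_TYPE


def receiver_action(opt_type, known_types):
    """What a receiver does with an option, given the types it understands."""
    if opt_type in known_types:
        return "process"
    # An unrecognized critical option means the packet must not be forwarded.
    return "drop packet" if is_critical(opt_type) else "ignore option"
```

A receiver that does not know the option now drops the packet instead of silently skipping the port information.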
Signed-off-by: Jesse Gross <jesse@kernel.org> Acked-by: Russell Bryant <russell@ovn.org>
When doing a restart, the routing table will open ports as system, which
prevents internal ports from being opened with the right type. That causes
failures in creating the ports.
We should revisit this patch after finding a proper fix on the routing table
layer.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
In cases where a DNAT IP is moved to a new router or the SNAT IP is reused
with a new mac address, the NAT IPs become unreachable because the external
switches/routers have stale ARP entries. This commit
aims to fix the problem by sending GARPs for NAT IPs via localnet. There are
two parts to this patch.
[1] Adding the datapath of the l3 gateway port to local datapaths in
ovn-controller. This will result in creation of patch ports between
br-int and the physical bridge (that provides connectivity to local network
via localnet port) and will enable gateway router to have external
connectivity
[2] A new options key "nat-addresses" is added to the logical switch port of
type router, the logical switch that has this port is the one that provides
connectivity to local network via localnet port. The value for the key
"nat-addresses" is the MAC address of the port followed by a list of
SNAT & DNAT IPs. When ovn-controller sees a new IP in nat-addresses option,
it sends a GARP message for the IP via the localnet port. nat-addresses
option is added to the logical switch port of type router and not to the
logical router port, because the logical switch datapath has the localnet
port. Adding nat-addresses option to the router port will involve more
changes to get to the local net port.
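A minimal sketch of how such a value could be consumed, assuming the MAC and the IPs are separated by whitespace (the exact separator is an assumption here, as are the function names). It is not the ovn-controller implementation.

```python
import ipaddress


def parse_nat_addresses(value):
    """Split a "nat-addresses" value into (mac, [ip, ...])."""
    mac, *ips = value.split()
    return mac, [ipaddress.ip_address(ip) for ip in ips]


def new_garp_targets(value, already_announced):
    """Return the (mac, ip) pairs that have not been announced yet."""
    mac, ips = parse_nat_addresses(value)
    return [(mac, ip) for ip in ips if ip not in already_announced]
```

Only the IPs not seen before produce a GARP, which matches the "sees a new IP" behavior described in [2].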
ovn: Fix ARP request flow for unknown IP in lrouter
TPA in arp requests generated for unknown MAC-to-IP bindings is currently set
to DST_IP of the original packet. These arps will not be resolved when the
DST_IP is reachable via the default gateway. This patch fixes the issue by
setting the TPA to reg0. In routing stage reg0 is set to IP of the default
gateway when the packet has to go through the default gateway, otherwise reg0
is set to the DST_IP of the original packet.
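The choice of ARP target can be modeled as a next-hop lookup. The sketch below is a toy longest-prefix-match table in Python, with a hypothetical 'arp_target' helper standing in for the value that the routing stage stores in reg0; it is not the logical flow itself.

```python
import ipaddress


def arp_target(dst_ip, routes):
    """Return the address to ARP for, or None if no route matches.

    routes is a list of (network, gateway) pairs; gateway is None for
    directly connected networks. The longest matching prefix wins.
    """
    dst = ipaddress.ip_address(dst_ip)
    matches = [(net, gw) for net, gw in routes if dst in net]
    if not matches:
        return None
    _, gateway = max(matches, key=lambda route: route[0].prefixlen)
    # Directly connected: ARP for the destination. Otherwise: ARP for the gateway.
    return gateway if gateway is not None else dst
```

With this rule a destination behind the default gateway results in an ARP request for the gateway, which can actually be answered.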
Ben Pfaff [Sun, 14 Aug 2016 22:22:29 +0000 (15:22 -0700)]
ovn-trace: New utility.
This new utility is intended to fulfill for OVN the purpose that
"ofproto/trace" has for Open vSwitch. First, it's meant to be a useful
tool for troubleshooting and diagnosis and in general for improving one's
understanding of the emergent properties of a flow table. Second, it
simplifies and increases the practical scope of testing, as well as making
testing more reliable and repeatable and failures easier to interpret.
This commit adds only a single test that uses the new utility, based on the
oldest OVN end-to-end test "ovn -- 3 HVs, 1 LS, 3 lports/HV". The
differences between the old and the new test illustrate properties of
tracing. First, the new test does not start any ovn-controller processes
or simulate any hypervisors in a nontrivial way. This is because ovn-trace
does not actually forward packets or rely on the physical structure of the
system. Second, whereas the old test tested not just the logical but also
the physical structure of the system, it needed to have several logical
ports, a total of 9 (3 on each of 3 HVs), whereas since this test only
tests the logical network implementation it can use a smaller number. This
property also means that the new test runs significantly faster than the old
one (less than a second on my laptop).
In my opinion this approach points the way toward the future of OVN
testing. Certainly, we need end-to-end tests. However, I believe that the
bulk of our tests can be broken into ones that test the logical network
implementation (using tracing) and ones that test physical/logical
translation.
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Ben Pfaff [Sat, 6 Aug 2016 06:19:49 +0000 (23:19 -0700)]
expr: New function expr_parse_microflow().
This allows "ovstest test-ovn evaluate-expr" to work with arbitrary
microflows rather than just a few restricted variables, but the main point
is to enable the upcoming "ovn-trace" utility to accept arbitrary
microflows in a format that seems reasonable for OVN.
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Justin Pettit <jpettit@ovn.org>
Ben Pfaff [Wed, 3 Aug 2016 22:35:07 +0000 (15:35 -0700)]
expr: New function expr_evaluate().
An upcoming commit will need to evaluate individual expressions outside the
context of a classifier. test-ovn already had a function to do this but it
wasn't general-purpose, so this commit makes a general-purpose version and
adopts it for use in test-ovn as well.
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Justin Pettit <jpettit@ovn.org>
Ben Pfaff [Tue, 2 Aug 2016 05:52:01 +0000 (22:52 -0700)]
ovn-northd: Copy name in logical datapath southbound representations.
This makes it easier to debug based on the southbound database without
looking at the northbound representation. This commit adds the name
to "ovn-sbctl dump-flows" output and it will be even more useful in
an upcoming commit.
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Justin Pettit <jpettit@ovn.org>
Ben Pfaff [Tue, 2 Aug 2016 05:50:10 +0000 (22:50 -0700)]
meta-flow: New functions mf_subfield_copy() and mf_subfield_swap().
The function nxm_execute_reg_move() was almost a general-purpose function
for manipulating subfields, except for its awkward interface that took a
struct ofpact_reg_move instead of a plain source and destination. This
commit introduces a general-purpose function in meta-flow that corrects
this flaw, and updates the callers. An upcoming commit will introduce a
new user of the function.
This commit also introduces a related function mf_subfield_swap() to swap
the contents of subfields. An upcoming commit will introduce the first
user.
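To illustrate what copying and swapping subfields means, here is a small Python model in which fields are plain integers and a subfield is a (name, bit offset, width) triple. The representation and helper names are assumptions made for the example; the real functions operate on struct mf_subfield and struct flow.

```python
def read_bits(value, ofs, n_bits):
    """Extract n_bits starting at bit offset ofs."""
    return (value >> ofs) & ((1 << n_bits) - 1)


def write_bits(value, ofs, n_bits, bits):
    """Return value with n_bits at offset ofs replaced by bits."""
    mask = ((1 << n_bits) - 1) << ofs
    return (value & ~mask) | ((bits << ofs) & mask)


def subfield_copy(fields, src, dst):
    """Copy subfield src into subfield dst; both must have the same width."""
    (s_name, s_ofs, n), (d_name, d_ofs, d_n) = src, dst
    assert n == d_n
    bits = read_bits(fields[s_name], s_ofs, n)
    fields[d_name] = write_bits(fields[d_name], d_ofs, n, bits)


def subfield_swap(fields, a, b):
    """Exchange the contents of two equally wide, non-overlapping subfields."""
    (a_name, a_ofs, n), (b_name, b_ofs, b_n) = a, b
    assert n == b_n
    a_bits = read_bits(fields[a_name], a_ofs, n)
    b_bits = read_bits(fields[b_name], b_ofs, n)
    fields[a_name] = write_bits(fields[a_name], a_ofs, n, b_bits)
    fields[b_name] = write_bits(fields[b_name], b_ofs, n, a_bits)
```

Taking a plain source and destination, rather than an action structure, is what lets other callers reuse the same helper.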
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Justin Pettit <jpettit@ovn.org>
Ciara Loftus [Mon, 15 Aug 2016 15:11:26 +0000 (16:11 +0100)]
netdev-dpdk: vHost client mode and reconnect
Until now, vHost ports in OVS have only been able to operate in 'server'
mode whereby OVS creates and manages the vHost socket and essentially
acts as the vHost 'server'. With this commit a new mode, 'client' mode,
is available. In this mode, OVS acts as the vHost 'client' and connects
to the socket created and managed by QEMU which now acts as the vHost
'server'. This mode allows for reconnect capability, which allows a
vHost port to resume normal connectivity in the event of a switch reset.
By default dpdkvhostuser ports still operate in 'server' mode. That is
unless a valid 'vhost-server-path' is specified for a device like so:
ovs-vsctl set Interface dpdkvhostuser0
options:vhost-server-path=/path/to/socket
'vhost-server-path' represents the full path of the vhost user socket
that has been or will be created by QEMU. Once specified, the port stays
in 'client' mode for the remainder of its lifetime.
QEMU v2.7.0+ is required when using OVS in vHost client mode and QEMU in
vHost server mode.
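The mode selection described above amounts to a one-way switch. The toy model below captures it, approximating a "valid" path as any non-empty string (an assumption); the class is hypothetical and not part of netdev-dpdk.

```python
class VhostPort:
    """Toy model of a dpdkvhostuser port's mode selection."""

    def __init__(self):
        self.mode = "server"  # default mode described in the commit message

    def set_options(self, options):
        path = options.get("vhost-server-path")
        # Assumption: "valid" is approximated as "non-empty string".
        if path:
            self.mode = "client"
        # There is no branch back to "server": client mode is sticky.
        return self.mode
```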
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Ciara Loftus [Mon, 15 Aug 2016 15:11:25 +0000 (16:11 +0100)]
netdev-dpdk: Consistent naming for vhost
A mix of vhost_user_ and vhost_ is used when naming vhost functions. The
'user_' has been dropped for consistency. Also remove empty init
functions for netdev dpdk classes.
Suggested-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Daniele Di Proietto <diproiettod at vmware.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Ciara Loftus [Mon, 15 Aug 2016 15:11:24 +0000 (16:11 +0100)]
netdev-dpdk: Remove dpdkvhostcuse ports
This commit removes the 'dpdkvhostcuse' port type from the userspace
datapath. vhost-cuse ports are quickly becoming obsolete as the
vhost-user port type begins to support a greater feature-set thanks to
the addition of things like vhost-user multiqueue and potential
upcoming features like vhost-user client-mode and vhost-user reconnect.
The feature is also expected to be removed from DPDK soon.
One potential drawback of the removal of this support is that a
userspace vHost port type is not available in OVS for use with older
versions of QEMU (pre v2.2). Considering v2.2 is nearly two years old
this should however be a low impact change.
Ryan Moats [Mon, 15 Aug 2016 18:47:29 +0000 (18:47 +0000)]
Add read-only option to ovs-dpctl and ovs-ofctl commands.
ovs-dpctl and ovs-ofctl lack a read-only option to prevent
running of commands that perform read-write operations. Add
it and the necessary scaffolding to each.
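One way to picture the scaffolding is a command table in which each entry is tagged as read-only or read-write. The sketch below uses two real ovs-ofctl command names as examples, but the table layout, the tags and the 'check_allowed' helper are hypothetical.

```python
# Each command is tagged "ro" (read-only) or "rw" (read-write).
COMMANDS = {"dump-flows": "ro", "add-flow": "rw"}


def check_allowed(command, read_only):
    """Raise PermissionError if a read-write command runs in read-only mode."""
    if read_only and COMMANDS[command] == "rw":
        raise PermissionError(f"{command}: not allowed in read-only mode")
    return True
```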
Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Ben Pfaff [Sun, 14 Aug 2016 04:52:27 +0000 (21:52 -0700)]
ovsdb-idl: Style and comment improvements for conditional replication.
The conditional replication code had hardly any comments. This adds some.
This commit also fixes a number of style problems, factors out some code
into a helper function, and moves some struct declarations from a public
header, that were not used by client code, into more private locations.
Ben Pfaff [Tue, 16 Aug 2016 00:00:09 +0000 (17:00 -0700)]
lex: Integrate error handling into struct lexer.
The actions and expr modules had each developed their own error handling
code, and the two were very similar. Upcoming code needs similar error
handling, so rather than duplicating it again, integrate it into the lexer itself.
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Justin Pettit <jpettit@ovn.org>
Jarno Rajahalme [Mon, 15 Aug 2016 21:57:12 +0000 (14:57 -0700)]
ofproto: Reduce bundle memory use.
Instead of storing the (big) struct ofputil_flow_mod, create the new
rule and/or create the rule criteria for matching at bundle message
insert time. This change reduces the size of a bundle flow mod from
3.5kb to 272 bytes, not counting the created rule, which was anyway
created during bundle commit.
In successful bundles this shifts work out of the ofproto_mutex
critical section and should thus reduce the time the mutex is held
during bundle commit.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
Andy Zhou [Tue, 9 Aug 2016 22:35:46 +0000 (15:35 -0700)]
sandbox: launch SB backup server when running in OVN mode
Automatically launch backup server for OVN SB database that replicates
all transactions of the active server. This can be handy for
experimenting with the newly added replication feature.
Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org> Acked-by: Russell Bryant <russell@ovn.org>
Ciara Loftus [Mon, 15 Aug 2016 09:36:19 +0000 (10:36 +0100)]
netdev-dpdk: Do not attempt to initialise flow control for 'dpdkr' ports
Only 'dpdk' ports support flow control. This patch stops 'dpdkr' ports
from attempting to initialise this feature as this port type does not
support it.
Fixes: 9fd39370c12c ("netdev-dpdk: Add Flow Control support.") Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
openvswitch: do not ignore netdev errors when creating tunnel vports
The creation of a tunnel vport (geneve, gre, vxlan) brings up a
corresponding netdev, a multi-step operation which can fail.
For example, changing a vxlan vport's netdev state to 'up' binds the
vport's socket to a UDP port - if the binding fails (e.g. due to the
port being in use), the error is currently ignored giving the
appearance that the tunnel vport creation completed successfully.
Signed-off-by: Martynas Pumputis <martynas@weave.works> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Jesse Gross <jesse@kernel.org>
Ben Pfaff [Mon, 15 Aug 2016 18:34:02 +0000 (11:34 -0700)]
ovn: Rewrite logical action parsing and encoding library.
Until now, parsing logical actions and encoding them into OpenFlow has
happened in a single step. An upcoming commit will want to examine
actions after parsing without encoding them into OpenFlow. This commit
refactors OVN logical actions to make this possible.
The new form of the OVN action handling is closely modeled on ofp-actions
in the OVS core library. Notable differences are that OVN actions are
always fixed-length and that individual OVN actions can have destructors
(and thus can contain pointers to data that need to be freed when the
actions are destroyed).
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Justin Pettit <jpettit@ovn.org>
Amitabha Biswas [Mon, 15 Aug 2016 18:03:30 +0000 (11:03 -0700)]
ovsdb-idl: Fix bugs in Python IDL partial set and map.
This patch fixes a couple of bugs in commit a59912a0
(python: add support for partial map and partial set updates)
and reverses a simplification added in commit 884d9bad
(Simplify partial map Py3 IDL test) to make the Python3 test
cases pass.
The following changes have been made:
1. Allow multiple map updates on the same column in a transaction.
2. Partial map Py3 IDL test can now support multiple elements.
3. SetAttr overrides pre-existing insert and remove updates.
4. addvalue/delvalue contain unique elements.
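Items 3 and 4 can be pictured with a toy bookkeeping class. This is a hypothetical model of pending per-column updates, not the ovs.db.idl code; only the method names 'addvalue' and 'delvalue' are taken from the list above.

```python
class PendingSetUpdates:
    """Toy model of pending partial-set updates within one transaction."""

    def __init__(self):
        self.inserts = {}   # column -> set of values to insert
        self.removes = {}   # column -> set of values to remove
        self.assigned = {}  # column -> full replacement value

    def addvalue(self, column, value):
        # A set keeps the pending elements unique (item 4).
        self.inserts.setdefault(column, set()).add(value)

    def delvalue(self, column, value):
        self.removes.setdefault(column, set()).add(value)

    def set_column(self, column, values):
        # A full assignment overrides earlier partial updates (item 3).
        self.inserts.pop(column, None)
        self.removes.pop(column, None)
        self.assigned[column] = set(values)
```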
Signed-off-by: Amitabha Biswas <abiswas@us.ibm.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
"internal" netdevs are treated specially in OVS (e.g. for MTU), but
the dummy datapath remaps both "system" and "internal" devices to the
same "dummy" netdev class, so there's no way to discern those in tests.
This commit adds a new "dummy-internal" netdev type, which will be used
by the dummy datapath for internal ports, so that other parts of the
code can understand which ports are internal just by looking at the
netdev object.
The alternative solution, using the original interface type ("internal")
instead of the translated netdev type ("dummy"), is harder to implement,
because in so many places only the netdev object is available.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ben Pfaff <blp@ovn.org>
This doesn't have an immediate effect, but can mess up later
LL_RESERVED_SPACE calculations, such as done in
net/ipv6/mcast.c:mld_newpack. For reference, this issue was found
from a skb_panic raised there after the length calculations had given
the wrong result.
Note the other current users of this interface
(drivers/net/tun.c:tun_set_headroom and
drivers/net/veth.c:veth_set_rx_headroom) are both checking this
correctly thus need no modification.
Thanks to Ben for some pointers from the crash dumps!
Cc: Benjamin Poirier <bpoirier@suse.com> Cc: Paolo Abeni <pabeni@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1361414 Signed-off-by: Ian Wienand <iwienand@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Jesse Gross <jesse@kernel.org>
Jesse Gross [Sun, 14 Aug 2016 22:29:37 +0000 (15:29 -0700)]
ofproto-dpif-xlate: Use passed ctx in XLATE_REPORT_ERROR.
XLATE_REPORT_ERROR is a macro that takes struct xlate_ctx as an
argument but also implicitly uses 'ctx' from the local function
scope. This works with current uses but it really should be
using the argument.
Signed-off-by: Jesse Gross <jesse@kernel.org> Acked-by: Ben Pfaff <blp@ovn.org>
Andy Zhou [Fri, 29 Jul 2016 21:39:29 +0000 (14:39 -0700)]
ovsdb: Make OVSDB backup server read only
When ovsdb-server is running in the backup state, it would be nice to
make sure there are no unintended changes to the backup database.
This patch makes the ovsdb server accept only 'read' transactions as
a backup server. When the server role is changed into an active server,
all existing client connections will be reset. After reconnect, all
client transactions will then be accepted.
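As a rough sketch of the check, a transaction can be screened by the operations it contains. The classification of "insert", "update", "mutate" and "delete" as the write operations is an assumption made for this example, as are the function name and the error text.

```python
# Assumption: these OVSDB operations are the ones that modify the database.
WRITE_OPS = {"insert", "update", "mutate", "delete"}


def check_transaction(ops, role):
    """Return an error string if a backup server must refuse the transaction."""
    if role == "backup":
        for op in ops:
            if op["op"] in WRITE_OPS:
                return f'"{op["op"]}" not allowed on a backup server'
    return None
```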
Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
Andy Zhou [Thu, 28 Jul 2016 22:57:40 +0000 (15:57 -0700)]
ovsdb: Fix bug, set rpc to NULL after freeing.
Found by inspection.
Tested-by: Daniel Levy <dlevy@us.ibm.com>
Reported-at: http://openvswitch.org/pipermail/discuss/2016-August/022322.html Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
Andy Zhou [Thu, 28 Jul 2016 18:35:01 +0000 (11:35 -0700)]
ovsdb: Rename replication related variable names.
Current replication code refers to the other ovsdb-server instance as
a 'remote', which is overloaded in ovsdb.
Switching to use active/backup instead to make it less confusing.
Active is the server that should be servicing the client, backup
server is the server that boots with the --sync-from option.
Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
Ryan Moats [Mon, 15 Aug 2016 00:48:24 +0000 (19:48 -0500)]
Simplify partial map Py3 IDL test added by commit a59912a0
Commit a59912a0 ("python: Add support for partial map
and partial set updates") added unit tests for the partial
map function for the python IDL. However, because Python3
doesn't order dictionaries consistently, this
test is a crap shoot for systems that support Python3.
As a short term fix, do not use a dictionary with multiple
elements for the partial map test case.
Change-Id: Ibdec10ebd895051321b9bff7d9fe8a7e0bd9eb88 Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Numan Siddique [Thu, 11 Aug 2016 12:21:39 +0000 (17:51 +0530)]
ovn-controller: Reset flow processing after (re)connection to switch
When ovn-controller reconnects to the ovs-vswitchd, it deletes all the
OF flows in the switch. It doesn't install the flows again, leaving
the datapath broken unless ovn-controller is restarted or ovn-northd
updates the SB DB.
The reason for this is
- lflow_reset_processing() is not called after the reconnection
- the hmap "installed_flows" is not cleared, because of which
ofctrl_put skips adding the flows to the switch.
This patch fixes the issue and also adds a test case to test
this scenario.
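The failure mode can be reproduced with a toy model in which the controller only sends the difference between the desired flows and what it believes is installed. The class below is a hypothetical illustration, not ofctrl.c.

```python
class FlowInstaller:
    """Toy model: push only the delta between desired and installed flows."""

    def __init__(self):
        self.installed = set()  # what the controller believes is installed
        self.switch = set()     # what the switch actually holds

    def put(self, desired):
        for flow in desired - self.installed:
            self.switch.add(flow)
        for flow in self.installed - desired:
            self.switch.discard(flow)
        self.installed = set(desired)

    def on_reconnect(self, clear_installed):
        self.switch.clear()  # all flows are deleted on reconnection
        if clear_installed:
            self.installed.clear()  # the fix: forget the stale bookkeeping
```

Without clearing the bookkeeping, the next put() computes an empty delta and the switch stays empty; with it, every flow is installed again.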
Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Ben Pfaff [Sun, 14 Aug 2016 19:39:56 +0000 (12:39 -0700)]
ovs-ctl: Properly handle shell quoting in os-release.
Until now, this code did not strip "" or '' from variable assignments in
os-release. This fixes the problem.
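For comparison, the same quote handling can be sketched in Python with the standard shlex module, which applies shell-like quoting rules. ovs-ctl itself is a shell script, so this is only an illustration of the intended result; 'parse_os_release' is a hypothetical helper.

```python
import shlex


def parse_os_release(text):
    """Parse os-release style KEY=value lines, stripping shell quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        parts = shlex.split(raw)  # removes "" and '' like a shell would
        info[key] = " ".join(parts)
    return info
```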
Requested-by: Matt Mulsow <mamulsow@us.ibm.com>
Requested-at: https://github.com/openvswitch/ovs/pull/148 Fixes: c60d6b096436 ("ovs-ctl: support populating system info from /etc/os-release") Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Ryan Moats [Sat, 6 Aug 2016 22:46:30 +0000 (17:46 -0500)]
python: Add support for partial map and partial set updates
Allow the python IDL to use mutate operations more freely
by mimicking the partial map and partial set operations now
available in the C IDL.
Unit tests for both of these types of operations are included.
They are not carbon copies of the C tests, because testing
idempotency is a bit difficult for the current python IDL
test harness.
Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Commit f1ab6e06 ("Add/user partial set updates.") incorrectly
did not include HPE attribution for derived files
lib/ovsdb-set-op.[ch]. Add the attribution to correct this.
Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
but for columns that store sets of values rather than key-value
pairs. These columns will now be able to use the OVSDB mutate
operation to transmit deltas on the wire rather than use
verify/update and transmit wait/update operations on the wire.
Side effect of modifying the comments in the partial map update
tests.
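As an illustration of transmitting deltas, the sketch below derives OVSDB "mutate" mutations from the old and new contents of a set column, using the ["set", [...]] notation of the OVSDB wire format. The helper name and the column name in the usage note are hypothetical, and this is not the IDL's code.

```python
def set_mutations(column, old, new):
    """Build the mutations that turn the set 'old' into the set 'new'."""
    mutations = []
    added = sorted(new - old)
    removed = sorted(old - new)
    if added:
        mutations.append([column, "insert", ["set", added]])
    if removed:
        mutations.append([column, "delete", ["set", removed]])
    return mutations
```

For example, changing a hypothetical "addresses" column from {"a", "b"} to {"b", "c"} sends only the insertion of "c" and the deletion of "a", not the whole set.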
Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Numan Siddique [Fri, 5 Aug 2016 04:06:39 +0000 (09:36 +0530)]
ovn-northd: Add logical flows to support DHCPv6
OVN implements native DHCPv6. DHCPv6 options are stored
in the 'DHCP_Options' NB table and logical ports refer to this
table to configure the DHCPv6 options.
For each logical port configured with DHCPv6 options, the following flows
are added:
- A logical flow which copies the DHCPv6 options to the DHCPv6
request packets using the 'put_dhcpv6_opts' action and advances the
packet to the next stage.
- A logical flow which implements the DHCPv6 responder by sending
the DHCPv6 reply back to the inport once the 'put_dhcpv6_opts' action
is applied.
Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Numan Siddique [Fri, 5 Aug 2016 04:06:07 +0000 (09:36 +0530)]
ovn-controller: Add 'put_dhcpv6_opts' action in ovn-controller
This patch adds a new OVN action 'put_dhcpv6_opts' to support native
DHCPv6 in OVN.
ovn-controller parses this action and adds a NXT_PACKET_IN2
OF flow with 'pause' flag set and the DHCPv6 options stored in
'userdata' field.
When a valid DHCPv6 packet is received by ovn-controller, it frames a
new DHCPv6 reply packet with the DHCPv6 options present in the
'userdata' field and resumes the packet and stores 1 in the 1-bit subfield.
If the packet is invalid, it resumes the packet without any modification and
stores 0 in the 1-bit subfield.
A new 'DHCPv6_Options' table is added in SB DB which stores
the supported DHCPv6 options with DHCPv6 code and type. ovn-northd is
expected to populate this table.
An upcoming patch will add logical flows using this action.
Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
ovn-controller: Use UDP checksums when creating Geneve tunnels.
Currently metadata transmitted by OVN over Geneve tunnels is
unprotected by any checksum other than the one provided by the link
layer - this includes both the VNI and data stored in options. Turning
on UDP checksums which cover this data has obvious benefits in terms of
integrity protection.
In terms of performance, this actually significantly increases throughput
in most common cases when running on Linux based hosts without NICs
supporting Geneve offload (around 60% for bulk traffic). The reason is
that generally all NICs are capable of offloading transmitted and received
UDP checksums (viewed as ordinary UDP packets and not as tunnels). The
benefit comes on the receive side where the validated outer UDP checksum
can be used to additionally validate an inner checksum (such as TCP), which
in turn allows aggregation of packets to be more efficiently handled by
the rest of the stack.
Not all devices see such a benefit. The most notable exception is hardware
VTEPs (currently using VXLAN but potentially Geneve in the future). These
devices are designed to not buffer entire packets in their switching engines
and are therefore unable to efficiently compute or validate UDP checksums.
In addition certain versions of the Linux kernel are not able to fully
take advantage of Geneve capable NIC offloads in the presence of checksums.
(This is actually a pretty narrow corner case though - earlier versions of
Linux don't support Geneve offloads at all and later versions support both
offloads and checksums well.)
In order to avoid possible problems with these cases, efficient checksum
receive performance is exposed as an encap option in the southbound
database as a hint to remote senders. This currently defaults to off
for hardware VTEPs and on for all other cases.
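To make the integrity argument concrete, here is a minimal sketch of the 16-bit ones' complement Internet checksum that UDP uses. It is applied to raw bytes only; a real UDP checksum also covers a pseudo-header, which is omitted here as a simplification.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit ones' complement checksum over data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

A receiver that recomputes the checksum over the data plus the transmitted checksum gets zero for an intact packet, so a flipped bit in the VNI or in option data is detected.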
Signed-off-by: Jesse Gross <jesse@kernel.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Ben Pfaff <blp@ovn.org>
ovn-controller: Make encap processing more robust against changes.
Originally, processing of encapsulations simply iterated over all tables on
every wakeup and would replace anything that changed. This is somewhat
inefficient but it captured all changes.
Incremental processing avoided the need to do so much work but it could
miss several types of changes. In particular, it only monitored the chassis
table in the southbound database, so other changes (particularly in the
encap table) were not reflected. In addition, while it corrected some
changes to its data in OVS, others could go unnoticed.
This attempts to fix those issues by reflecting the most recent updates
to the southbound database in OVS at all times. It also increases safety
by avoiding the possibility of dangling pointers to old database rows and
eliminates the need to traverse the OVS database at all during most wakeups.
Fixes: 1d45d5a9 ("ovn-controller: Change encaps_run to work incrementally.") Signed-off-by: Jesse Gross <jesse@kernel.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Ben Pfaff <blp@ovn.org>
ovn-controller: Fix memory leak when updating tunnels.
When a tunnel possibly needs to be updated, we are currently allocating
a new name for it. This is not necessary and in fact nothing uses the
name, which then results in the memory being leaked.
Fixes: 1d45d5a9 ("ovn-controller: Change encaps_run to work incrementally.") Signed-off-by: Jesse Gross <jesse@kernel.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Ben Pfaff <blp@ovn.org>
Before the functions "ofctrl_run" and "pinctrl_run" are called, "br-int"
has already been checked. Removing the conditional statements in these
functions may make the code clearer.
Signed-off-by: nickcooper-zhangtonghao <nickcooper-zhangtonghao@opencloud.tech> Signed-off-by: Ben Pfaff <blp@ovn.org>
Ryan Moats [Wed, 3 Aug 2016 19:07:38 +0000 (19:07 +0000)]
ovsdb: Use better error message for "timeout" without waiting.
When setting a where clause, if the timeout is set to a value of 0,
the clause is tested once and if it fails, a message of '"wait" timed
out' is returned. This can be misleading because there wasn't any
real time, so change the message to '"where" clause test failed'.
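The choice between the two messages can be sketched as follows. The function is a hypothetical simplification that evaluates the condition once and assumes it still fails when a nonzero timeout expires; the message strings are the ones quoted above.

```python
def wait_error(timeout_ms, condition_holds):
    """Return the error message for a failed "wait", or None on success."""
    if condition_holds:
        return None
    if timeout_ms == 0:
        # Nothing was waited for, so "timed out" would be misleading.
        return '"where" clause test failed'
    return '"wait" timed out'
```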
Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Reported-by: Ryan Moats <rmoats@us.ibm.com>
Reported-at: http://openvswitch.org/pipermail/dev/2016-August/077083.html Fixes: f85f8ebb ("Initial implementation of OVSDB.") Signed-off-by: Ben Pfaff <blp@ovn.org>
ovn-controller: Add datapath-type and iface-types in chassis:external_ids
This patch reads the 'Bridge.datapath_type' column value of the integration
bridge and 'Open_vSwitch.iface_types' column value and sets these in the
external_ids:datapath-type and external_ids:iface-types of Chassis table.
This will provide hints to the CMS or clients monitoring OVN SB DB to
determine the datapath type (DPDK or non-DPDK) configured and take some
actions based on it.
One use case is that the OVN neutron plugin can use this information to set the
vif_type (ovs or vhostuser) during the port binding.
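A client of the southbound database might consume these hints roughly as follows. The sketch assumes that a datapath-type of "netdev" indicates the userspace (DPDK) datapath and that iface-types is a comma-separated list; both are assumptions for the example, and 'vif_type' is a hypothetical helper rather than the plugin's code.

```python
def vif_type(external_ids):
    """Pick a vif_type from a Chassis row's external_ids (toy logic)."""
    datapath_type = external_ids.get("datapath-type", "")
    iface_types = {
        t.strip()
        for t in external_ids.get("iface-types", "").split(",")
        if t.strip()
    }
    # Assumption: userspace datapath plus vhost-user support means vhostuser.
    if datapath_type == "netdev" and "dpdkvhostuser" in iface_types:
        return "vhostuser"
    return "ovs"
```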
Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Mark Kavanagh [Tue, 9 Aug 2016 16:01:20 +0000 (17:01 +0100)]
netdev-dpdk: add support for jumbo frames
Add support for Jumbo Frames to DPDK-enabled port types,
using single-segment-mbufs.
Using this approach, the amount of memory allocated to each mbuf
to store frame data is increased to a value greater than 1518B
(typical Ethernet maximum frame length). The increased space
available in the mbuf means that an entire Jumbo Frame of a specific
size can be carried in a single mbuf, as opposed to partitioning
it across multiple mbuf segments.
The amount of space allocated to each mbuf to hold frame data is
defined dynamically by the user with ovs-vsctl, via the 'mtu_request'
parameter.
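The relationship between an MTU and the frame length quoted above is simple arithmetic: a 1500-byte MTU plus a 14-byte Ethernet header and a 4-byte CRC gives 1518 bytes. The sketch below assumes an untagged frame (no VLAN header) and ignores any mbuf headroom; the helper names are hypothetical.

```python
ETHER_HDR_LEN = 14  # destination MAC, source MAC, EtherType
ETHER_CRC_LEN = 4   # frame check sequence
STANDARD_MAX_FRAME_LEN = 1518


def frame_len(mtu):
    """Frame length in bytes needed to carry a payload of 'mtu' bytes."""
    return mtu + ETHER_HDR_LEN + ETHER_CRC_LEN


def is_jumbo(mtu):
    """True if the frame no longer fits the typical Ethernet maximum."""
    return frame_len(mtu) > STANDARD_MAX_FRAME_LEN
```

With single-segment mbufs, a requested MTU of 9000 therefore needs room for at least 9018 bytes of frame data in each mbuf.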
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
[diproiettod@vmware.com rebased] Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
netdev: Make netdev_set_mtu() netdev parameter non-const.
Every provider silently drops the const attribute when converting the
parameter to the appropriate subclass. Might as well drop the const
attribute from the parameter, since this is a "set" function.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>
vswitchd: Introduce 'mtu_request' column in Interface.
The 'mtu_request' column can be used to set the MTU of a specific
interface.
This column is useful because it will allow changing the MTU of DPDK
devices (implemented in a future commit), which are not accessible
outside the ovs-vswitchd process, but it can be used for kernel
interfaces as well.
The current implementation of set_mtu() in netdev-dpdk is removed
because it's broken. It will be reintroduced by a subsequent commit on
this series.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ilya Maximets <i.maximets@samsung.com>
dpif-netdev: Fix -Wformat warning on 32-bit build.
Use the appropriate format specifier for size_t, otherwise the 32-bit
build fails.
Reported-at: https://travis-ci.org/openvswitch/ovs/jobs/151938383 Fixes: 3453b4d62a98("dpif-netdev: dpcls per in_port with sorted
subtables") Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Joe Stringer <joe@ovn.org>
Ciara Loftus [Wed, 10 Aug 2016 14:28:27 +0000 (15:28 +0100)]
netdev-dpdk: add DPDK pdump capability
This commit provides the ability to 'listen' on DPDK ports and save
packets to a pcap file with a DPDK app that uses the librte_pdump
library. One such app is the 'pdump' app that can be found in the DPDK
'app' directory. Instructions on how to use this can be found in
INSTALL.DPDK-ADVANCED.md
Pdump capability in OVS with DPDK will only be initialised if the
CONFIG_RTE_LIBRTE_PMD_PCAP=y and CONFIG_RTE_LIBRTE_PDUMP=y options are
set in DPDK. libpcap is required if the above configuration is used.
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
ovs-bugtool: Correct "rmdir" error messages during "make distcheck".
Remove duplicated delete attempts and error messages during distcheck
clean procedure.
The problem is that the following errors appear during the cleanup
procedure of distcheck:
rmdir: failed to remove ‘/openvswitch-2.5.90/_inst/share/openvswitch/bugtool-plugins/’: Directory not empty
rmdir: failed to remove ‘/openvswitch-2.5.90/_inst/share/openvswitch/bugtool-plugins/ovn/network-status ’: No such file or directory
The first entry is caused by an xml file that is kept flat in the directory
structure (not in a subdirectory, as it is for other plugins), so rmdir
"tries" to remove the folder that holds all plugin files and folders. That is
why an additional check for whether the directory is empty has been added.
The second entry is caused by some other commit when ovs plugin has been added:
stem=`echo "$$plugin" | sed 's,ovn/,,'`; \
So the directory path is modified during removal of the xml file, but it
was not updated for the directory removal.
I didn't really want to change this logic, as I'm not sure whether
something else can be stored in this directory, but it was very tempting to
remove everything just by:
rm -rf "$(DESTDIR)$(bugtoolpluginsdir)/*"
Signed-off-by: Michal Weglicki <michalx.weglicki@intel.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Jan Scheurich [Thu, 11 Aug 2016 10:02:27 +0000 (12:02 +0200)]
dpif-netdev: dpcls per in_port with sorted subtables
The user-space datapath (dpif-netdev) consists of a first level "exact match
cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With
many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient
and the OVS forwarding performance is determined by the megaflow classifier.
The megaflow classifier (dpcls) consists of a variable number of hash tables
(aka subtables), each containing megaflow entries with the same mask of
packet header and metadata fields to match upon. A dpcls lookup matches a
given packet against all subtables in sequence until it hits a match. As
megaflow cache entries are by construction non-overlapping, the first match
is the only match.
Today the order of the subtables in the dpcls is essentially random so that
on average a dpcls lookup has to visit N/2 subtables for a hit, when N is the
total number of subtables. Even though every single hash-table lookup is
fast, the performance of the current dpcls degrades when there are many
subtables.
How does the patch address this issue:
In reality there is often a strong correlation between the ingress port and a
small subset of subtables that have hits. The entire megaflow cache typically
decomposes nicely into partitions that are hit only by packets entering from
a range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM ->
Phy).
Therefore, maintaining a separate dpcls instance per ingress port with its
subtable vector sorted by frequency of hits reduces the average number of
subtables lookups in the dpcls to a minimum, even if the total number of
subtables gets large. This is possible because megaflows always have an exact
match on in_port, so every megaflow belongs to a unique dpcls instance.
For thread safety, the PMD thread needs to block out revalidators during the
periodic optimization. We use ovs_mutex_trylock() to avoid blocking the PMD.
To monitor the effectiveness of the patch we have enhanced the ovs-appctl
dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups
per hit" to report the average number of subtable lookups needed for a
megaflow match. Ideally, this should be close to 1 and in almost all cases much
smaller than N/2.
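The idea of a subtable vector sorted by hit frequency can be sketched as follows. This is an illustration only, not the dpif-netdev code: 'struct subtable', 'mask_id', 'sort_subtables' and 'lookup' are hypothetical names, and a plain array with qsort() stands in for the real data structures and for the per-port dpcls instance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dpcls subtable: 'mask_id' identifies the mask
 * shared by its entries, 'hit_cnt' counts hits since the last sort. */
struct subtable {
    int mask_id;
    unsigned int hit_cnt;
};

/* qsort() comparator: most frequently hit subtable first. */
static int
compare_hits_desc(const void *a_, const void *b_)
{
    const struct subtable *a = a_;
    const struct subtable *b = b_;

    return (a->hit_cnt < b->hit_cnt) - (a->hit_cnt > b->hit_cnt);
}

/* Periodic optimization step: re-sort by hit count, then reset counters. */
static void
sort_subtables(struct subtable *tables, size_t n)
{
    assert(tables != NULL || n == 0);
    qsort(tables, n, sizeof *tables, compare_hits_desc);
    for (size_t i = 0; i < n; i++) {
        tables[i].hit_cnt = 0;
    }
}

/* Visits subtables in vector order until 'mask_id' matches.  Returns the
 * number of subtables visited, or 0 if no subtable matches. */
static size_t
lookup(struct subtable *tables, size_t n, int mask_id)
{
    for (size_t i = 0; i < n; i++) {
        if (tables[i].mask_id == mask_id) {
            tables[i].hit_cnt++;
            return i + 1;
        }
    }
    return 0;
}
```

Averaging the return value of 'lookup' over all hits gives the kind of "avg. subtable lookups per hit" figure described above. After a sort, the subtable that received most of the traffic is visited first.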
The PMD tests have been adjusted to the additional line in pmd-stats-show.
We have benchmarked a L3-VPN pipeline on top of a VXLAN overlay mesh.
With pure L3 tenant traffic between VMs on different nodes the resulting
netdev dpcls contains N=4 subtables. Each packet traversing the OVS
datapath is subject to dpcls lookup twice due to the tunnel termination.
Disabling the EMC, we have measured a baseline performance (in+out) of ~1.45
Mpps (64 bytes, 10K L4 packet flows). The average number of subtable lookups
per dpcls match is 2.5. With the patch the average number of subtable lookups
per dpcls match is reduced to 1 and the forwarding performance grows by ~50%
to 2.13 Mpps.
Even with EMC enabled, the patch improves the performance by 9% (for 1000 L4
flows) and 34% (for 50K+ L4 flows).
As the actual number of subtables will often be higher in reality, we can
assume that this is at the lower end of the speed-up one can expect from this
optimization. Just running a parallel ping between the VXLAN tunnel endpoints
increases the number of subtables and hence the average number of subtable
lookups from 2.5 to 3.5 on master with a corresponding decrease of throughput
to 1.2 Mpps. With the patch the parallel ping has no impact on average number
of subtable lookups and performance. The performance gain is then ~75%.
Signed-off-by: Jan Scheurich <jan.scheurich@ericsson.com> Acked-by: Antonio Fischetti <antonio.fischetti@intel.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
ovn-northd: Only warn about peer as switch port when it really is one.
At the end of join_logical_ports(), some ovn_ports might not have been
bound as logical switch ports or logical router ports, but the code assumed
that they were and gave a confusing warning when the assumption was
violated.
Russell Bryant [Fri, 12 Aug 2016 17:28:48 +0000 (13:28 -0400)]
release-process: Use markdown table format.
Update the release process document to use markdown formatting for the
table used to describe the 6 month release schedule. This ensures it
is formatted correctly when converted to HTML on github and
openvswitch.org.
Signed-off-by: Russell Bryant <russell@ovn.org> Acked-by: Joe Stringer <joe@ovn.org>
Pravin B Shelar [Fri, 12 Aug 2016 02:27:12 +0000 (19:27 -0700)]
datapath: compat: keep skb mark across tunnel devices.
Older kernels' skb_scrub_packet() has a bug that resets the skb mark for
all packets. This was fixed in the 3.18 release, where the mark is reset
only for packets crossing namespaces. So OVS is forced to use the
compat skb_scrub_packet() on older kernels.
This is related to upstream bug fix commit ca7c7b9059e3
("skbuff: Do not scrub skb mark within the same name space").
VMware-BZ: #1710701 Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Joe Stringer <joe@ovn.org>
Joe Stringer [Fri, 12 Aug 2016 00:54:08 +0000 (17:54 -0700)]
ofproto-dpif-xlate: Fix VLOG_ERR_RL() call.
a716ef9a7a73 ("ofproto-dpif-xlate: Log flow in XLATE_REPORT_ERROR.")
inadvertently broke the build on clang due to improper passing of the ds
cstring into the VLOG() function:
error: format string is not a string literal
(potentially insecure) [-Werror,-Wformat-security]
XLATE_REPORT_ERROR(ctx, "over max translation depth %d", MAX_DEPTH);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
note: expanded from macro
'XLATE_REPORT_ERROR'
VLOG_ERR_RL(&error_report_rl, ds_cstr(&ds)); \
Reported-by: Daniele Di Proietto <diproiettod@vmware.com> Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
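The class of problem behind the -Wformat-security error above can be sketched with a generic printf-style logger. 'log_err' and 'log_prebuilt' are hypothetical helpers, not the VLOG macros. The sketch assumes the usual remedy of passing a pre-built message as an argument to a literal "%s" format.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical printf-style logger that renders into 'out' rather than a
 * log file, so that the result can be inspected. */
static int
log_err(char *out, size_t out_size, const char *format, ...)
{
    va_list args;
    int n;

    assert(out != NULL && format != NULL);
    va_start(args, format);
    n = vsnprintf(out, out_size, format, args);
    va_end(args);
    return n;
}

/* Logs a message that was built elsewhere (e.g. in a dynamic string).
 *
 * Calling log_err(out, out_size, msg) would use 'msg' as the format, so any
 * '%' inside it would be interpreted as a conversion.  Passing it as the
 * argument of a literal "%s" format copies it verbatim. */
static int
log_prebuilt(char *out, size_t out_size, const char *msg)
{
    return log_err(out, out_size, "%s", msg);
}
```

With this shape the format string is always a string literal, which is what the compiler diagnostic asks for.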
Joe Stringer [Thu, 11 Aug 2016 19:36:16 +0000 (12:36 -0700)]
ofproto-dpif-xlate: Log flow in XLATE_REPORT_ERROR.
To assist debugging pipelines when resubmit resource checks fail, print
the base_flow from the translation context. This base flow can then be
used from ofproto/trace to figure out which parts of the pipeline lead
to this translation error.
Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
Ben Pfaff [Thu, 11 Aug 2016 04:14:09 +0000 (21:14 -0700)]
ovs-bugtool: Switch from MD5 to SHA-256.
While going through a FIPS certification process we discovered that
ovs-bugtool uses MD5 to identify the contents of files. FIPS doesn't allow
use of the obsolete and broken MD5 algorithm, so this commit switches to
SHA-256.
In a way, this is a silly requirement. ovs-bugtool only uses MD5 to
identify file content, mostly to ensure that the contents of the bug report
have not been corrupted. MD5 is perfectly adequate for that purpose; in
fact a 16-bit CRC would probably be adequate. On the other hand, there is
basically no cost and no disadvantage to switching to SHA-256, so why not
do it? That's why I think that this is a reasonable change.
VMware-BZ: #1708786 Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Ilya Maximets [Wed, 10 Aug 2016 09:43:03 +0000 (12:43 +0300)]
netdev-dpdk: vhost: Fix double free and use after free with QoS.
While using QoS with vHost interfaces 'netdev_dpdk_qos_run__()' will
free mbufs while executing 'netdev_dpdk_policer_run()'. After
that, the same mbufs will be freed at the end of '__netdev_dpdk_vhost_send()'
if 'may_steal == true'. This behaviour will break the mempool.
Also 'netdev_dpdk_qos_run__()' will free packets even if we shouldn't
do this ('may_steal == false'). This will lead to the upper layers using
already freed packets.
Fix that by copying all packets that we can't steal, as is done
for DPDK_DEV_ETH devices, and freeing only packets not freed by QoS.
Fixes: 0bf765f753fd ("netdev_dpdk.c: Add QoS functionality.") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com> Tested-by: Ian Stokes <ian.stokes@intel.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
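The ownership rule described in the fix above can be sketched with ordinary heap allocations. This is a simplified illustration under stated assumptions, not the netdev-dpdk code: 'struct packet', 'policer_run', 'send_batch', 'MAX_BATCH' and the 'live_packets' counter are all hypothetical, and a length check stands in for the real policer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

enum { MAX_BATCH = 32 };

/* Hypothetical packet type; 'live_packets' tracks outstanding allocations
 * so that leaks and double frees show up in the count. */
struct packet {
    size_t len;
};

static int live_packets;

static struct packet *
packet_new(size_t len)
{
    struct packet *p = malloc(sizeof *p);

    assert(p != NULL);
    p->len = len;
    live_packets++;
    return p;
}

static void
packet_free(struct packet *p)
{
    free(p);
    live_packets--;
}

/* Stand-in policer: frees packets longer than 'max_len' and compacts the
 * survivors to the front of 'pkts'.  Returns the number of packets kept. */
static size_t
policer_run(struct packet **pkts, size_t n, size_t max_len)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (pkts[i]->len > max_len) {
            packet_free(pkts[i]);
        } else {
            pkts[kept++] = pkts[i];
        }
    }
    return kept;
}

/* Sends a batch.  If 'may_steal' is false the caller keeps ownership, so
 * the policer and the final free only ever touch private copies.  Returns
 * the number of packets that passed the policer. */
static size_t
send_batch(struct packet **pkts, size_t n, size_t max_len, bool may_steal)
{
    struct packet *own[MAX_BATCH];
    size_t kept;

    assert(n <= MAX_BATCH);
    for (size_t i = 0; i < n; i++) {
        own[i] = may_steal ? pkts[i] : packet_new(pkts[i]->len);
    }
    kept = policer_run(own, n, max_len);
    for (size_t i = 0; i < kept; i++) {
        packet_free(own[i]);    /* "Transmitted": freed exactly once. */
    }
    return kept;
}
```

Only the packets that survive the policer are freed at the end, so nothing is freed twice, and a caller that passes 'may_steal == false' can keep using its packets afterwards.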
Darrell Ball [Tue, 9 Aug 2016 02:20:38 +0000 (19:20 -0700)]
ovn: Fix receive from vxlan in ovn-controller.
The changes enable source node replication in OVN for receive from vxlan
tunnels. OVN only supports source node replication mode. This is needed
for ovn-controller to interoperate with hardware switches.
Previously, hardware vtep interaction, which uses service node
replication by default for multicast/broadcast/unknown unicast traffic,
partially "worked" by happenstance. Because of limited vxlan
encapsulation metadata, received packets were resubmitted to find
the egress port(s). This is not correct for multicast, broadcast and
unknown unicast traffic as traffic will get resent on the tunnel mesh.
ovn-controller is changed not to send traffic received from vxlan
tunnels out the tunnel mesh again. Traffic received from vxlan tunnels is
now only sent locally as intended with obvious benefits. This behavior is
newly documented in ovn-architecture.7.xml.
To support keeping state for receipt from a vxlan tunnel, a MFF logical
flags register flag is allocated.
As part of this change ovn-controller-vtep is hard-coded to set the
replication mode of each logical switch to source node as OVN will only
support source node replication.
I failed to see that lib/dpif-netdev.c actually needs the concurrency
provided by pvector prior to this change. More specifically, when a
subtable is removed, concurrent lookups may skip over another subtable
swapped in to the place of the removed subtable in the vector.
Since this was the only use of the non-concurrent pvector, it is
cleaner to revert the whole patch.
Reported-by: Jan Scheurich <jan.scheurich@ericsson.com> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
Ryan Moats [Thu, 28 Jul 2016 22:17:41 +0000 (22:17 +0000)]
ovn-controller: Persist desired conntrack groups.
With incremental processing of logical flows desired conntrack groups
are not being persisted. This patch adds this capability, with the
side effect of adding a ds_clone method that this capability leverages.
Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Reported-by: Guru Shetty <guru@ovn.org>
Reported-at: http://openvswitch.org/pipermail/dev/2016-July/076320.html Fixes: 70c7cfe ("ovn-controller: Add incremental processing to lflow_run and physical_run") Acked-by: Flavio Fernandes <flavio@flaviof.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
This commit builds upon some of the recent ovs-ctl changes to build a
more integrated systemd setup. A new service (ovs-vswitchd) is
added to track the ovs-vswitchd daemon, and the ovsdb-server service is reserved
for the ovsdb-server daemon. The systemd scripts still use ovs-ctl to
actually initialize the daemons.
rhel/ovsdb-server.service: Rename the nonetwork service
Currently, openvswitch.service calls out to start
openvswitch-nonetwork.service. However, openvswitch-nonetwork.service
will be called ovsdb-server, so that it is a bit more reflective of
the dependencies. This commit does make the file a bit of a misnomer as
currently the ovsdb-server SERVICE will start the ovs-vswitchd service
as well. A future commit will clean this up, and change the ifup
configuration in the process.
Panu Matilainen [Wed, 10 Aug 2016 11:16:14 +0000 (14:16 +0300)]
ovs-ctl: support populating system info from /etc/os-release
On systemd-era hosts, OS name and version are available in sanitized
format from /etc/os-release(5) without resorting to calling (and thus
requiring) lsb_release. Support populating system-type and system-version
from /etc/os-release, prefer it over lsb_release, but permit overriding
via the OVS-specific system-type.conf and system-version.conf.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1350550 Signed-off-by: Panu Matilainen <pmatilai@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
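The lookup order described above (explicit override first, then /etc/os-release, then lsb_release) and the KEY=value format of os-release can be sketched as follows. ovs-ctl itself is a shell script, so this C version is only an illustration of the logic. The function names are hypothetical, and reading the files and running lsb_release are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* If 'line' has the form KEY=value, KEY="value" or KEY='value' for the
 * given 'key', copies the value (without quotes or trailing newline) into
 * 'out', truncating if needed, and returns true.  Otherwise returns false
 * and leaves 'out' untouched. */
static bool
parse_os_release_line(const char *line, const char *key,
                      char *out, size_t out_size)
{
    size_t key_len = strlen(key);
    const char *value;
    size_t len;

    assert(out != NULL && out_size > 0);
    if (strncmp(line, key, key_len) != 0 || line[key_len] != '=') {
        return false;
    }
    value = line + key_len + 1;
    len = strcspn(value, "\r\n");
    if (len >= 2 && (value[0] == '"' || value[0] == '\'')
        && value[len - 1] == value[0]) {
        value++;
        len -= 2;
    }
    if (len >= out_size) {
        len = out_size - 1;
    }
    memcpy(out, value, len);
    out[len] = '\0';
    return true;
}

/* Returns the first non-empty candidate in order of preference: explicit
 * override, then the os-release value, then the lsb_release value. */
static const char *
pick_system_info(const char *override, const char *os_release,
                 const char *lsb_release)
{
    if (override && override[0]) {
        return override;
    }
    if (os_release && os_release[0]) {
        return os_release;
    }
    return lsb_release;
}
```

For the system type one would look up the ID key, and for the system version the VERSION_ID key. Both are standard os-release fields, but which fields ovs-ctl actually reads is an assumption here.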
Ilya Maximets [Mon, 8 Aug 2016 11:19:45 +0000 (14:19 +0300)]
netdev-dpdk: Avoid reconfiguration on reconnection of same vhost device.
Binding/unbinding of virtio driver inside VM leads to reconfiguration
of PMD threads. This behaviour may be abused by executing bind/unbind
in an infinite loop to break normal networking on all ports attached
to the same instance of Open vSwitch.
Fix that by avoiding reconfiguration if it's not necessary.
The number of queues will not be decreased to 1 on device disconnection, but
that is not very important in comparison with a possible DoS attack from
inside the guest OS.
Fixes: 81acebdaaf27 ("netdev-dpdk: Obtain number of queues for vhost
ports from attached virtio.") Reported-by: Ciara Loftus <ciara.loftus@intel.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
When egress policer is set as a QoS type for a port, an error may occur during
setup if incorrect parameters are used for the rte_meter. If this occurs
the egress policer construct and set functions should free any allocated
memory relevant to the policer and set the QoS configuration pointer to
null. The netdev_dpdk_set_qos function should check the error value returned
for any QoS construct/set calls with an assertion to avoid segfault.
Also this commit modifies egress_policer_qos_set() to correctly lock the QoS
spinlock while the egress policer configuration is updated to avoid
segfault.
Signed-off-by: Ian Stokes <ian.stokes@intel.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
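The error-path rule described above (free what was allocated and leave the configuration pointer null) can be sketched as follows. 'struct policer' and 'policer_construct' are hypothetical, the parameter check stands in for whatever validation the rte_meter setup performs, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical policer configuration: committed rate and burst size. */
struct policer {
    unsigned long cir;
    unsigned long cbs;
};

/* Allocates and initializes a policer.  On success stores it in '*confp'
 * and returns 0.  On failure frees anything allocated, stores NULL in
 * '*confp' and returns a positive errno value, so that a caller can never
 * be left holding a pointer to a half-initialized or freed object. */
static int
policer_construct(struct policer **confp, unsigned long cir,
                  unsigned long cbs)
{
    struct policer *p;

    assert(confp != NULL);
    p = malloc(sizeof *p);
    if (!p) {
        *confp = NULL;
        return ENOMEM;
    }
    if (cir == 0 || cbs == 0) {
        free(p);
        *confp = NULL;
        return EINVAL;
    }
    p->cir = cir;
    p->cbs = cbs;
    *confp = p;
    return 0;
}
```

A caller checks the return value before dereferencing the pointer, and in the failure case the null pointer makes accidental use easy to detect.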
INSTALL.DPDK: Update documentation for DPDK 16.07 support
Replace 'dpdk_nic_bind.py' references with 'dpdk-devbind.py'. The script
name is changed in DPDK 16.07 as the script can also be used on crypto
devices along with NICs.
Update the command for setting packet forwarding mode in 'testpmd' app
from 'set fwd mac_retry' to 'set fwd mac retry'.