Remove flows from ofproto data structures in the 'start' phase, even if
we may need to add them back in the 'revert' phase.
This makes bundled group mods easier, as a group delete may also
delete flows, and we need the referring flows to be updated in the
'start' phase so that we will not have stale references to the
referring flows.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
Make groups RCU protected and make group lookups lockless. While this
makes group lookups perform better, the main motivation is to have a
unified memory management model for versioned data supported in
OpenFlow bundles. Later patches will make groups versioned and add
bundle support for groups.
Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
Recent commits reorganizing bindings handling and also moving ct zone
assignment to ovn-controller.c caused ct zone assignment to no longer
work. The code relies on an "all_lports" sset that should contain all
logical ports that we should be assigning ct zones for. Prior to this
change, all_lports was always empty.
Signed-off-by: Babu Shanmugam <bschanmu@redhat.com> Co-authored-by: Russell Bryant <russell@ovn.org> Signed-off-by: Russell Bryant <russell@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
netdev-*: Do not use dp_packet_pad() in recv() functions.
All the netdevs used by dpif-netdev (except for netdev-dpdk) have a
dp_packet_pad() call in the receive function, probably because the
userspace datapath couldn't properly handle short packets.
This doesn't appear to be the case anymore.
This commit removes the call to have a more consistent behavior with the
kernel datapath.
All the testsuite changes in this commit adjust the expectations for
packet lengths in flow dumps and other stats. There's only one fix in
ovn.at: one of the test_ip() functions generated an incomplete udp
packet, which was not a problem until now, because of the padding.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Ben Pfaff <blp@ovn.org>
Russell Bryant [Fri, 29 Jul 2016 18:51:07 +0000 (14:51 -0400)]
travis: Fix flake8 failures from flake8 3.0.
The "hacking" plugin for flake8 is not currently compatible with flake8
3.0. Ensure that we install flake8 2.x on travis-ci. Also update the
docs to indicate this incompatibility.
Signed-off-by: Russell Bryant <russell@ovn.org> Acked-by: Andy Zhou <azhou@ovn.org>
Joe Stringer [Fri, 29 Jul 2016 00:09:38 +0000 (17:09 -0700)]
fedora: Prioritize OVS modules in weak-updates.
Out-of-tree modules are installed into the kernel's "extra" modules
directory for the version that kmod-openvswitch is compiled against. For
all other kernels on the system at install time, a symlink is created in
the "weak-updates" directory. This provides a path for the same kernel
module to be used when minor kernel updates are done on a system.
However, without updating the depmod configuration the weak-update will
not be prioritized, so modprobe will switch back to using upstream
kernel modules when you upgrade. This patch introduces that depmod
configuration to ensure that the out-of-tree module is always used when
it is installed, regardless of kernel upgrades.
Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org>
Joe Stringer [Fri, 29 Jul 2016 00:09:37 +0000 (17:09 -0700)]
rhel: Prioritize our vport-foo modules in depmod.
We've done the same for openvswitch.ko previously, but we really should
be doing this for vport modules as well; otherwise, depmod may try to
pair upstream vport modules with the out-of-tree openvswitch module
(leading to depmod warnings on package install, and failure to load the
module at runtime).
VMware-BZ: #1700293 Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org>
Allow clients to use the whole priority range. Note that this changes
the semantics of PVECTOR_FOR_EACH_PRIORITY so that the iteration still
continues for entries at the given priority.
Suggested-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Jarno Rajahalme <jarno@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
datapath-windows: Update OvsReadEventCmdHandler in Datapath.c to support different events
OvsReadEventCmdHandler must now reflect the right event being read. If the
event is a Conntrack related event, then convert the entry to netlink
format and send it to userspace. If it's a Vport event, retain the existing
workflow.
Signed-off-by: Sairam Venugopal <vsairam@vmware.com> Acked-by: Paul Boca <pboca@cloudbasesolutions.com> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
datapath-windows: Add support for multiple event queue in Event.c
Update Event.c to have multiple event queues and a mechanism to retrieve the
associated queue. Introduce OvsPostCtEvent and OvsRemoveCtEventEntry
similar to OvsPostVportEvent and OvsRemoveVportEventEntry.
Signed-off-by: Sairam Venugopal <vsairam@vmware.com> Acked-by: Paul Boca <pboca@cloudbasesolutions.com> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com> Acked-By: Yin Lin <linyi@vmware.com> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
datapath-windows: Modify OvsCreateNlMsgFromCtEntry to make it reusable
Tweak the OvsCreateNlMsgFromCtEntry() method to reuse it for creating
netlink messages from other files. Also define the function in Conntrack.h
to make it accessible.
Signed-off-by: Sairam Venugopal <vsairam@vmware.com> Acked-By: Yin Lin <linyi@vmware.com> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com> Acked-By: Yin Lin <linyi@vmware.com> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
datapath-windows: Define new multicast conntrack events and netlink protocol
The Hyper-V datapath supports NETLINK_GENERIC and NETLINK_NETFILTER
protocols for netlink communication. Define these two protocols in the
datapath.
Define new Conntrack events (new and delete) and add support for
subscribing to these events. Parse out OVS_NL_ATTR_MCAST_GRP and store it
as part of OVS_EVENT_SUBSCRIBE structure.
Signed-off-by: Sairam Venugopal <vsairam@vmware.com> Acked-By: Yin Lin <linyi@vmware.com> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
datapath-windows: Fix bugs in Event.c around subscribe and lock
When userspace tries to resubscribe to an existing queue, return
STATUS_INVALID_PARAMETER since it's not supported. The current bug
overwrites status to STATUS_SUCCESS.
The second bug fix is around releasing the EventQueue lock if an open
instance couldn't be found. The current version returns without
releasing the lock. Move the OvsAcquireEventQueueLock() call after the
instance is verified.
OvsGetOpenInstance does not enforce a safe read for
gOvsSwitchContext->dpNo. Use the gOvsSwitchContext->dispatchLock for
accessing the parameter.
Signed-off-by: Sairam Venugopal <vsairam@vmware.com> Acked-By: Yin Lin <linyi@vmware.com> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
datapath-windows: Explicitly name vport related event to vportEvent
OVS_EVENT_ENTRY currently handles only Vport related events. Updating the
name of the struct to OVS_VPORT_EVENT_ENTRY. Remove OVS_EVENT_STATUS since
it's currently not in use. Update the datapath to refer to events as
vportEvents. This will aid in the introduction of other events.
Signed-off-by: Sairam Venugopal <vsairam@vmware.com> Acked-By: Yin Lin <linyi@vmware.com> Acked-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com> Acked-By: Yin Lin <linyi@vmware.com> Signed-off-by: Gurucharan Shetty <guru@ovn.org>
Ben Pfaff [Wed, 27 Jul 2016 06:50:06 +0000 (23:50 -0700)]
tests: Remove most packet-forwarding related "sleep"s from OVN tests.
These arbitrary sleeps are often longer than necessary and can be too short
in corner cases. This commit replaces them by a common macro that waits
only as long as necessary.
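As an illustration of the idea (not the macro the test suite actually uses), a wait-until helper polls a condition and returns as soon as it holds, instead of sleeping for a fixed time. The name `wait_until` and its parameters are assumptions made for this sketch.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.01):
    """Poll 'predicate' until it returns True or 'timeout' seconds pass.

    Returns True as soon as the condition holds, so the caller waits only
    as long as necessary; returns False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Compared with a fixed sleep, the fast case finishes after a few polls, while the slow corner case is still covered by the (longer) timeout.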
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com> Acked-by: Flavio Fernandes <flavio@flaviof.com>
Ben Pfaff [Wed, 27 Jul 2016 06:18:12 +0000 (23:18 -0700)]
ovn: Make two end-to-end tests more reliable.
These tests change the northbound configuration and then immediately check
that the changes have taken effect on the hypervisors. This can't work
reliably, so add a sleep to each one.
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Russell Bryant [Thu, 28 Jul 2016 21:22:41 +0000 (17:22 -0400)]
ovn-controller: Remove old values from local_ids.
local_ids is supposed to be the set of interface iface-id values from
this chassis that correspond to OVN logical ports. We use this for
detecting when an interface has been removed as well as if child-ports
should be bound to this chassis.
Old values were not being removed from local_ids. The most immediate
effect of this was that once an interface has been removed from a
chassis, we would think a removal has occurred *every* time through
binding_run and trigger the full binding processing. This was
a performance problem.
The second problem this would cause is if a port that had child ports
was moved to another chassis. We would end up with two chassis fighting
over the binding of the child ports.
Signed-off-by: Russell Bryant <russell@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Added an IPv4 and MAC address management system to ovn-northd. When a logical
switch's other_config:subnet field is set, logical ports attached to that
switch that have the keyword "dynamic" in their addresses column will
automatically be allocated a globally unique MAC address/unused IPv4 address
within the provided subnet. The allocated address will populate the
dynamic_addresses column. This can be useful for a user who wants to deploy
many VMs or containers with networking capabilities, but does not care about
the specific MAC/IPv4 addresses that are assigned.
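A minimal sketch of the IPv4 half of this idea, assuming the simplest possible policy (hand out the first unused host address in the subnet). The function name `allocate_dynamic` is hypothetical, and the real ovn-northd allocation policy and MAC generation are not described here.

```python
import ipaddress


def allocate_dynamic(subnet, in_use):
    """Return the first host address in 'subnet' that is not in 'in_use'.

    'in_use' is a set of ipaddress.IPv4Address objects and is updated in
    place.  Returns None when the subnet is exhausted.
    """
    network = ipaddress.ip_network(subnet)
    for host in network.hosts():
        if host not in in_use:
            in_use.add(host)
            return host
    return None


# Example: one address is already taken, so the next free one is chosen.
used = {ipaddress.ip_address("192.168.1.1")}
print(allocate_dynamic("192.168.1.0/30", used))
```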
When ifdown isn't executed (the system didn't shut down properly),
ports remain in the openvswitch database. In that case, an
inconsistency is left behind when the ifcfg was modified, because
ovs-vsctl won't do anything to update an existing port's configuration
in the database.
ifup/ifdown will operate only on configured interfaces, so
this patch fixes the issue by deleting the port from the database
before attempting to configure it with a fresh configuration.
Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Joe Stringer <joe@ovn.org>
Russell Bryant [Tue, 26 Jul 2016 20:29:25 +0000 (16:29 -0400)]
ovn: Rename "gateway" to "l3gateway".
When L3 gateway support was added, it introduced a port type called
"gateway" and a corresponding option called "gateway-chassis". Since
that time, we also have an L2 gateway port type called "l2gateway" and a
corresponding option called "l2gateway-chassis". This patch renames the
L3 gateway port type and option to "l3gateway" and "l3gateway-chassis"
to make things a little more clear and consistent.
Ryan Moats [Thu, 28 Jul 2016 16:53:03 +0000 (16:53 +0000)]
ovn: Add ovn-controller-vtep debian package
Having a separate debian package for deploying
the ovn-controller-vtep binary enables the ability
to assign specific nodes the role of communicating
with VTEP enabled TORs.
Change-Id: Ia36aea7d89bd011a57918820b2a9f6e3469b3e04 Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Ryan Moats [Thu, 28 Jul 2016 18:10:16 +0000 (18:10 +0000)]
ovn-controller: Clean up cases that lead to duplicate OF flows.
In physical_run, there are multiple places where OF flows can be
produced each cycle. Because the desired flow table may not have
been completely cleared first, remove flows created during previous
runs before creating new flows. This avoids collisions.
Signed-off-by: Ryan Moats <rmoats@us.ibm.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
route-table: flush addresses list when route table is reset
When the route table is reset, the addresses list may be out of date, as we race
for the many netlink socket notifications.
A quick fix for this is flushing the addresses list, before dumping the routes
and gathering source addresses for them.
That way, instead of using invalid source addresses or preventing an entry from
being added because of missing source addresses, repeated tests showed the correct
entry is always added.
As route-table.c is only built for Linux, we don't need to be concerned that
Windows does not have netdev_get_addrs_list_flush, since it uses
route-table-stub.c instead.
Fixes: a8704b502785 ("tunneling: Handle multiple ip address for given device.") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
doc: Update INSTALL.Docker.md to reflect its focus on OVN
While reading this document, the title stood out to me as not
accurate. The title indicates it will discuss how to use
Open vSwitch with Docker, but in reality, it's about using
Open Virtual Networking with Docker.
This change updates the title, as well as the opening paragraphs
to more accurately reflect what the document is talking about.
From the connection tracker perspective, an ICMP connection is a tuple
identified by source ip address, destination ip address and ICMP id.
While this allows basic ICMP traffic (pings) to work, it doesn't take
into account the icmp type: the connection tracker will allow
requests/replies in any direction.
This is improved by making the ICMP type and code part of the connection
tuple. An ICMP echo request packet from A to B, will create a
connection that matches ICMP echo request from A to B and ICMP echo
replies from B to A. The same is done for timestamp and info
request/replies, and for ICMPv6.
A new module, conntrack-icmp, is implemented to allow only "request"
types to create new connections.
Also, since they're tracked in both userspace and kernel
implementations, ICMP type and code are always printed in ct-dpif (a few
testcases are updated as a consequence).
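The tuple described above can be sketched as follows. This is an illustration only, not the conntrack-icmp code: the class, the state strings and the lookup structure are assumptions, and only ICMPv4 request/reply type pairs (echo 8/0, timestamp 13/14, info 15/16) are covered.

```python
from collections import namedtuple

# Connection key: ICMP type and code are part of the tuple.
Key = namedtuple("Key", "src dst icmp_id icmp_type icmp_code")

# ICMPv4 request type -> matching reply type (echo, timestamp, info).
REPLY_TYPE = {8: 0, 13: 14, 15: 16}


class IcmpTracker:
    def __init__(self):
        self.conns = set()  # keys for both directions of each connection

    def process(self, src, dst, icmp_id, icmp_type, icmp_code=0):
        key = Key(src, dst, icmp_id, icmp_type, icmp_code)
        if key in self.conns:
            return "established"
        if icmp_type not in REPLY_TYPE:
            # Only "request" types may create a new connection.
            return "invalid"
        # Original direction: the request from src to dst.
        self.conns.add(key)
        # Reverse direction: the matching reply from dst back to src.
        self.conns.add(Key(dst, src, icmp_id, REPLY_TYPE[icmp_type], icmp_code))
        return "new"
```

With this key, an echo reply from A to B is rejected even after A has pinged B, because only the reply from B to A is part of the connection.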
Reported-by: Subramani Paramasivam <subramani.paramasivam@wipro.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Joe Stringer <joe@ovn.org>
This commit implements the OVS_ACTION_ATTR_CT action in dpif-netdev.
To allow ofproto-dpif to detect the conntrack feature, flow_put will no
longer discard flows with ct_* fields set. We still shouldn't allow
flows with NAT bits set, since there is no support for NAT.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Antonio Fischetti <antonio.fischetti@intel.com>
This introduces a very limited but simple benchmark for
conntrack_execute(). It just repeatedly sends the same batch of packets
through the connection tracker and returns the time spent to process
them.
While this is not a realistic benchmark, it has proven useful during
development to evaluate different batching and locking strategies.
This commit adds a thread that periodically removes expired connections.
The expiration time of a connection can be expressed by:
expiration = now + timeout
For each possible 'timeout' value (there aren't many) we keep a list.
When the expiration is updated, we move the connection to the back of the
corresponding 'timeout' list. This way, the list is always ordered by
'expiration'.
When the cleanup thread iterates through the lists for expired
connections, it can stop at the first non expired connection.
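The scheme above can be sketched as follows. This is an illustrative model rather than the actual conntrack code; the class and method names are made up, and 'now' is assumed to never go backwards (which is what keeps each list ordered).

```python
from collections import OrderedDict


class ExpiryLists:
    """One insertion-ordered list per distinct 'timeout' value."""

    def __init__(self, timeouts):
        self.lists = {t: OrderedDict() for t in timeouts}
        self.where = {}  # connection -> timeout list it currently lives in

    def update(self, conn, timeout, now):
        # Remove the connection from whatever list it was on, then append
        # it to the back of the list for 'timeout'.  Since 'now' only
        # grows, each list stays sorted by expiration.
        old = self.where.get(conn)
        if old is not None:
            del self.lists[old][conn]
        self.lists[timeout][conn] = now + timeout
        self.where[conn] = timeout

    def sweep(self, now):
        """Remove and return expired connections."""
        expired = []
        for bucket in self.lists.values():
            while bucket:
                conn, expiration = next(iter(bucket.items()))
                if expiration > now:
                    break  # first non-expired entry: the rest are newer
                del bucket[conn]
                del self.where[conn]
                expired.append(conn)
        return expired
```

The sweep cost is proportional to the number of expired connections plus the number of distinct timeout values, not to the total number of tracked connections.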
Suggested-by: Joe Stringer <joe@ovn.org> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Joe Stringer <joe@ovn.org>
It is a connection tracker that resides entirely in userspace. Its
primary user will be the dpif-netdev datapath.
The module's main goal is to provide conntrack_execute(), which offers a
convenient interface to implement the datapath ct() action.
The conntrack module uses two submodules to deal with the l4 protocol
details (conntrack-other for UDP and ICMP, conntrack-tcp for TCP).
The conntrack-tcp submodule implementation is adapted from FreeBSD's pf
subsystem, therefore it's BSD licensed. It has been slightly altered to
match the OVS coding style and to allow the pickup of already
established connections.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Antonio Fischetti <antonio.fischetti@intel.com> Acked-by: Joe Stringer <joe@ovn.org>
Linux and FreeBSD have slightly different names for these constants.
Windows doesn't define them. It is simpler to redefine them from
scratch for OVS. The new names are different than those used in Linux
and FreeBSD.
These definitions will be used by a future commit.
Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Mark Kavanagh [Tue, 26 Jul 2016 13:19:17 +0000 (14:19 +0100)]
netdev-provider: fix comments for netdev_rxq_recv
Commit 64839cf43 applies batch objects to netdev-providers, but
some comments were not updated accordingly. Fix these:
- replace 'pkts' with 'batch'
- replace '*cnt' with 'batch->count'
- replace MAX_RX_BATCH with NETDEV_MAX_BURST
- remove superfluous whitespace
Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com> Acked-by: William Tu <u9012063@gmail.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
python: Send old values of the updated cols in notify for update2
When python IDL calls the "notify" function after processing the "update2"
message from ovsdb-server, it is supposed to send the old values of the
updated columns as the last parameter. But the recent commit "897c8064"
sends the updated values. This breaks the behaviour.
This patch fixes this issue. It also updates the description of
the 'updates' param of the notify function to make it more clear.
Fixes: 897c8064 ("python: move Python idl to work with monitor_cond") Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
netdev: do not allow devices to be opened with conflicting types
When a device is already opened, netdev_open should verify that the types match,
or else return an error.
Otherwise, users might expect to open a device with a certain type and get a
handle belonging to a different type.
This also prevents certain conflicting configurations that would have a port of
a certain type in the database and one of a different type on the system.
For example, when adding an interface with a type other than system, and there
is already a system interface with the same name, as the routing table will hold
a reference to that system interface, some conflicts will arise. The netdev will
be opened with the incorrect type and that will make vswitchd remove it, but
adding it again will fail as it already exists. Failing earlier prevents some
vswitchd loops in reconfiguring the interface.
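A simplified model of the check, assuming a registry keyed by device name. The class is hypothetical and only mirrors the behaviour described in these entries: an unspecified type defaults to "system" for a device that is not open yet and accepts whatever type an already-open device has, while an explicit mismatching type is an error.

```python
class NetdevRegistry:
    """Toy stand-in for the set of currently open netdevs."""

    def __init__(self):
        self.open_devs = {}  # device name -> type it was opened with

    def open(self, name, type_=None):
        current = self.open_devs.get(name)
        if current is None:
            # Not opened yet: an unspecified type defaults to "system".
            self.open_devs[name] = type_ or "system"
            return self.open_devs[name]
        if type_ is not None and type_ != current:
            # Already opened with a different type: fail early.
            raise ValueError(
                "%s already opened as %s, not %s" % (name, current, type_))
        return current
```

Callers that only need generic operations can pass no type and reuse the existing handle, while callers that ask for a specific type get an error instead of a handle of the wrong type.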
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
dpif-netdev: use the open_type when creating the local port
Instead of using the internal type, use the port_open_type when creating the
local port. That makes sure that whenever dpif_port_query is used, the netdev
open_type is returned instead of the "internal" type.
For other ports, that is already the case, as the netdev type is used when
creating the dp_netdev_port.
That changes the output of dpctl when showing the local port, and also when
trying to change its type. So, corresponding tests are fixed.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
netdev-vport: don't use system type when opening netdev
tunnel_check_status_change__ calls netdev_open with type system. Using NULL
instead will default to system in case the device is not opened yet, and allow a
different type in case it's already opened.
Any type should be fine, as netdev_get_carrier will work with any of them.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
in-band: don't use system type when opening netdev
A netdev might be already opened with a different type and that can be used
instead. The system type is already the default type that will be used when
there is no netdev opened and the type is not specified.
And as long as the opened netdev supports the required operations, its type
doesn't matter.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
in-band: use open_type when opening internal device
in-band code will open a device that it expects to be the main internal port of
the bridge. However, it's possible that the correct type is a different one. For
dpif-netdev, it might be a tap device, or a dummy device for dummy datapaths.
ofproto_port_open_type will give the correct type.
While this doesn't cause any problems right now, as the needed type would be
opened already, a later patch assumes netdev with different types cannot be
opened.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
This patch is done to enable in tree building of the ovn-utils python
wrapper. This is similar to what was done in commit ee89ea7b477bb4fd05137de03b2e8443807ed9f4 (json: Move from lib to
include/openvswitch.).
Signed-off-by: Aaron Rosen <aaronorosen@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Ryan Moats [Mon, 25 Jul 2016 16:28:52 +0000 (16:28 +0000)]
physical: Persist tunnels from one ovn-controller loop to the next.
While commit ab39371d68842b7e4000cc5d8718e6fc04e92795
(ovn-controller: Handle physical changes correctly) addressed
unit test failures, it did so at the cost of performance: [1]
notes that ovn-controller cpu usage is now pegged at 100%.
The root cause is that while the storage for tunnels is
persisted, their creation is not (which the above change
incorrectly assumed was the case). This patch persists
tunneled data across invocations of physical_run. A side
effect is the renaming of the localfvif_map_changed variable
to physical_map_changed and the extension of its scope to include
tunnel changes.
Andy Zhou [Tue, 26 Jul 2016 02:23:02 +0000 (19:23 -0700)]
ovsdb: Fix memory leak in replication logic
Release the memory of reply message of the initial "monitor" request.
Reported-at: http://openvswitch.org/pipermail/dev/2016-July/076075.html Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: William Tu <u9012063@gmail.com>
Andy Zhou [Tue, 26 Jul 2016 02:22:03 +0000 (19:22 -0700)]
ovsdb: Properly close replication rpc connection
This patch removes rpc related memory leak reported below.
Reported-at: http://openvswitch.org/pipermail/dev/2016-July/076075.html Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: William Tu <u9012063@gmail.com>
ovs-router: Ignore IPv6 source addresses for IPv4 routes.
Though this should not happen when we have another address on the device that is
IPv4 mapped, we should prevent adding a routing entry to IPv4 with an IPv6
source address.
This entry has been observed when the addresses list was out of date.
Cached: 172.16.10.1/32 dev br3 SRC fe80::c4d0:14ff:feb1:b54b
Cached: 172.16.10.0/24 dev br3 SRC fe80::c4d0:14ff:feb1:b54b
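A sketch of the kind of check this implies, using Python's ipaddress module and plain textual addresses. This is a simplification: the helper name is hypothetical, and it does not model how ovs-router represents IPv4-mapped addresses internally.

```python
import ipaddress


def src_matches_route(route_prefix, src_addr):
    """Return True if 'src_addr' has the same IP version as the route."""
    route = ipaddress.ip_network(route_prefix)
    src = ipaddress.ip_address(src_addr)
    return route.version == src.version


# The entries shown above would be rejected by such a check.
print(src_matches_route("172.16.10.0/24", "fe80::c4d0:14ff:feb1:b54b"))
```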
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
This commit allows the ovs-ctl command to spawn daemons without the
internal process monitor. This is useful when integrating with, e.g.,
systemd, which provides its own monitoring facilities.
New 'other_config:pmd-rxq-affinity' field for Interface table to
perform manual pinning of RX queues to desired cores.
This functionality is required to achieve maximum performance because
different kinds of ports have different costs of rx/tx operations and
only the user knows the expected workload on different ports.
Example:
# ./bin/ovs-vsctl set interface dpdk0 options:n_rxq=4 \
other_config:pmd-rxq-affinity="0:3,1:7,3:8"
Queue #0 pinned to core 3;
Queue #1 pinned to core 7;
Queue #2 not pinned.
Queue #3 pinned to core 8;
Cores that have an rxq explicitly assigned to them are automatically
isolated, because it's useful to keep a constant polling rate on
some performance-critical ports while adding/deleting other ports
without explicit pinning of all ports.
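To make the format of the example concrete, the following sketch parses an affinity string of the form "queue:core,queue:core,..." and derives the set of isolated cores. The function names are hypothetical and error handling is omitted; this is not the dpif-netdev parser.

```python
def parse_rxq_affinity(spec, n_rxq):
    """Map each rx queue id to its pinned core, or None if not pinned."""
    pinned = {}
    for item in spec.split(","):
        queue, core = item.split(":")
        pinned[int(queue)] = int(core)
    return {queue: pinned.get(queue) for queue in range(n_rxq)}


def isolated_cores(affinity):
    """Cores with an explicitly assigned rxq, which become isolated."""
    return {core for core in affinity.values() if core is not None}


affinity = parse_rxq_affinity("0:3,1:7,3:8", n_rxq=4)
print(affinity)                  # {0: 3, 1: 7, 2: None, 3: 8}
print(sorted(isolated_cores(affinity)))  # [3, 7, 8]
```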
Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
bridge: Pass interface's configuration to datapath.
This commit adds functionality to pass value of 'other_config' column
of 'Interface' table to datapath.
This may be used to pass options that are not directly connected with the netdev
and to configure the behaviour of the datapath for different ports.
For example: pinning of rx queues to polling threads in dpif-netdev.
Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Acked-by: Daniele Di Proietto <diproiettod@vmware.com>
If the number of CPUs in pmd-cpu-mask is not divisible by the number of queues, and
in a few more complex situations, there may be an unfair distribution of TX
queue-ids between PMD threads.
For example, if we have 2 ports with 4 queues and 6 CPUs in pmd-cpu-mask,
the following distribution is possible:
<------------------------------------------------------------------------>
pmd thread numa_id 0 core_id 13:
port: vhost-user1 queue-id: 1
port: dpdk0 queue-id: 3
pmd thread numa_id 0 core_id 14:
port: vhost-user1 queue-id: 2
pmd thread numa_id 0 core_id 16:
port: dpdk0 queue-id: 0
pmd thread numa_id 0 core_id 17:
port: dpdk0 queue-id: 1
pmd thread numa_id 0 core_id 12:
port: vhost-user1 queue-id: 0
port: dpdk0 queue-id: 2
pmd thread numa_id 0 core_id 15:
port: vhost-user1 queue-id: 3
<------------------------------------------------------------------------>
As we can see above, the dpdk0 port is polled by threads on cores
12, 13, 16 and 17.
By design of dpif-netdev, there is only one TX queue-id assigned to each
pmd thread. These queue-ids are sequential, similar to core-ids, and a
thread will send packets to the queue with exactly this queue-id regardless
of port.
In the previous example:
pmd thread on core 12 will send packets to tx queue 0
pmd thread on core 13 will send packets to tx queue 1
...
pmd thread on core 17 will send packets to tx queue 5
So, for dpdk0 port after truncating in netdev-dpdk:
To fix this issue, a kind of XPS is implemented in the following way:
* TX queue-ids are allocated dynamically.
* When a PMD thread first tries to send packets to a new port,
it allocates the least used TX queue for this port.
* PMD threads periodically perform revalidation of
allocated TX queue-ids. If a queue wasn't used in the last
XPS_TIMEOUT_MS milliseconds, it will be freed during revalidation.
* XPS is not used if we have enough TX queues.
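The allocation steps listed above can be modelled per port roughly as follows. This is a sketch under assumptions, not the dpif-netdev implementation: the class name, the per-thread bookkeeping and the timeout handling are invented for illustration.

```python
class TxQueueAllocator:
    """Dynamic TX queue-id allocation for one port."""

    def __init__(self, n_txq, timeout_ms):
        self.users = [0] * n_txq  # number of threads using each queue
        self.timeout_ms = timeout_ms
        self.assigned = {}        # thread -> (queue id, last used time)

    def queue_for(self, thread, now_ms):
        entry = self.assigned.get(thread)
        if entry is None:
            # First send to this port: take the least used TX queue.
            qid = min(range(len(self.users)), key=self.users.__getitem__)
            self.users[qid] += 1
        else:
            qid = entry[0]
        self.assigned[thread] = (qid, now_ms)
        return qid

    def revalidate(self, now_ms):
        # Free queue-ids that were not used within the timeout.
        for thread, (qid, last_used) in list(self.assigned.items()):
            if now_ms - last_used >= self.timeout_ms:
                self.users[qid] -= 1
                del self.assigned[thread]
```

In this model, the four threads on cores 12, 13, 16 and 17 that send to a 4-queue port each end up with a distinct queue, instead of being tied to the fixed per-thread ids 0, 1, 4 and 5.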
Reported-by: Zhihong Wang <zhihong.wang@intel.com> Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Signed-off-by: Daniele Di Proietto <diproiettod@vmware.com>
Ben Pfaff [Fri, 22 Jul 2016 19:39:44 +0000 (12:39 -0700)]
release-process.md: Document OVS release process and propose a schedule.
This document has two different kinds of text:
- The first sections of the document, "Release Strategy" and "Release
Numbering", describe what we've already been doing for most of the
history of Open vSwitch. If there is anything surprising in them,
then it's because our process has not been transparent enough, and not
because we're making a change.
- The final section of the document, "Release Scheduling", is a proposal
for current and future releases. We have not had a regular release
schedule in the past, but it seems important to have one in the
future, so this section requires review and feedback from everyone in
the community.
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Russell Bryant <russell@ovn.org> Acked-by: Ryan Moats <rmoats@us.ibm.com>
Ben Pfaff [Sun, 24 Jul 2016 20:14:59 +0000 (13:14 -0700)]
ovn: Make it possible for CMS to detect when the OVN system is up-to-date.
Until now, there has been no reliable way for the CMS (or ovn-nbctl, or
anything else) to detect when changes made to the northbound configuration
have been passed through to the southbound database or to the hypervisors.
This commit adds this feature to the system, by adding sequence numbers
to the northbound and southbound databases and adding code in ovn-nbctl,
ovn-northd, and ovn-controller to keep those sequence numbers up-to-date.
The biggest user-visible change from this commit is a new option
--wait to ovn-nbctl. With --wait=sb, ovn-nbctl now waits for ovn-northd
to update the southbound database; with --wait=hv, it waits for the
changes to make their way to Open vSwitch on every hypervisor.
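Conceptually, the wait reduces to comparing sequence numbers. The sketch below is only a model of that comparison: the function and parameter names are hypothetical and do not correspond to actual database columns, and how each component publishes its number is not shown.

```python
def config_applied(nb_seq, sb_seq, hv_seqs, wait):
    """Decide whether a northbound change has propagated far enough.

    nb_seq:  sequence number of the northbound change being waited for
    sb_seq:  highest northbound sequence number reflected southbound
    hv_seqs: per-hypervisor sequence numbers already applied
    wait:    "sb" or "hv", as in the --wait option
    """
    if wait == "sb":
        return sb_seq >= nb_seq
    if wait == "hv":
        # Every hypervisor must have caught up.
        return all(seq >= nb_seq for seq in hv_seqs)
    raise ValueError("unknown wait mode: %s" % wait)
```

A client would bump the northbound number with its change and then poll (or monitor) until the function returns True for the requested mode.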
Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Russell Bryant <russell@ovn.org>
OVS compat layer can handle tunnel GSO packets, but it does
keep skb encapsulation on for packets handled in GSO. This can
confuse some NIC drivers. I have seen this issue on intel devices:
In upstream linux kernel networking stack udp_set_csum() is called
with only udp header applied but in case of compat layer it can
be called with IP header. So the following patch takes the offset into
account.
Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Jesse Gross <jesse@kernel.org>
ovn-northd: Add logical flows to support native DHCPv4
OVN implements native DHCPv4 support which caters to the common
use case of providing an IP address to a booting instance by
providing stateless replies to DHCPv4 requests based on statically
configured address mappings. To do this it allows a short list of
DHCPv4 options to be configured and applied at each compute host
running ovn-controller.
A new table 'DHCP_Options' is added in OVN NB DB to store the DHCP
options. Logical ports refer to this table to configure the DHCPv4
options.
For each logical port configured with DHCPv4 options, the following flows
are added:
- A logical flow which copies the DHCPv4 options to the DHCPv4
request packets using the 'put_dhcp_opts' action and advances the
packet to the next stage.
- A logical flow which implements the DHCP responder by sending
the DHCPv4 reply back to the inport once the 'put_dhcp_opts' action
is applied.
Signed-off-by: Numan Siddique <nusiddiq@redhat.com> Co-authored-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org> Tested-by: Ramu Ramamurthy <ramu.ramamurthy@us.ibm.com> Acked-by: Ramu Ramamurthy <ramu.ramamurthy@us.ibm.com>
Joe Stringer [Mon, 25 Jul 2016 21:09:26 +0000 (14:09 -0700)]
rhel/openvswitch.spec: Add SELinux policy.
Commit 9b897c9125ef ("rhel: provide our own SELinux custom policy
package") added the SELinux policy to the fedora packaging as a
subpackage. This patch makes the corresponding change to
openvswitch.spec, so that users of that specfile can generate the
selinux policy package without having to build all of the fedora
packages.
VMware-BZ: #1692972 Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Flavio Leitner <fbl@sysclose.org>
Joe Stringer [Fri, 22 Jul 2016 21:10:51 +0000 (14:10 -0700)]
selinux: Allow ovs-ctl force-reload-kmod.
When invoking ovs-ctl force-reload-kmod via '/etc/init.d/openvswitch
force-reload-kmod', spurious errors would be output related to 'hostname'
and 'ip', and the system's selinux audit log would complain about some
of the invocations such as those listed at the end of this commit message.
This patch loosens restrictions for openvswitch_t (used for ovs-ctl, as
well as all of the OVS daemons) to allow it to execute 'hostname' and
'ip' commands, and also to execute temporary files created as
openvswitch_tmp_t. This allows force-reload-kmod to run correctly.
Example audit logs:
type=AVC msg=audit(1468515192.912:16720): avc: denied { getattr } for
pid=11687 comm="ovs-ctl" path="/usr/bin/hostname" dev="dm-1"
ino=33557805 scontext=system_u:system_r:openvswitch_t:s0
tcontext=system_u:object_r:hostname_exec_t:s0 tclass=file
Prevents the cloning of rows with outgoing or incoming weak references when
those rows aren't being modified.
It improves the OVSDB Server performance when many rows with weak references
are involved in a transaction.
In the original code (dst_refs is created from scratch):
old->dst_refs = all the rows that weak referenced old
new->dst_refs = all the rows that weak referenced old and are still weak
    referencing new + rows in the transaction that weak referenced new
In the patch (dst_refs incrementally built):
old->dst_refs = all the rows that weak referenced old
Ideally, but expensive to calculate:
new->dst_refs = old->dst_refs - "weak references removed within this TXN"
    + "weak references created within this TXN"
What this patch implements:
new->dst_refs = old->dst_refs - "weak references in old rows in TXN"
    + "weak references in new rows in TXN"
The resulting sets should be equal in both cases.
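The equality claimed above can be checked with plain sets. This is only a minimal sketch, not the OVSDB implementation: the function names and row names are hypothetical, and it assumes that every row in the transaction whose old version weak referenced the target was already present in old->dst_refs.

```python
# Hypothetical model: each set holds the names of rows that weak reference
# one target row. txn_old / txn_new are the rows touched by the transaction
# whose old / new versions reference the target.

def ideal_dst_refs(old_refs, txn_old, txn_new):
    """old - "references removed in TXN" + "references created in TXN"."""
    removed = txn_old - txn_new
    created = txn_new - txn_old
    return (old_refs - removed) | created


def patched_dst_refs(old_refs, txn_old, txn_new):
    """old - "references in old rows in TXN" + "references in new rows in TXN"."""
    return (old_refs - txn_old) | txn_new


old_refs = {"a", "b", "c"}   # rows referencing the target before the TXN
txn_old = {"a", "b"}         # assumed to be a subset of old_refs
txn_new = {"b", "d"}         # "a" drops its reference, "d" gains one

assert ideal_dst_refs(old_refs, txn_old, txn_new) == \
    patched_dst_refs(old_refs, txn_old, txn_new)
```

Row "b" is removed and then re-added by the patched form, which is why the cheaper computation gives the same set without having to work out which references actually changed.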
We do some more optimizations:
- If we know that the transactions must be successful at some point then,
instead of cloning dst_refs we could just move the elements between
the lists.
- At that point we lose the rollback feature, but we aren't going to need
it anyway (note that we didn't really touch the src_refs part).
- The references in dst_refs must point to new instead of old.
Previously we iterated over all the weak references in dst_refs
to change that pointer, but using a UUID is easier, and avoids
that iteration completely.
For some more commentary, see:
http://openvswitch.org/pipermail/dev/2016-July/074840.html
Signed-off-by: Esteban Rodriguez Betancourt <estebarb@hpe.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
Ben Pfaff [Fri, 22 Jul 2016 23:43:50 +0000 (16:43 -0700)]
flow: Verify that tot_len >= ip_len in miniflow_extract().
miniflow_extract() uses the following quantities when it examines an IPv4
header:
size, the number of bytes from the start of the IPv4 header onward
ip_len, the number of bytes in the IPv4 header (from the IHL field)
tot_len, same as size but taken from IPv4 header Total Length field
Until now, the code in miniflow_extract() verified these invariants:
size >= 20 (minimum IP header length)
ip_len >= 20 (ditto)
ip_len <= size (to avoid reading past end of packet)
tot_len <= size (ditto)
size - tot_len <= 255 (because this is stored in a 1-byte variable
internally and wouldn't normally be big)
It failed to verify the following, which is not implied by the conjunction
of the above:
ip_len <= tot_len (i.e. that the IP header fits in the packet)
This means that the code was willing to read past the end of an IP
packet's declared length, up to the actual end of the packet including any
L2 padding. For example, given:
size = 44
ip_len = 44
tot_len = 40
miniflow_extract() would successfully verify all the constraints, then:
* Trim off 4 bytes of tail padding (size - tot_len), reducing size to
40 to match tot_len.
* Pull 44 (ip_len) bytes of IP header, even though there are only 40
bytes left. This causes 'size' to wrap around to SIZE_MAX-3.
Given an IP protocol that OVS understands (such as TCP or UDP), this
integer wraparound could cause OVS to read past the end of the packet.
In turn, this could cause OVS to extract invalid port numbers, TCP flags,
or ICMPv4 or ICMPv6 or IGMP type and code from arbitrary heap data
past the end of a packet.
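The example above can be modelled in a few lines. This is a minimal sketch, not the code in miniflow_extract(): the function names are hypothetical, and the unsigned arithmetic assumes a 64-bit size_t.

```python
MIN_IP_HEADER = 20       # minimum IPv4 header length in bytes
SIZE_MAX = 2**64 - 1     # assumption: 64-bit size_t


def old_checks(size, ip_len, tot_len):
    """The invariants that were verified before this commit."""
    return (size >= MIN_IP_HEADER
            and ip_len >= MIN_IP_HEADER
            and ip_len <= size
            and tot_len <= size
            and size - tot_len <= 255)


def new_checks(size, ip_len, tot_len):
    """The old invariants plus the missing one: ip_len <= tot_len."""
    return old_checks(size, ip_len, tot_len) and ip_len <= tot_len


def size_after_pull(ip_len, tot_len):
    """Trim 'size' down to tot_len, then pull ip_len bytes, as unsigned."""
    return (tot_len - ip_len) % (SIZE_MAX + 1)


print(old_checks(44, 44, 40))            # True: the bad packet is accepted
print(new_checks(44, 44, 40))            # False: rejected by the added check
print(size_after_pull(44, 40) > 2**63)   # True: 'size' wrapped around
```

A well-formed packet such as size = 44, ip_len = 20, tot_len = 40 passes both sets of checks, so the added comparison only rejects headers that claim to be longer than the packet's declared total length.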
This bug has common hallmarks of a security vulnerability, but we do not
know of a way to exploit this bug to cause an Open vSwitch crash, or to
extract sensitive data from Open vSwitch address space to an attacker's
benefit.
We do not have a specific example, but it is reasonable to suspect that
this bug could allow an attacker in some circumstances to bypass ACLs
implemented via Open vSwitch flow tables. However, any IP packet that
triggers this bug is invalid and should be rejected in an early stage of a
receiver's IP stack. For the same reason, any IP packet that triggers this
bug will also be dropped by any IP router, so an attacker would have to
share the same L2 segment as the victim. In conjunction with an IP stack
that has a similar bug, of course, this could cause some damage, but we do
not know of an IP stack with such a bug; neither Linux nor the OVS
userspace tunnel implementation appears to have such a bug.
Terry Wilson [Tue, 26 Jul 2016 00:17:11 +0000 (19:17 -0500)]
python: Serialize JSON via Python's json lib.
There is no particularly good reason to use our own Python JSON
serialization implementation when serialization can be done faster
with Python's built-in JSON library.
A few tests were changed due to Python's default JSON library
returning slightly more precise floating point numbers.
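For illustration only, serialization with the standard json module looks roughly like this. The sample record is hypothetical and this sketch does not show the ovs.json API or the options this commit actually passes.

```python
import json

# Hypothetical record; any JSON-compatible Python value works the same way.
record = {"name": "br0", "ports": [1, 2], "ratio": 0.1}

# sort_keys gives a stable key order, which keeps test output deterministic.
text = json.dumps(record, sort_keys=True)
print(text)

# The serialized text parses back to an equal value.
assert json.loads(text) == record
```

In Python 3 the standard library writes a float using its repr, which is the kind of difference in printed digits that can change expected test output.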
Signed-off-by: Terry Wilson <twilson@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
This compatibility code was only needed for Linux 2.6.36 and older. With the
support for versions older than 3.10 dropped, this code is not needed anymore.
The style for checking for mpls was kept in case some other protocol type is
added in the future.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Acked-by: Pravin B Shelar <pshelar@ovn.org>