Ben Pfaff [Sat, 10 Aug 2013 04:29:03 +0000 (21:29 -0700)]
netdev-linux: Use dedicated netlink notification socket.
The rtnetlink_link asynchronous netlink notifications seem somewhat
troublesome in a threaded environment. It seems more straightforward
to have netdev-linux fend for itself.
Ben Pfaff [Wed, 31 Jul 2013 21:09:30 +0000 (14:09 -0700)]
netdev-vport: Make netdev_vport_patch_peer() return a malloc()'d string.
When threading comes into the picture there arises the possibility of a
race between netdev_vport_patch_peer()'s caller using the returned string
and another caller changing the peer. It is safer to return a copy.
This is the same lifecycle used in the ofproto provider interface.
Compared to the previous netdev provider interface, it has the
advantage that the netdev top layer can control when any given
netdev becomes visible to the outside world.
Ben Pfaff [Sat, 10 Aug 2013 04:14:23 +0000 (21:14 -0700)]
netdev-linux, netdev-bsd: Make access to AF_INET socket thread-safe.
The only uses of 'af_inet_sock', in both drivers, were ioctls, so it seemed
like a good abstraction to write a function that just does such an ioctl,
and to factor out shared code into socket-util.
Signed-off-by: Ben Pfaff <blp@nicira.com> CC: Ed Maste <emaste@freebsd.org>
Ben Pfaff [Thu, 25 Jul 2013 23:11:52 +0000 (16:11 -0700)]
netdev-dummy: Use netdev_get_devices() instead of a local shash.
When an upcoming commit introduces thread safety into the netdev API, this
allows netdev-dummy to avoid adding more internal locking by taking
advantage of netdev_get_devices() refcounting.
Jesse Gross [Fri, 9 Aug 2013 22:27:27 +0000 (15:27 -0700)]
datapath: Fix typo in flow validation logic.
A bit shift operation is using the value '11' instead of '1' as the
starting value. This only makes validation weaker than it should be
so unless userspace is trying to install an invalid flow there will
be no effect.
ovs-bugtool: Collect database through CAP_NETWORK_STATUS.
Currently the openvswitch database is being collected with
CAP_NETWORK_CONFIG which has a max size of 50 KB. This is
quite low as the database can easily be larger than 50 KB.
Move database collection to CAP_NETWORK_STATUS which does
not have a max size. If database size exceeds 10 MB, create
a compacted version of it and then collect it.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Wed, 7 Aug 2013 03:35:29 +0000 (20:35 -0700)]
ofproto-dpif-xlate: Avoid MAC learning write lock on fast path.
Most of the trips through update_learning_table() do not actually change
the MAC learning table, but because some do the code there took the
MAC learning table's write lock. This commit changes the common case to
take only the read lock, falling back to the write lock if a change was
actually necessary.
Ethan reported that this gave a 3.3x performance improvement in one test
case due to reduced lock contention.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Ethan Jackson <ethan@nicira.com>
Jesse Gross [Thu, 8 Aug 2013 02:47:59 +0000 (19:47 -0700)]
datapath: Add 'ovs_' prefix to extern symbols.
The external symbols in the OVS kernel module are prefixed with
'ovs_' with the exception of ipv4_tun_to/from_nlattr(). This adds
the prefix and makes the out of tree version consistent with
upstream.
Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
Ben Pfaff [Tue, 6 Aug 2013 21:57:19 +0000 (14:57 -0700)]
ovs-thread: New function ovsthread_id_self().
I foresee a need for possibly large numbers of instances of "struct
seq" (which is introduced in an upcoming patch). Each struct seq
needs some per-thread data. POSIX has pthread_key_t for this, but
the number of keys can be fairly limited, to as few as 128. It is
reasonable to work around this by using a hash table indexed on the
current thread. That only works if one can get a thread identifier
that is hashable (pthread_t is not). This patch introduces a
hashable thread identifier.
Ben Pfaff [Tue, 6 Aug 2013 21:40:25 +0000 (14:40 -0700)]
ovs-thread: Add support for globally visible per-thread data.
DEFINE_PER_THREAD_DATA always declared its data item as "static", meaning
that it was only directly visible within a single translation unit.
This commit adds additional forms of per-thread data that allow the data
to be accessible from multiple translation units.
Ben Pfaff [Fri, 26 Jul 2013 00:05:46 +0000 (17:05 -0700)]
netdev: Make netdev_from_name() take a reference to its returned netdev.
This API change is necessary for thread safety, to be added in an upcoming
commit. Otherwise, the client would not be able to safely use the returned
netdev because it could already have been destroyed.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
Ben Pfaff [Thu, 25 Jul 2013 23:27:39 +0000 (16:27 -0700)]
netdev: Make netdev_get_devices() take a reference to each netdev.
This API change is necessary for thread safety, to be added in an upcoming
commit. Otherwise, the client would not be able to actually use any of
the returned netdevs because they could already have been destroyed.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
Ben Pfaff [Fri, 2 Aug 2013 19:19:49 +0000 (12:19 -0700)]
netdev-bsd: Don't assume 'struct netdev' has offset 0.
The data items returned by netdev_get_devices() are "struct netdev *"s.
The code fixed up by this commit used them as "struct netdev_bsd *",
which happens to work because struct netdev happens to be at offset 0 in
each struct but it's better to do a proper cast in case someday
struct netdev gets moved to a nonzero offset.
Ben Pfaff [Wed, 31 Jul 2013 22:22:12 +0000 (15:22 -0700)]
netdev-bsd: Correctly handle IPv4 netmasks.
netdev_bsd_get_in4() did not set anything in its 'netmask' output argument
if the IPv4 address was cached, leaving it indeterminate. It would also
mark the cache as valid even if there was an error retrieving the netmask.
This fixes both problems.
Found by inspection.
Signed-off-by: Ben Pfaff <blp@nicira.com> CC: Ed Maste <emaste@freebsd.org>
Simon Horman [Wed, 7 Aug 2013 00:28:00 +0000 (09:28 +0900)]
Update OPENFLOW-1.1+ to differentiate optional and required features
The purpose of this patch is primarily to provide details on which
unimplemented features are optional and which are required as this
may be of interest to those working on OpenFlow 1.1+ coverage.
This patch also:
* Clarifies the text of some entries which seemed difficult to understand
for the authors of this patch.
* Adds entries for features that were missing from the existing list.
N.B: It is entirely possible that there are still missing entries.
* Expands some entries into sub-entries where some portions of
a feature are required and others are optional
Co-authored-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Ben Pfaff <blp@nicira.com>
Andy Zhou [Sat, 3 Aug 2013 19:23:15 +0000 (12:23 -0700)]
odp-util: Always export the priority and skb_mark netlink attributes.
The current Netlink protocol allows a default value of zero if either mark
or priority is not specified (this is part of the ABI). Until now, when
userspace serializes either the value or mask, it looked at the value and
omitted the netlink attribute if it is zero. This is a bug because an
exact match on zero turns into a wildcard of the field.
These two fields (plus input port and EtherType) are special because they
can be omitted whereas most other values are required to be fully
specified. These protocol variations tend to cause bugs (as above) when we
evolve the protocol because an exception that makes sense in one context
might not be logical in another. Since the default value for mark and
priority are merely shorthands, we can push the protocol in a more
consistent direction by ignoring the shortcut and always serializing the
values. This is what this commits does.
Signed-off-by: Andy Zhou <azhou@nicira.com>
[blp@nicira.com added Jesse's text to the commit message] Signed-off-by: Ben Pfaff <blp@nicira.com>
Paul Ingram [Sat, 3 Aug 2013 07:12:36 +0000 (07:12 +0000)]
cfm: update remote opstate only when a CCM is received.
The remote opstate for a CFM interface is presumed to be up unless a CCM is
received which signals opstate down. This means than an interface configured
for CFM demand mode may incorrectly appear to be opstate up if it has not
received a CCM within the last fault interval.
We should remember the last remote opstate for a CFM interface and only
change it when a CCM arrives signaling a change.
Bug #18806 Signed-off-by: Paul Ingram <pingram@nicira.com> Signed-off-by: Ethan Jackson <ethan@nicira.com> Acked-by: Ethan Jackson <ethan@nicira.com>
The current situation is that whenever any packet enters the
userspace, bfd_should_process_flow() looks at the UDP destination
port to figure out whether that is a BFD packet. This means that
UDP destination port cannot be wildcarded for all the other flows
too.
To optimize BFD for megaflows, we introduce a new
'bfd:bfd_dst_mac' field in the database. Whenever this field is set
by a controller, it is assumed that all the BFD packets to/from
this interface will have the destination mac address set as the one
specified in the bfd:bfd_dst_mac field. If this field is set, we
first look at the destination mac address of a packet and if it
does not match the mac address set in bfd:bfd_dst_mac, we do not
process that packet as bfd. If the field does match, we go ahead
and look at the UDP destination port too.
Also, change the default BFD destination mac address to
"00:23:20:00:00:01".
Feature #18850. Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ethan Jackson <ethan@nicira.com>
Ethan Jackson [Fri, 2 Aug 2013 19:43:03 +0000 (12:43 -0700)]
ofproto-dpif-xlate: Take responsibility for ofproto_receive().
ofproto_receive() is a slightly odd function which doesn't fit
perfectly in either ofproto-dpif or ofproto-dpif-xlate. However, it's
much easier to reason about its thread safety in ofproto-dpif-xlate,
so this patch moves it there and renames it xlate_receive().
Signed-off-by: Ethan Jackson <ethan@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Andy Zhou [Sat, 3 Aug 2013 19:23:14 +0000 (12:23 -0700)]
odp-util: add verbose mode for displaying dp flow.
When verbose mode tuned on, all dp flow fields described by the netlink
attributes are displayed, including fully wildcarded attributes.
Otherwise, the fully wildcarded attributes are omitted for brevity.
Added -m option to "ovs-dpctl dump-flows" to enable verbose mode. It is
off by default.
Signed-off-by: Andy Zhou <azhou@nicira.com>
[blp@nicira.com added documentation] Signed-off-by: Ben Pfaff <blp@nicira.com>
Ethan Jackson [Fri, 2 Aug 2013 03:52:01 +0000 (20:52 -0700)]
ofproto-dpif-xlate: Don't trace on deep resubmit.
While this code is useful for debugging, removing it allows us to hide
ofproto_trace() in ofproto-dpif. ofproto_trace() is a complex function
which could be difficult to make "obviously" thread safe.
Signed-off-by: Ethan Jackson <ethan@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Andy Zhou [Sat, 3 Aug 2013 03:22:17 +0000 (20:22 -0700)]
ofproto-dpif: avoid losing track of kernel flows upon reinstallation
This commit fixes a problem whereby userspace can lose track of a
flow installed in the kernel, instead believing that the flow is
not installed. The most visible consequence of this bug was a
message in the ovs-vswitchd log warning about an unexpected flow
in the kernel. Other possible consequences included loss of
statistics and failure to updates actions when the OpenFlow flow
table changed.
The problem arose in the following scenario. Suppose userspace
sets up a kernel flow due to an arriving packet. Before kernel
flow setup completes, another packet for that flow arrives. The
kernel sends the new packet to userspace after userspace has
completed processing the batch of packets that set up the flow.
Userspace then attempts to reinstall the kernel flow. This fails
with EEXIST, so userspace then marked the flow as not-installed,
even though it was successfully installed before and remains
installed. The next time userspace dumped the kernel flow
table to gather statistics, it would complain about an unexpected
flow and delete it.
In practice, we have seen these messages with netperf TCP_CRR tests and
UDP stream tests.
This patch fixes the problem by changing userspace so that, once
it successfully installs a flow in the kernel, it will not reinstall
it when it sees another packet for the flow in userspace. This
has the downside that, if something goes wrong and a flow
disappears from the kernel (e.g. ovs-dpctl del-flows), then userspace
won't reinstall it (until it tries to delete it). (This is in fact
the reason why until now userspace reinstalled flows it knew it
already installed.)
Some more background may be warranted. There are two EEXIST error
cases:
1. A subfacet was installed successfully in a previous (recent)
batch. Now we've attempted to reinstall exactly the same
subfacet in this batch.
2. A subfacet was installed successfully in a previous (recent)
batch or earlier in the current batch. We've attempted to
install a subfacet for an overlapping megaflow.
Before megaflows, installation errors were ignored completely.
Since megaflows were introduced, they have been handled by
considering on any installation error that the given subfacet is
not installed. This works well for case #2 but causes case #1 to
yield unexpected flows, as described at the top of the commit
message.
This commit adds the wrinkle that we never try to reinstall
exactly the same subfacet that we know we installed successfully
earlier (and haven't deleted) unless its actions change. This
ought to work just as well for case #2, and avoids the problem
with case #1.
Prepared with assistance from Ethan.
Signed-off-by: Andy Zhou <azhou@nicira.com>
[blp@nicira.com rewrote the commit message] Signed-off-by: Ben Pfaff <blp@nicira.com>
Justin Pettit [Sat, 3 Aug 2013 04:17:31 +0000 (21:17 -0700)]
ofproto-dpif: Always un-wildcard fields that are being set.
The ODP library has an optimization to not set a header if the field was
not changed, regardless of whether an action to set the field was
present. That library is also responsible for un-wildcarding fields
that are bieng modified. This leads to a problem where a packet matches
a flow that updates a field, but that particular packet's field already
has that value. As such, an overly loose megaflow will be generated
that doesn't match on that field and the actions won't update it. A
second packet that should have the field set will match that flow and
will not be modified.
This commit changes the behavior to always un-wildcard fields that are
being modified. Since the ODP library updates the entire header if a
field in it is modified, and all those fields will be un-wildcarded, the
generated flows may be different. However, they should be correct.
Ben Pfaff [Sat, 3 Aug 2013 00:32:25 +0000 (17:32 -0700)]
async-append: Refactor to avoid requiring enabling while single threaded.
Until now, the async append interface has required async_append_enable()
to be called while the process was still single-threaded, with the
rationale being that async_append_enable() could race with
async_append_write() on some existing async_append object. This was a
difficult problem when the async append interface was introduced, because
at the time Open vSwitch did not have any infrastructure for inter-thread
synchronization.
Now it is easy to solve, by introducing synchronization into the
async append module. However, that's more or less wasted, because the
client is already required to serialize access to async append objects.
Moreover, vlog, the only existing client, needs to serialize access for
other reasons, so it wouldn't even be possible to just drop the client's
synchronization.
This commit therefore takes another approach. It drops the
async_append_enable() interface entirely. Now any existing async_append
object is always enabled. The responsibility for "enabling", then, now
rests in whether the client creates and uses an async_append object, and
so vlog now takes care of that by itself. Also, since vlog now has to
deal with sometimes having an async_append and sometimes not having one,
we might as well allow creating an async_append to fail, thereby slightly
simplifying the "no async I/O" implementation from "write synchronously"
to "always fail creating an async_append".
Reported-by: Shih-Hao Li <shihli@nicira.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
Once we have multiple threads running, having each execute flow mods
created by the learn action won't be tenable. It essentially will
require us to make the core ofproto module thread safe, which is not
the direction we want to go. This patch punts on the problem by
handing flow mods to ofproto-dpif to handle later.
Signed-off-by: Ethan Jackson <ethan@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Ethan Jackson [Fri, 2 Aug 2013 00:07:08 +0000 (17:07 -0700)]
tag: Retire the venerable tag library.
This patch retires a venerable library whose inception dates before
the first patch of the current repository: tags. They have served us
well, but their time has come for the reasons listed below.
1) They don't actually help much.
In theory, tags had been used to reduce revalidation necessary when
using bonds, mac-learning, and frequently changing flow tables. With
bonds and mac-learning, things change happen so rarely that tagging
isn't worth it. That leaves flow table changes. With the complex flow
tables in my testing, the revalidate_set gets so overwhelmed with
tags, that we end up revalidating every facet every time through the
run loop. In other words, they tags are giving us no benefit.
2) They complicate the code.
This patch simplifies the code and removes a couple of rather ugly
kludges.
3) They complicated locking once threading hits.
Because of the calculate_flow_tag() function, the table_dpif structure
would require locking in a multi-threaded OVS. Though this problem
isn't insurmountable, it's annoying and probably would cause lock
contention.
Of course, we could try to work around these problems with a more
advanced tagging infrastructure, but this moves in the opposite of the
direction we should be. Ideally we'll have a more-or-less stateless
ofproto-dpif supporting a massive number of datapath flows. Tags (or
facets for that matter) aren't going to work in this new world.
Signed-off-by: Ethan Jackson <ethan@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Fri, 2 Aug 2013 19:19:49 +0000 (12:19 -0700)]
netdev-linux: Don't assume 'struct netdev' has offset 0.
The data items returned by netdev_get_devices() are "struct netdev *"s.
The code fixed up by this commit used them as "struct netdev_linux *",
which happens to work because struct netdev happens to be at offset 0 in
each struct but it's better to do a proper cast in case someday
struct netdev gets moved to a nonzero offset.
A tunnel value attribute is not allowed to have an empty IP destination
address but this is legal for masks. This drops both the checks for
serializing masks and also the sanity checks on them.
Signed-off-by: Jesse Gross <jesse@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
Jesse Gross [Thu, 1 Aug 2013 20:31:28 +0000 (13:31 -0700)]
ofproto-dpif-xlate: Unmask mark when used for tunnels.
The tunnel lookup uses the skb_mark as part of the port find process
but it isn't unmasked along with the other fields. This adds it to
the list of significant fields.
Ethan Jackson [Sat, 27 Jul 2013 19:24:15 +0000 (12:24 -0700)]
ofproto-dpif-xlate: Don't try to optimize goto table.
This patch reverts commit 5559942 (ofproto-dpif: GOTO_TABLE recursion
removal.) by reintroducing the recursion through xlate_table_action().
The main reason to do this is the introduction of new rule locking in
future patches. The code before this patch was relatively difficult
to lock in a clean straight-forward manner.
Andy Zhou [Thu, 1 Aug 2013 17:49:46 +0000 (10:49 -0700)]
datapath: Accept any 802.2 eth_type mask but override to be exact match
When key.eth_type is absent it is interpreted to be 802.2, which is
represented by a special value. In order to prevent inadvertant matches
on this opaque value, the mask is forced to be either fully wildcarded
or fully exact.
Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
Andy Zhou [Thu, 1 Aug 2013 17:49:45 +0000 (10:49 -0700)]
datapath: Accept any in_port mask but override to be exact match
Pre mega flow, netlink allows the in_port key attribute
to be missing. Missing in_port is interpreted as DP_MAX_PORTS.
For backward compatibility, mega flow implementation will always allow
the mask of in_port to be specified, as if the in_port key attribute
is always specified.
To prevent accidental match of the DP_MAX_PORTS, which value is opaque to
the user space, we will always force the mask to be exact match,
regardless of the value supplied by the netline message. Missing
in_port mask continue to mean wildcarded match, same as other masks.
Signed-off-by: Andy Zhou <azhou@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
Commit 97be153858b4cd175cbe7862b8e1624bf22ab98a (clang: Add
annotations for thread safety check.) defined 'struct ovs_mutex'
variable in 'atomic_flag' in 'ovs-atomic-pthreads.h'. This
casued "mutex: has incomplete type" error in compilation when
'ovs-atomic-pthreads.h' is included.
This commit goes back to use 'pthread_mutex_t' for that variable
and adds test for the 'atomic_flag' related functions.
Reported-by: Gurucharan Shetty <gshetty@nicira.com> Signed-off-by: Alex Wang <alexw@nicira.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
Ethan Jackson [Wed, 31 Jul 2013 18:29:28 +0000 (11:29 -0700)]
bfd: Downgrade long delay messages to INFO level.
On my system the long delay messages can cause a transient BFD unit
test failure due to the log checking. These messages don't *really*
need to be at WARN level, so this patch downgrades them.
Signed-off-by: Ethan Jackson <ethan@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>