Ben Pfaff [Tue, 29 Jul 2014 15:59:40 +0000 (08:59 -0700)]
netlink-socket: Add conceptual documentation.
Based on a conversation with the VMware Hyper-V team earlier today.
This commit also changes a couple of functions that were only used with
netlink-socket.c into static functions. I couldn't think of a reason for
code outside that file to use them.
Ben Pfaff [Mon, 28 Jul 2014 17:17:56 +0000 (10:17 -0700)]
connmgr: Demote service controllers too in ofconn_set_role().
Service controllers can set their roles, so it's necessary to demote them
to slaves if another controller becomes master. Unfortunately the
'controllers' hmap only contains primary controllers, so this was omitted.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Justin Pettit <jpettit@nicira.com>
Saurabh Shah [Mon, 28 Jul 2014 00:26:58 +0000 (17:26 -0700)]
datapath-windows: Kernel module for HyperV.
The kernel switch extension has support for bridged back forwarding & tunneling
over VXLAN. There is no Netlink integration as it is still being worked out.
Co-Authored-By: Ankur Sharma <ankursharma@vmware.com> Signed-off-by: Ankur Sharma <ankursharma@vmware.com> Co-Authored-By: Eitan Eliahu <eliahue@vmware.com> Signed-off-by: Eitan Eliahu <eliahue@vmware.com> Co-Authored-By: Guolin Yang <gyang@vmware.com> Signed-off-by: Guolin Yang <gyang@vmware.com> Co-Authored-By: Linda Sun <lsun@vmware.com> Signed-off-by: Linda Sun <lsun@vmware.com> Co-Authored-By: Nithin Raju <nithin@vmware.com> Signed-off-by: Nithin Raju <nithin@vmware.com> Signed-off-by: Saurabh Shah <ssaurabh@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Sat, 26 Jul 2014 19:19:03 +0000 (12:19 -0700)]
ofproto-dpif-upcall: Fix sparse warnings.
Fixes these warnings from "sparse":
../ofproto/ofproto-dpif-upcall.c:761:1: warning: symbol 'free_upcall' was
not declared. Should it be static?
../ofproto/ofproto-dpif-upcall.c:849:1: warning: symbol 'convert_upcall'
was not declared. Should it be static?
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Justin Pettit <jpettit@nicira.com>
Ben Pfaff [Thu, 15 May 2014 15:19:11 +0000 (08:19 -0700)]
meta-flow: Simplify handling of a variable number of registers.
At the time that Open vSwitch implemented registers, there was a high cost
to adding additional fields, so I wrote the code so that the number of
registers could be reduced at compile time. Now, fields are cheaper
(though not free) and in the meantime I have never heard of anyone reducing
the number of registers. Since I intend to add more code that would
require awkward "#if"s like this, I think that this is a good time to
simplify it by requiring FLOW_N_REGS to be fixed. This commit does that.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Typically, kernel datapath threads send upcalls to userspace where
handler threads process the upcalls. For TAP and DPDK devices, the
datapath threads operate in userspace, so there is no need for
separate handler threads.
This patch allows userspace datapath threads to directly call the
ofproto upcall functions, eliminating the need for handler threads
for datapaths of type 'netdev'.
Signed-off-by: Ryan Wilson <wryan@nicira.com> Signed-off-by: Ethan Jackson <ethan@nicira.com> Acked-by: Ethan Jackson <ethan@nicira.com>
There is race in datapath when last mask in mask array deleted can
result in NULL pointer dereference in datapath.
datapath lookup does not check mask pointer if its index is less than
mask-array count. That is safe because delete operation moves last valid
pointer to the deleted element. But this does not work if we are
deleting last element in array. Following patch adds NULL check for the
mask pointer.
This patch also avoids accessing ma->count without any locks.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
Ben Pfaff [Fri, 25 Jul 2014 16:19:17 +0000 (09:19 -0700)]
cfm: Reduce "long delay" message from WARN to INFO, to match BFD behavior.
These messages can cause the testsuite to fail on a busy build machine
since the testsuite treats WARN or ERR log messages as failures. BFD
uses an INFO message instead of WARN, so this just changes CFM to match.
Alternatively, the testsuite could ignore "long delay" messages (it ignores
some other categories of messages). In that case I'd expect that we'd
want to change BFD to match CFM since I don't know of a reason why they
should log differently.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Justin Pettit <jpettit@nicira.com>
Andy Zhou [Thu, 24 Jul 2014 09:17:48 +0000 (02:17 -0700)]
datapath: reorder action netlink attribute definition for upstreaming
Keeping the order of netlink attribute definition in the order of
upstreaming is the best way to keep all released user space program
forward compatible with upstreamed kernel modules.
Adjust action netlink attribute order to match with the current
upstreaming plan.
Recirc and hash actions are introduced in branch 2.3, which will be
fixed by the patch. The MPLS actions have been released since
branch-2.1 but there is no kernel implementation of them prior to
branch 2.3. Thus the ordering change should not affect them.
Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
datapath/flow_netlink: Avoid wildcarding tunnel key with disabled megaflows
If the userspace wants to match on a flow with some tunnel attributesset to 0,
it simply omits them in the netlink attributes stream.
Since our wildcarding logic (when megaflows are disabled) is based on the
attributes in the netlink stream, we set our mask incorrectly.
This commit adds a check to detect if the userspace wants to match on a tunnel,
in which case we simply unwildcard the whole tun_key
Reported-by: Andy Zhou <azhou@nicira.com> Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Signed-off-by: Andy Zhou <azhou@nicira.com>
Alex Wang [Wed, 23 Jul 2014 06:14:50 +0000 (23:14 -0700)]
datapath: flow_netlink: Fix a bug.
Commit 62974663fe (datapath/flow_netlink: Create right mask with
disabled megaflows) introduced the bug which caused
ovs_nla_get_match() returns immediately after parsing the flow
mask for OVS_KEY_ATTR_ENCAP. Consequently, when vlan encapsulated
packets are present, the corresponding datapath flows will have
incorrect mask like below. And the incorrect flows could affect
other non-vlan packets.
Ben Pfaff [Wed, 23 Jul 2014 04:06:44 +0000 (21:06 -0700)]
dpif-netdev: Initialize upcall->packet when queuing to userspace.
Commit db73f7166a6 (netdev-dpdk: Fix race condition with DPDK mempools in
non pmd threads) switched to a new way of setting up 'upcall->packet', but
only initialized two of the fields in the packet. This could cause
core dumps and other strange behavior. In particular it caused failures in
several unit tests on XenServer.
This commit fixes the problem by initializing the entire ofpbuf.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Justin Pettit <jpettit@nicira.com>
netlink-notifier: Exit loop if nl_sock_recv() returns an error
An error from nl_sock_recv() could mean that there are some issues with the
netlink socket (EBADF, ENOTSOCK, ...). Calling nl_sock_recv() in this case is
harmful: nln_run() will never return and since we are calling it from the main
thread, vswitchd becomes unresponsive.
Also, with this commit we avoid calling the notifier callback in case of error
(except for ENOBUFS, which means that there could be too many notifications)
Suggested-by: Alex Wang <alexw@nicira.com> Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
Ben Pfaff [Tue, 22 Jul 2014 23:18:35 +0000 (16:18 -0700)]
configure: Don't check for malloc hooks that we no longer use.
Commit 825da1c6d1c7 (leak-checker: Remove because it cannot be made
thread-safe.) removed the only uses of these hooks but neglected to remove
the test for them.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Justin Pettit <jpettit@nicira.com>
Ben Pfaff [Tue, 22 Jul 2014 23:09:26 +0000 (16:09 -0700)]
ovs-ctl: Add comment to explain why we only save ofports for pre-1.10.
We've had a couple of questions about this lately, including one suggestion
that we should always save the OpenFlow port numbers. This explains why
the behavior is as it is.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Justin Pettit <jpettit@nicira.com>
Ben Pfaff [Tue, 22 Jul 2014 18:50:37 +0000 (11:50 -0700)]
debian: Avoid -Wformat-zero-length warnings.
Debian puts an extra "-Wformat" in the CFLAGS following OVS's own
"-Wformat -Wno-format-zero-length", which therefore overrides
-Wno-format-zero-length, so this commit adds an extra
-Wno-format-zero-length to avoid those false positives.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Justin Pettit <jpettit@nicira.com>
Ben Pfaff [Tue, 22 Jul 2014 04:00:04 +0000 (21:00 -0700)]
cmap, classifier: Avoid unsafe aliasing in iterators.
CMAP_FOR_EACH and CLS_FOR_EACH and their variants tried to use void ** as
a "pointer to any kind of pointer". That is a violation of the aliasing
rules in ISO C which technically yields undefined behavior. With GCC 4.1,
it causes both warnings and actual misbehavior. One option would to add
-fno-strict-aliasing to the compiler flags, but that would only help with
GCC; who knows whether this can be worked around with other compilers.
Instead, this commit rewrites the iterators to avoid disallowed pointer
aliasing.
VMware-BZ: #1287651 Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
The check for the need of default values was in the wrong place,
causing no prefix tracking to be used when database had no
configuration for a flow table. Missing configuration means that
defaults should be used.
To limit clutter on the log, we now log the prefix tracking
configuration when it is explicitly set in the database.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
lib/ovs-rcu: evaluate argument of ovsrcu_get only once.
As ovsrcu_get() looks like a function call, it is reasonable for the
callers to expect that the arguments are evaluated only once.
CONST_CAST expands its 'POINTER' argument multiple times, and the
exact effect of this turned out to be compiler dependent. Fix this by
expanding the macro argument before CONST_CAST, and removing
unnecessary CONST_CASTs.
netdev-dpdk: Fix race condition with DPDK mempools in non pmd threads
DPDK mempools rely on rte_lcore_id() to implement a thread-local cache.
Our non pmd threads had rte_lcore_id() == 0. This allowed concurrent access to
the "thread-local" cache, causing crashes.
This commit resolves the issue with the following changes:
- Every non pmd thread has the same lcore_id (0, for management reasons), which
is not shared with any pmd thread (lcore_id for pmd threads now start from 1)
- DPDK mbufs must be allocated/freed in pmd threads. When there is the need to
use mempools in non pmd threads, like in dpdk_do_tx_copy(), a mutex must be
held.
- The previous change does not allow us anymore to pass DPDK mbufs to handler
threads: therefore this commit partially revert 143859ec63d45e. Now packets
are copied for upcall processing. We can remove the extra memcpy by
processing upcalls in the pmd thread itself.
With the introduction of the extra locking, the packet throughput will be lower
in the following cases:
- When using internal (tap) devices with DPDK devices on the same datapath.
Anyway, to support internal devices efficiently, we needed DPDK KNI devices,
which will be proper pmd devices and will not need this locking.
- When packets are processed in the slow path by non pmd threads. This overhead
can be avoided by handling the upcalls directly in pmd threads (a change that
has already been proposed by Ryan Wilson)
Also, the following two fixes have been introduced:
- In dpdk_free_buf() use rte_pktmbuf_free_seg() instead of rte_mempool_put().
This allows OVS to run properly with CONFIG_RTE_LIBRTE_MBUF_DEBUG DPDK option
- Do not bulk free mbufs in a transmission queue. They may belong to different
mempools
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Alex Wang [Fri, 18 Jul 2014 21:27:36 +0000 (14:27 -0700)]
netlink-socket: Do not make flow_dump block on netlink socket.
Commit 93295354 (netlink-socket: Simplify multithreaded dumping
to match Linux reality.) makes the call to recvmsg() block if no
messages are available. This can cause revalidator threads hanging
for long time or even deadlock when main thread tries to stop the
revalidator threads.
This commit fixes the issue by enabling the MSG_DONTWAIT flag in
the call to recvmsg().
Signed-off-by: Alex Wang <alexw@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Without this change, with shared libraries, VLOG
constructor for ovsdb-server would get called twice corrupting
the 'vlog_modules' list causing an infinite loop.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Reported-by: Gur Stavi <gstavi@mrv.com> Acked-by: Ben Pfaff <blp@nicira.com>
lib/classifier: Return all matching prefix lengths from trie lookup.
Previously we only returned the last matching prefix length
encountered during a trie lookup, and skipped subtables that had
prefixes longer than that. This patch changes the trie lookup
functions to return all matching prefix lengths seen, so that all
non-matching prefix lengths can be skipped.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
lib/classifier: Unify struct classifier and cls_classifier.
Now that it is clear that struct cls_classifier itself does not
need RCU indirection and pvector is defined in its own header, it
is possible get rid of the indirection from struct classifier to
struct cls_classifier.
- Since we have two dpdk classes, we should split the initial operations needed
by both classes from the initialization needed by each class.
- The dpdk_ring class does not need an initialization function: it has been
removed. This also prevents many testcase from failing, because
dpdk_ring_class_init() was printing an unexpected log message
(OVS_VSWITCHD_START at tests/ofproto-macros.at:54 check for a specific set of
startup log messages)
- If the user doesn't pass the --dpdk option we do not register the dpdk*
classes
- Do not call VLOG_ERR if there are 0 dpdk ethernet device. OVS can now be used
with dpdk_ring devices.
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Ben Pfaff [Thu, 17 Jul 2014 17:27:00 +0000 (10:27 -0700)]
Simplify ofproto_controller_info by using a struct smap in place of array.
The only client for ofproto_controller_info was transforming the array of
pairs into an smap anyway. It's easy for the code that fills in the array
to generate it as an smap directly, and it's also easier to extend later.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
Alex Wang [Wed, 16 Jul 2014 01:52:19 +0000 (18:52 -0700)]
stp: Make stp-disabled port forward stp bpdu packets.
Commit 0d1cee123a84 (stp: Fix bpdu tx problem in listening state)
makes ovs drop the stp bpdu packets if stp is not enabled on the
input port. However, when pif bridge is used and stp is enabled
on the integration bridge. The flow translation of stp bpdu
packets will go through a level of resubmission which changes
the input port to the corresponding peer port. Since, the
patch port on the pif bridge does not have stp enabled, the
flow translation will drop the bpdu packets.
This commit fixes the issue by making ovs forward stp bpdu packets
on stp-disabled port.
cmap_next_position() didn't update the node pointer while iterating through a
list of nodes with the same hash.
This commit fixes the bug and improve test-cmap to detect it.
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
This patch enables the client dpdk rings within the netdev-dpdk. It adds
a new dpdk device called dpdkr (other naming suggestions?). This allows
for the use of shared memory to communicate with other dpdk applications,
on the host or within a virtual machine. Instructions for use are in
INSTALL.DPDK.
This has been tested on Intel multi-core platforms and with the client
application within the host.
Ben Pfaff [Wed, 16 Jul 2014 16:39:49 +0000 (09:39 -0700)]
netlink-socket: Simplify multithreaded dumping to match Linux reality.
Commit 0791315e4d (netlink-socket: Work around kernel Netlink dump thread
races.) introduced a simple workaround for Linux kernel races in Netlink
dumps. However, the code remained more complicated than needed. This
commit simplifies it.
The main reason for complication in the code was 'status_seq' in nl_dump.
This member was there to allow a thread to wait for some other thread to
refill the socket buffer with another dump message (although we did not
understand the reason at the time it was introduced). Now that we know
that Netlink dumps properly need to be serialized to work in existing
Linux kernels, there's no additional value in having 'status_seq',
because serialized recvmsg() calls always refill the socket buffer
properly.
This commit updates nl_msg_next() to clear its buffer argument on error.
This is a more convenient interface for the new version of the Netlink
dump code. nl_msg_next() doesn't have any other callers.
Joe Stringer [Tue, 15 Jul 2014 12:04:52 +0000 (12:04 +0000)]
dpif: Update documentation for RCU-protected actions.
The userspace datapath returns RCU-protected actions from flow_get() and
flow_dump_next(). This doesn't cause any trouble for current users of
these functions, but it imposes additional constraints on their use.
This patch makes the dpif documentation more explicit about how the
results of these functions can be used.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
tests: Use the 'LARGE_MSECS' variation of time/warp at more places.
commit 8661af798(timeval: Provide a variation for time/warp command)
provided a variation of time/warp command to prevent multiple
invocations of time/warp. That commit changed only the users in bfd.at
and cfm.at as they used it the most. Since we haven't had any negative
confequences because of the change, add the remaining users.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
debian: Automatically start openvswitch before first invocation of ovs-vsctl.
In the 'interfaces' file, if an admin adds the openvswitch interface
in 'auto', ifupdown will try to create OVS interfaces even before
openvswitch has started. In a case like that, assume that the admin
knows what he is doing and try to start openvswitch.
The negatives I see are
1. /usr is NFS mounted, in which case expect the admin
to mount it through initramfs.
2. syslog through network may not be accessible.
This is similar to what rhel does with commit 602453000e28ec10
(rhel: Automatically start openvswitch service before bringing an ovs
interface up or down)
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Mon, 14 Jul 2014 21:06:03 +0000 (14:06 -0700)]
netlink-socket: Fix handling socket allocation failure in nl_dump_start().
If nl_pool_alloc() failed, then 'dump' was not initialized at all and
further use of the dump would access uninitialized data, probably causing
a crash.
Found by inspection.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com>
Ben Pfaff [Mon, 14 Jul 2014 20:17:05 +0000 (13:17 -0700)]
dpif-linux: Avoid null dereference if all ports disappear.
When dpif_linux_refresh_channels() refreshes the set of channels when
the number of handlers changes, it destroys all the dpif's channels and
sets dpif->uc_array_size to 0. If the port dump later in the function
turns up no ports (which generally indicates a bug), then no channels will
be allocated and thus dpif->uc_array_size will remain 0 and 'channels' will
be null in each handler. This is self-consistent, at least, but
dpif_linux_port_get_pid__() was still willing in this situation to
try to access element 0 of the set of channels, dereferencing a null
pointer.
This fixes the problem.
I encountered this while looking at a bug that I had introduced during
development that caused the port dump to always be empty. It would be
difficult to encounter in normal use.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com>
Ben Pfaff [Mon, 14 Jul 2014 21:47:12 +0000 (14:47 -0700)]
ofproto: Avoid theoretical double free of large rule collections.
collect_rules_strict() and collect_rules_loose() destroy the rule
collections that they create if they return an error, and some of their
callers then go on to destroy them again. This could cause a double-free
in the case where rule_collection_destroy() actually calls free(). That
never happens in the current tree, because free() is only necessary if
malloc() was called and there's a 64-entry stub that none of the current
code in collect_rules_*() can fill up in their error cases. Still, it
seems better to fix the problem.
Found by clang-analyzer.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Joe Stringer [Tue, 8 Jul 2014 07:04:05 +0000 (07:04 +0000)]
revalidator: Revalidate missed flows.
If the datapath doesn't dump a flow for some reason, and the current
dump is expected to revalidate all flows in the datapath, then perform
revalidation for those flows by fetching them during the sweep phase.
If revalidation is not required, then leave the flow in the datapath and
don't revalidate it.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Joe Stringer [Tue, 8 Jul 2014 07:04:04 +0000 (07:04 +0000)]
dpif: Support fetching flow mask via dpif_flow_get().
Change the interface to allow implementations to pass back a buffer, and
allow callers to specify which of actions, mask, and stats they wish to
receive. This will be used in the next commit.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
ovsdb-server.at: Skip tests that use ovsdb-server's "--run" on Windows.
ovsdb-server's port on Windows does not support the "--run" option.
The two tests skipped in this commit make use of "--run" option to
test ovsdb-server's truncating of corrupt log or bad transaction.
It looks a little tricky to get this test running on Windows without
the "--run" option implemented.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Fri, 11 Jul 2014 18:03:08 +0000 (11:03 -0700)]
tests: Disable glibc memory checking under glibc <= 2.11.
We noticed that the unit tests sometimes fail on XenServer inside glibc's
memory checker, in the free_check() function. It turns out that the
glibc memory checker in glibc 2.11 and earlier had an internal race that
caused false positives in multithreaded programs.
This commit avoids the problem by disabling the glibc memory checker in
glibc 2.11 and earlier.
VMware-BZ: #1267127 Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
datapath/flow_netlink: Create right mask with disabled megaflows
If megaflows are disabled, the userspace does not send the netlink attribute
OVS_FLOW_ATTR_MASK, and the kernel must create an exact match mask.
sw_flow_mask_set() sets every bytes (in 'range') of the mask to 0xff, even the
bytes that represent padding for struct sw_flow, or the bytes that represent
fields that may not be set during ovs_flow_extract().
This is a problem, because when we extract a flow from a packet,
we do not memset() anymore the struct sw_flow to 0 (since commit 9cef26ac6a71).
This commit gets rid of sw_flow_mask_set() and introduces mask_set_nlattr(),
which operates on the netlink attributes rather than on the mask key. Using
this approach we are sure that only the bytes that the user provided in the
flow are matched.
Also, if the parse_flow_mask_nlattrs() for the mask ENCAP attribute fails, we
now return with an error.
Reported-by: Alex Wang <alexw@nicira.com> Suggested-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Now that all the relevant classifier structures use RCU and internal
mutual exclusion for modifications, we can remove the fat-rwlock and
thus make the classifier lookups lockless.
As the readers are operating concurrently with the writers, a
concurrent reader may or may not see a new rule being added by a
writer, depending on how the concurrent events overlap with each
other. Overall, this is no different from the former locked behavior,
but there the visibility of the new rule only depended on the timing
of the locking functions.
A new rule is first added to the segment indices, so the readers may
find the rule in the indices before the rule is visible in the
subtables 'rules' map. This may result in us losing the opportunity
to quit lookups earlier, resulting in sub-optimal wildcarding. This
will be fixed by forthcoming revalidation always scheduled after flow
table changes.
Similar behavior may happen due to us removing the overlapping rule
(if any) from the indices only after the corresponding new rule has
been added.
The subtable's max priority is updated only after a rule is inserted
to the maps, so the concurrent readers may not see the rule, as the
updated priority ordered subtable list will only be visible after the
subtable's max priority is updated.
Similarly, the classifier's partitions are updated by the caller after
the rule is inserted to the maps, so the readers may keep skipping the
subtable until they see the updated partitions.
cls_set_prefix_fields() now synchronizes explicitly with the readers,
waiting them to finish using the old configuration before changing to
the new configuration.
Add an internal mutex to struct cls_classifier, and reorganize
classifier internal structures according to the user of each field,
marking the fields that need to be protected by the mutex. This makes
locking requirements easier to track, and may make lookup more memory
efficient.
After this patch there is some double locking, as callers are taking
the fat-rwlock, and we take the mutex internally. A following patch
will remove the classifier fat-rwlock, removing the (double) locking
overhead.
lib/classifier: Simplify iteration with C99 declaration.
Hide the cursor from the classifier iteration users and move locking to
the iterators. This will make following RCU changes simpler, as the call
sites of the iterators need not be changed at that point.
Ben Pfaff [Thu, 10 Jul 2014 23:48:16 +0000 (16:48 -0700)]
netlink-socket: Work around kernel Netlink dump thread races.
The Linux kernel Netlink implementation has two races that cause problems
for processes that attempt to dump a table in a multithreaded manner.
The first race is in the structure of the kernel netlink_recv() function.
This function pulls a message from the socket queue and, if there is none,
reports EAGAIN:
skb = skb_recv_datagram(sk, flags, noblock, &err);
if (skb == NULL)
goto out;
Only if a message is successfully read from the socket queue does the
function, toward the end, try to queue up a new message to be dumped:
if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
ret = netlink_dump(sk);
if (ret) {
sk->sk_err = ret;
sk->sk_error_report(sk);
}
}
This means that if thread A reads a message from a dump, then thread B
attempts to read one before A queues up the next, B will get EAGAIN. This
means that, following EAGAIN, B needs to wait until A returns to userspace
before it tries to read the socket again. nl_dump_next() already does
this, using 'dump->status_seq' (although the need for it has never been
explained clearly, to my knowledge).
The second race is more serious. Suppose thread X and thread Y both
simultaneously attempt to queue up a new message to be dumped, using the
call to netlink_dump() quoted above. netlink_dump() begins with:
mutex_lock(nlk->cb_mutex);
cb = nlk->cb;
if (cb == NULL) {
err = -EINVAL;
goto errout_skb;
}
Suppose that X gets cb_mutex first and finds that the dump is complete. It
will therefore, toward the end of netlink_dump(), clear nlk->cb to NULL to
indicate that no dump is in progress and release the mutex:
nlk->cb = NULL;
mutex_unlock(nlk->cb_mutex);
When Y grabs cb_mutex afterward, it will see that nlk->cb is NULL and
return -EINVAL as quoted above. netlink_recv() stuffs -EINVAL in sk_err,
but that error is not reported immediately; instead, it is saved for the
next read from the socket. Since Open vSwitch maintains a pool of Netlink
sockets, that next failure can crop up pretty much anywhere. One of the
worst places for it to crop up is in the execution of a later transaction
(e.g. in nl_sock_transact_multiple__()), because userspace treats Netlink
transactions as idempotent and will re-execute them when socket errors
occur. For a transaction that sends a packet, this causes packet
duplication, which we actually observed in practice. (ENOBUFS should
actually cause transactions to be re-executed in many cases, but EINVAL
should not; this is a separate bug in the userspace netlink code.)
VMware-BZ: #1283188 Reported-and-tested-by: Alex Wang <alexw@nicira.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
Ben Pfaff [Thu, 10 Jul 2014 21:32:10 +0000 (14:32 -0700)]
netlink-socket: Fix sign of error code.
Commit 8f20fd98db (netlink-socket: Work around upstream kernel Netlink
bug.) got the sign of the error code wrong, so that it reported e.g. -22
for EINVAL to nl_sock_recv__()'s caller, instead of 22.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
Andy Zhou [Thu, 10 Jul 2014 07:50:26 +0000 (00:50 -0700)]
datapath: refactor do_output() to move skb_clone NULL check out of fast path
skb_clone() NULL check is implemented in do_output(), as past of the
common (fast) path. Refactoring so that NULL check is done in the
slow path, immediately after skb_clone() is called.
Besides optimization, this change also improves code readability by
making the skb_clone() NULL check consistent within OVS datapath
module.
Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Andy Zhou [Thu, 10 Jul 2014 07:30:27 +0000 (00:30 -0700)]
datapath: add skb_clone NULL check in the recirc action.
Refactoring recirc action implementation.
The main change is to fix a bug where the NULL check after skb clone()
call is missing. The fix is to return -ENOMEM whenever skb_clone()
failed to create a clone.
Also rearranged adjacent code to improve readability.
Reported-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Ben Pfaff [Thu, 10 Jul 2014 20:21:35 +0000 (13:21 -0700)]
netlink-protocol: Remove definition of struct sockaddr_nl.
This struct is used only in netlink-socket.c, which is only used on Linux,
which in turn gets the definition from <linux/netlink.h>. On Windows the
definition actually causes a small amount of trouble because Windows does
not define sa_family_t (despite POSIX), so it's better to remove it.
Even if other platforms adopt Netlink, I have no reason to believe that
they will use the same sockaddr format as Linux.
CC: Saurabh Shah <ssaurabh@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>