Ben Pfaff [Mon, 14 Jul 2014 21:47:12 +0000 (14:47 -0700)]
ofproto: Avoid theoretical double free of large rule collections.
collect_rules_strict() and collect_rules_loose() destroy the rule
collections that they create if they return an error, and some of their
callers then go on to destroy them again. This could cause a double-free
in the case where rule_collection_destroy() actually calls free(). That
never happens in the current tree, because free() is only necessary if
malloc() was called and there's a 64-entry stub that none of the current
code in collect_rules_*() can fill up in their error cases. Still, it
seems better to fix the problem.
Found by clang-analyzer.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>
Joe Stringer [Tue, 8 Jul 2014 07:04:05 +0000 (07:04 +0000)]
revalidator: Revalidate missed flows.
If the datapath doesn't dump a flow for some reason, and the current
dump is expected to revalidate all flows in the datapath, then perform
revalidation for those flows by fetching them during the sweep phase.
If revalidation is not required, then leave the flow in the datapath and
don't revalidate it.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Joe Stringer [Tue, 8 Jul 2014 07:04:04 +0000 (07:04 +0000)]
dpif: Support fetching flow mask via dpif_flow_get().
Change the interface to allow implementations to pass back a buffer, and
allow callers to specify which of actions, mask, and stats they wish to
receive. This will be used in the next commit.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
ovsdb-server.at: Skip tests that use ovsdb-server's "--run" on Windows.
ovsdb-server's port on Windows does not support the "--run" option.
The two tests skipped in this commit make use of "--run" option to
test ovsdb-server's truncating of corrupt log or bad transaction.
It looks a little tricky to get this test running on Windows without
the "--run" option implemented.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Fri, 11 Jul 2014 18:03:08 +0000 (11:03 -0700)]
tests: Disable glibc memory checking under glibc <= 2.11.
We noticed that the unit tests sometimes fail on XenServer inside glibc's
memory checker, in the free_check() function. It turns out that the
glibc memory checker in glibc 2.11 and earlier had an internal race that
caused false positives in multithreaded programs.
This commit avoids the problem by disabling the glibc memory checker in
glibc 2.11 and earlier.
VMware-BZ: #1267127 Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
datapath/flow_netlink: Create right mask with disabled megaflows
If megaflows are disabled, the userspace does not send the netlink attribute
OVS_FLOW_ATTR_MASK, and the kernel must create an exact match mask.
sw_flow_mask_set() sets every bytes (in 'range') of the mask to 0xff, even the
bytes that represent padding for struct sw_flow, or the bytes that represent
fields that may not be set during ovs_flow_extract().
This is a problem, because when we extract a flow from a packet,
we do not memset() anymore the struct sw_flow to 0 (since commit 9cef26ac6a71).
This commit gets rid of sw_flow_mask_set() and introduces mask_set_nlattr(),
which operates on the netlink attributes rather than on the mask key. Using
this approach we are sure that only the bytes that the user provided in the
flow are matched.
Also, if the parse_flow_mask_nlattrs() for the mask ENCAP attribute fails, we
now return with an error.
Reported-by: Alex Wang <alexw@nicira.com> Suggested-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Now that all the relevant classifier structures use RCU and internal
mutual exclusion for modifications, we can remove the fat-rwlock and
thus make the classifier lookups lockless.
As the readers are operating concurrently with the writers, a
concurrent reader may or may not see a new rule being added by a
writer, depending on how the concurrent events overlap with each
other. Overall, this is no different from the former locked behavior,
but there the visibility of the new rule only depended on the timing
of the locking functions.
A new rule is first added to the segment indices, so the readers may
find the rule in the indices before the rule is visible in the
subtables 'rules' map. This may result in us losing the opportunity
to quit lookups earlier, resulting in sub-optimal wildcarding. This
will be fixed by forthcoming revalidation always scheduled after flow
table changes.
Similar behavior may happen due to us removing the overlapping rule
(if any) from the indices only after the corresponding new rule has
been added.
The subtable's max priority is updated only after a rule is inserted
to the maps, so the concurrent readers may not see the rule, as the
updated priority ordered subtable list will only be visible after the
subtable's max priority is updated.
Similarly, the classifier's partitions are updated by the caller after
the rule is inserted to the maps, so the readers may keep skipping the
subtable until they see the updated partitions.
cls_set_prefix_fields() now synchronizes explicitly with the readers,
waiting them to finish using the old configuration before changing to
the new configuration.
Add an internal mutex to struct cls_classifier, and reorganize
classifier internal structures according to the user of each field,
marking the fields that need to be protected by the mutex. This makes
locking requirements easier to track, and may make lookup more memory
efficient.
After this patch there is some double locking, as callers are taking
the fat-rwlock, and we take the mutex internally. A following patch
will remove the classifier fat-rwlock, removing the (double) locking
overhead.
lib/classifier: Simplify iteration with C99 declaration.
Hide the cursor from the classifier iteration users and move locking to
the iterators. This will make following RCU changes simpler, as the call
sites of the iterators need not be changed at that point.
Ben Pfaff [Thu, 10 Jul 2014 23:48:16 +0000 (16:48 -0700)]
netlink-socket: Work around kernel Netlink dump thread races.
The Linux kernel Netlink implementation has two races that cause problems
for processes that attempt to dump a table in a multithreaded manner.
The first race is in the structure of the kernel netlink_recv() function.
This function pulls a message from the socket queue and, if there is none,
reports EAGAIN:
skb = skb_recv_datagram(sk, flags, noblock, &err);
if (skb == NULL)
goto out;
Only if a message is successfully read from the socket queue does the
function, toward the end, try to queue up a new message to be dumped:
if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
ret = netlink_dump(sk);
if (ret) {
sk->sk_err = ret;
sk->sk_error_report(sk);
}
}
This means that if thread A reads a message from a dump, then thread B
attempts to read one before A queues up the next, B will get EAGAIN. This
means that, following EAGAIN, B needs to wait until A returns to userspace
before it tries to read the socket again. nl_dump_next() already does
this, using 'dump->status_seq' (although the need for it has never been
explained clearly, to my knowledge).
The second race is more serious. Suppose thread X and thread Y both
simultaneously attempt to queue up a new message to be dumped, using the
call to netlink_dump() quoted above. netlink_dump() begins with:
mutex_lock(nlk->cb_mutex);
cb = nlk->cb;
if (cb == NULL) {
err = -EINVAL;
goto errout_skb;
}
Suppose that X gets cb_mutex first and finds that the dump is complete. It
will therefore, toward the end of netlink_dump(), clear nlk->cb to NULL to
indicate that no dump is in progress and release the mutex:
nlk->cb = NULL;
mutex_unlock(nlk->cb_mutex);
When Y grabs cb_mutex afterward, it will see that nlk->cb is NULL and
return -EINVAL as quoted above. netlink_recv() stuffs -EINVAL in sk_err,
but that error is not reported immediately; instead, it is saved for the
next read from the socket. Since Open vSwitch maintains a pool of Netlink
sockets, that next failure can crop up pretty much anywhere. One of the
worst places for it to crop up is in the execution of a later transaction
(e.g. in nl_sock_transact_multiple__()), because userspace treats Netlink
transactions as idempotent and will re-execute them when socket errors
occur. For a transaction that sends a packet, this causes packet
duplication, which we actually observed in practice. (ENOBUFS should
actually cause transactions to be re-executed in many cases, but EINVAL
should not; this is a separate bug in the userspace netlink code.)
VMware-BZ: #1283188 Reported-and-tested-by: Alex Wang <alexw@nicira.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
Ben Pfaff [Thu, 10 Jul 2014 21:32:10 +0000 (14:32 -0700)]
netlink-socket: Fix sign of error code.
Commit 8f20fd98db (netlink-socket: Work around upstream kernel Netlink
bug.) got the sign of the error code wrong, so that it reported e.g. -22
for EINVAL to nl_sock_recv__()'s caller, instead of 22.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
Andy Zhou [Thu, 10 Jul 2014 07:50:26 +0000 (00:50 -0700)]
datapath: refactor do_output() to move skb_clone NULL check out of fast path
skb_clone() NULL check is implemented in do_output(), as past of the
common (fast) path. Refactoring so that NULL check is done in the
slow path, immediately after skb_clone() is called.
Besides optimization, this change also improves code readability by
making the skb_clone() NULL check consistent within OVS datapath
module.
Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Andy Zhou [Thu, 10 Jul 2014 07:30:27 +0000 (00:30 -0700)]
datapath: add skb_clone NULL check in the recirc action.
Refactoring recirc action implementation.
The main change is to fix a bug where the NULL check after skb clone()
call is missing. The fix is to return -ENOMEM whenever skb_clone()
failed to create a clone.
Also rearranged adjacent code to improve readability.
Reported-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Ben Pfaff [Thu, 10 Jul 2014 20:21:35 +0000 (13:21 -0700)]
netlink-protocol: Remove definition of struct sockaddr_nl.
This struct is used only in netlink-socket.c, which is only used on Linux,
which in turn gets the definition from <linux/netlink.h>. On Windows the
definition actually causes a small amount of trouble because Windows does
not define sa_family_t (despite POSIX), so it's better to remove it.
Even if other platforms adopt Netlink, I have no reason to believe that
they will use the same sockaddr format as Linux.
CC: Saurabh Shah <ssaurabh@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Gurucharan Shetty <gshetty@nicira.com>
match_validate() enforce that a mask matching on NDP attributes has also an
exact match on ICMPv6 type.
The ICMPv6 type, which is 8-bit wide, is stored in the 'tp.src' field of
'struct sw_flow_key', which is 16-bit wide.
Therefore, an exact match on ICMPv6 type should only check the first 8 bits.
This commit fixes a bug that prevented flows with an exact match on NDP field
from being installed
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
daemon: restart child process if it died before signaling its readiness
The child process (the one being monitored) could die before it was able
to call fork_notify_startup() function. If such situation arises, then
parent process (the one monitoring child process) would also terminate
with a fatal log message:
...|EMER|fork child died before signaling startup (killed (...))
This patch changes that behavior by always restarting child process
if it was able to start up at least once in the past. However, if
child was not able to start up even once, then the monitor process
would still terminate, because that would most likely indicate a
persistent programming or system error.
To reproduce use following script:
while : ; do kill -SIGSEGV `cat /var/run/openvswitch/ovs-vswitchd.pid`; done
Signed-Off-By: Ansis Atteka <aatteka@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
VMware-BZ: 1273550
timeval: Initialize 'unix_epoch' for Windows correctly.
Till now, we were initializing 'unix_epoch' through time_init().
But if there was a call directly to xgettimeofday(), we would
miss the initialization causing overflows. This commit fixes it
by pre-calculating the value and assigning it globally.
Also add-in a missing return statement.
Reported-by: Alessandro Pilotti <apilotti@cloudbasesolutions.com> Reported-by: Linda Sun <lsun@vmware.com> Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Alex Wang [Tue, 8 Jul 2014 04:58:33 +0000 (21:58 -0700)]
dpif-linux: Recheck the socket pointer existence before getting its pid.
This commit fixes a race between port deletion and flow miss handling.
More specifically, a port could be removed by main thread while
the handler thread is handling the flow miss from it. If the flow
requires slow path action, the handler thread will try querying a pid
from port's socket. Since the port has been deleted, the query will
cause a dereference of NULL socket pointer.
This commit makes the handler thread recheck the socket pointer before
dereferencing it.
Joe Stringer [Fri, 4 Jul 2014 06:58:28 +0000 (06:58 +0000)]
tests: Fix race in 'balance-tcp bonding' test.
Running the test in a tight loop could cause this test to fail after
about 5 runs, with some of the ports reporting "may_enable: false" in
the "ovs-appctl bond/show" output. This commit fixes the race condition
by waiting for may_enable to be true for all bond ports.
I suspect that LACP negotiation finishes, but the main thread doesn't
have a chance to enable the ports before we send the test packets.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
This patch fixes two compile warnings introduced by commit 64b73291 ("util: create a copy of program_name"):
1. ../lib/util.c:457:5: error: passing argument 1 of 'free'
discards 'const' qualifier from pointer target type; And
2. ../lib/util.c:463:5: error: ISO C90 forbids mixed declarations
and code [-Werror=declaration-after-statement] (affected only
branch-2.3 that is C90 compliant and not the master)
Reported-By: Joe Stringer <jstringer@nicira.com> Reported-By: Lorand Jakab <lojakab@cisco.com> Signed-Off-By: Ansis Atteka <aatteka@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com>
This is a prerequisite step in making the classifier lookups lockless.
If taking a reference fails, we do the lookup again, as a new (lower
priority) rule may now match instead.
Also remove unwildcarding dl_type and nw_frag, as these are already
taken care of by xlate_actions().
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
After a quick analysis, in most cases the access to refcounted objects
is clearly protected either with an explicit lock/mutex, or RCU. there
are only a few places where I left a call to ovs_refcount_unref().
Upon closer analysis it may well be that those could also use the
relaxed form.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
When a reference counted object is also RCU protected the deletion of
the object's memory is always postponed. This allows
memory_order_relaxed to be used also for unreferencing, as RCU
quiescing provides a full memory barrier (it has to, or otherwise
there could be lingering accesses to objects after they are recycled).
Also, when access to the reference counted object is protected via a
mutex or a lock, the locking primitives provide the required memory
barrier functionality.
Also, add ovs_refcount_try_ref_rcu(), which takes a reference only if
the refcount is non-zero and returns true if a reference was taken,
false otherwise. This can be used in combined RCU/refcount scenarios
where we have an RCU protected reference to an refcounted object, but
which may be unref'ed at any time. If ovs_refcount_try_ref_rcu()
fails, the object may still be safely used until the current thread
quiesces.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
ovs-atomic: Use explicit memory order for ovs_refcount.
Use explicit variants of atomic operations for the ovs_refcount to
avoid the overhead of the default memory_order_seq_cst.
Adding a reference requires no memory ordering, as the calling thread
is already assumed to have protected access to the object being
reference counted. Hence, memory_order_relaxed is used for
ovs_refcount_ref(). ovs_refcount_read() does not change the reference
count, so it can also use memory_order_relaxed.
Unreferencing an object needs a release barrier, so that none of the
accesses to the protected object are reordered after the atomic
decrement operation. Additionally, an explicit acquire barrier is
needed before the object is recycled, to keep the subsequent accesses
to the object's memory from being reordered before the atomic
decrement operation.
This patch follows the memory ordering and argumentation discussed
here:
Suppress a gcc warning which was introduced by
commit e0b48482c16b6eaa7f14d8c7e7c6275528881b9e.
("util: create a copy of program_name")
I guess MSVC doesn't have a corresponding warning.
Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Acked-by: Lorand Jakab <lojakab@cisco.com>
Commit 8a9562 ("dpif-netdev: Add DPDK netdev.") reversed sequence
in which set_program_name() and proctitle_init() functions are
called. This introduced a regression where program_name and argv_start
would point to exactly the same memory (previously both of these
pointers were pointing to different memory locations because
proctitle_init() would have beforehand created a copy of argv[0]
for the succeeding set_program_name() call).
This regression on my system caused ovs-vswitchd monitoring process to
show up without process name:
... 00:00:00 : monitoring pid 26308 (healthy)
Ps output was lacking process name because following code was
using overlapping memory for source and target buffer:.
Joe Stringer [Tue, 1 Jul 2014 09:54:18 +0000 (09:54 +0000)]
revalidator: Simplify push_dump_ops__().
Commit acaa8dac49 (revalidator: Eliminate duplicate flow handling.)
ensured that a ukey will always exist for a given flow, even if it is
about to be deleted. This means that push_dump_ops__() no longer needs
to handle the case where there is no ukey. This commit removes the
redundant code.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
datapath: Additional logging for -EINVAL on flow setups.
There are many possible ways that a flow can be invalid so we've
added logging for most of them. This adds logs for the remaining
possible cases so there isn't any ambiguity while debugging.
Ben Pfaff [Mon, 30 Jun 2014 21:57:42 +0000 (14:57 -0700)]
netlink-socket: Work around upstream kernel Netlink bug.
The upstream kernel net/netlink/af_netlink.c netlink_recvmsg() contains the
following code to refill the Netlink socket buffer with more dump skbs
while a dump is in progress:
if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
ret = netlink_dump(sk);
if (ret) {
sk->sk_err = ret;
sk->sk_error_report(sk);
}
}
The netlink_dump() function that this calls returns a negative number on
error, the convention used throughout the kernel, and thus sk->sk_err
receives a negative value on error.
However, sk->sk_err is supposed to contain either 0 or a positive errno
value, as one can see from a quick "grep" through net for 'sk_err =', e.g.:
The result is that the next attempt to receive from the socket will return
the error to userspace with the wrong sign.
(The root of the error in this case is that multiple threads are attempting
to read a single flow dump from a shared fd. That should work, but the
kernel has an internal race that can result in one or more of those threads
hitting the EINVAL case at the start of netlink_dump(). The EINVAL is
harmless in this case and userspace should be able to ignore it, but
reporting the EINVAL as if it were a 22-byte message received in userspace
throws a real wrench in the works.)
This bug makes me think that there are probably not many programs doing
multithreaded Netlink dumps. Maybe it is good that we are considering
other approaches.
VMware-BZ: #1255704 Reported-by: Mihir Gangar <gangarm@vmware.com> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
Joe Stringer [Wed, 2 Jul 2014 07:41:33 +0000 (07:41 +0000)]
revalidator: Improve optimization to skip revalidation.
The should_revalidate() optimisation introduced with commit 698ffe3623
(revalidator: Only revalidate high-throughput flows.) was a little
aggressive, occasionally deleting flows even when OVS is quite capable
of performing full revalidation.
This commit modifies the logic to:
* Firstly, check if we are likely to handle full revalidation, and
attempt that instead.
* Secondly, fall back to the existing flow throughput estimations to
determine whether to revalidate the flow or just delete it.
Simon Horman [Mon, 30 Jun 2014 04:20:14 +0000 (13:20 +0900)]
datapath: Allow pop and push MPLS actions after pop VLAN
This patch loosens the restrictions surrounding push and pop MPLS actions
such that they will be allowed after a pop VLAN action if the inner
ethernet type is acceptable for pop and push MPLS actions. This implies
that there is only one VLAN tag present.
Some analysis of logic of this change is as follows:
The purpose of tracking vlan_tci is to allow prohibition of push
and pop MPLS actions in the presence of a VLAN. In this scenario
the VLAN_TAG_PRESENT bit of vlan_tci is set and eth_type is that of
the packet with the outermost VLAN tag removed.
A pop VLAN action may clear vlan_tci as it removes the outermost
VLAN tag and the push and pop MPLS logic may rely on eth_type for
their prohibition logic.
This will not allow push and pop MPLS on packets with multiple VLAN
tags, regardless of if they are all remove using POP VLAN, as there
is no mechanism to expose the inner ethernet type beyond that of
the outermost VLAN tag.
Alex Wang [Mon, 30 Jun 2014 21:51:02 +0000 (14:51 -0700)]
datapath: Use exact lookup for flow_get and flow_del.
Due to the race condition in userspace, there is chance that two
overlapping megaflows could be installed in datapath. And this
causes userspace unable to delete the less inclusive megaflow flow
even after it timeout, since the flow_del logic will stop at the
first match of masked flow.
This commit fixes the bug by making the kernel flow_del and flow_get
logic check all masks in that case.
Signed-off-by: Alex Wang <alexw@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Jesse Gross [Mon, 30 Jun 2014 20:43:25 +0000 (13:43 -0700)]
datapath: Change u64_stats_* to use _irq instead of _bh().
The upstream u64_stats API has been changed to remove the _bh()
versions and switch all consumers to use IRQ safe variants instead.
This was done to be safe for netpoll generated packets, which can
occur in hard IRQ context. From a safety perspective, this doesn't
directly affect OVS since it doesn't support netpoll. However, this
change has been backported to older kernels so OVS needs to use the
new API to compile.
test-vconn: Change the expected error for Windows.
On Windows ECONNRESET is WSAECONNRESET.
Also, "unix" connections are done through TCP sockets.
For the 'refuse-connection' test, the error message for Windows
is WSAECONNRESET instead of EPIPE.
This just makes ovs-benchmark compile on windows.
This lets us go ahead with just a 'make' instead of
picking and choosing executables that are tested to work on
windows as arguments for make.
This commit does not make ovs-benchmark a supported utility
on windows.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Ryan Wilson [Fri, 27 Jun 2014 01:16:39 +0000 (18:16 -0700)]
netdev-dpdk: Fix memory leak in dpdk_do_tx_copy().
This patch fixes a bug where rte_pktmbuf_alloc() would fail and
packets which succeeded to allocate memory with rte_pktmbuf_alloc()
would not be sent and leak memory.
Also, as a byproduct of using a local variable to record dropped
packets, this reduces the locking of the netdev's mutex when
multiple packets are dropped in dpdk_do_tx_copy().
Signed-off-by: Ryan Wilson <wryan@nicira.com> Acked-by: Daniele Di Proietto <ddiproietto@vmware.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Ryan Wilson [Fri, 27 Jun 2014 00:41:46 +0000 (17:41 -0700)]
netdev-dpdk: Set current timestamp when flushing TX queue.
The current timestamp should be set every time the queue is flushed.
Thus, if DRAIN_TSC timer cycles have passed since the last timestamp,
the send queue should be flushed again.
Signed-off-by: Ryan Wilson <wryan@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
poll-loop: Create Windows event handles for sockets automatically.
We currently have a poll_fd_wait_event(fd, wevent, events) function that
is used at places common to Windows and Linux where we have to wait on
sockets. On Linux, 'wevent' is always set as zero. On Windows, for sockets,
when we send both 'fd' and 'wevent', we associate them with each other for
'events' and then wait on 'wevent'. Also on Windows, when we only send 'wevent'
to this function, we would simply wait for all events for that 'wevent'.
There is a disadvantage with this approach.
* Windows clients need to create a 'wevent' and then pass it along. This
means that at a lot of places where we create sockets, we also are forced
to create a 'wevent'.
With this commit, we pass the responsibility of creating a 'wevent' to
poll_fd_wait() in case of sockets. That way, a client using poll_fd_wait()
is only concerned about sockets and not about 'wevents'. There is a potential
disadvantage with this change in that we create events more often and that
may have a performance penalty. If that turns out to be the case, we will
eventually need to create a pool of wevents that can be re-used.
In Windows, there are cases where we want to wait on a event (not
associated with any sockets) and then control it using functions
like SetEvent() etc. For that purpose, introduce a new function
poll_wevent_wait(). For this function, the client needs to create a event
and then pass it along as an argument.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-By: Ben Pfaff <blp@nicira.com>
Pravin B Shelar [Thu, 26 Jun 2014 22:15:00 +0000 (15:15 -0700)]
datapath: Initialize OVS_CB in ovs_vport_receive()
On packet recv OVS CB is initialized in multiple function.
Following patch moves all these initialization to
ovs_vport_receive().
This patch also save a check in execute actions.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Jesse Gross <jesse@nicira.com>
Polehn, Mike A [Thu, 19 Jun 2014 22:58:26 +0000 (22:58 +0000)]
dpdk: High speed PMD physical NIC queue size
Large TX and RX queues are needed for high speed 10 GbE physical NICS.
Observed a 250% zero loss improvement over small NIC queue test for
port to port flow test.
Signed-off-by: Mike A. Polehn <mike.a.polehn@intel.com> Acked-by: Pravin B Shelar <pshelar@nicira.com>
Thomas Graf [Fri, 27 Jun 2014 07:31:57 +0000 (09:31 +0200)]
build: Allow building with autoconf 2.63
Reduces the dependency on autoconf from 2.64 to 2.63 to ease building
on older platforms. There is only a few macros missing and they can
be provided easily.
A handful of tests needed modification. The difference in quoting
behaviour between 2.63 and later require the m4_define() to be
manually unfolded.
The Debian control file is left untouched on purpose. The decision
whether to adjust the dependency is left to the respective maintainers.
Tested with autoconf 2.63 and 2.69.
Cc: Scott Mann <smann@noironetworks.com> Cc: Don Kehn <dkehn@noironetworks.com> Signed-off-by: Thomas Graf <tgraf@noironetworks.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
ovs-ofctl: Ability to read a hex string from file.
The unit test, "OFPST_TABLE reply - OF1.2" in ofp-print.at
sends a very large hex string as an argument to 'ovs-ofctl ofp-print'.
The length of the hex string exceeds the maximum command line length
in Windows. With this commit, we can pass the same hex string by
placing it inside a file.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
ovsdb-tool: Workaround inability to replace existing file on Windows.
rename() on an existing destination file fails on Windows. This commit
worksaround that problem.
There are two tests that test it. But both of them use the ovsdb-server's
--run option for the test and it does not exist in Windows. So change
the test to workaround the lack of that feature.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
ovsdb-server.at: Handle different error message for already opened database.
Commit ebed9f78(ovsdb-server: Improve message for "add-db" of
database already open.) improved the error message seen when
opening an already opened database on Linux. For Windows,
we still need to look for the lockfile error message.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
sflow feature needs to be investigated for Windows. Right now
test-sflow related tests do not pass because of LOOPBACK_INTERFACE
constraints for 'agent'. Add a TODO item and skip the tests.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
file_name.at: Skip a symlink related test for Windows.
There is no one-one mapping of symlinks between Linux and
Windows. This test currently fails on Windows and we do not
really need this functionality on Windows. So skip it.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
ovs-ofctl.at: Prevent msys from getting confused with ipv6 address.
msys has a set of rules which triggers an automatic conversion of
arguments into something else to suit Windows requirements. Sometimes
this also causes unwanted conversions. Details of the rules is here:
http://www.mingw.org/wiki/Posix_path_conversion
msys converts ::1/::1 into ;1\;1. To prevent this, use fullform
ipv6 address of the form 0:0:0:0:0:0:0:1 instead.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
rconn: Don't warn when peer abruptly closes connection.
On Windows, when a peer terminates without calling a close
on socket fd, the server ends up printing "connection dropped"
warning messages. We probably don't want those warning messages
when the error is WSAECONNRESET.
(In OVS unit tests on Windows, anytime a client like ovs-ofctl
calls a ovs_fatal without clean close(fd) on the socket, the
server like ovs-vswitchd prints warnings that cause unit tests
to fail.)
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
For this particular test, we pass the PKIDIR through a
ovsdb-tool transact and msys does not convert the path style.
(On Windows, we have to pass the directory in the form C:/foo/bar.pem.)
So get the Windows style path through 'pwd -W'(which is called through
the function pwd ())
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
stream-tcp: Cleanup files created for Windows "unix" sockets.
On Windows, we create "unix sockets" by creating TCP sockets
and hiding the TCP port number in files. When we close the
pstream session, we need to delete the file.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Jarno Rajahalme [Fri, 13 Jun 2014 17:38:05 +0000 (10:38 -0700)]
lib/classifier: Optimize megaflows for single rule case.
When, during a classifier lookup, we narrow down to a single potential
rule, it is enough to match on ("unwildcard") one bit that differs
between the packet and the rule.
This is a special case of the more general algorithm, where it is
sufficient to match on enough bits that separates the packet from all
higher priority rules than the matched rule. For a miss that would be
all the rules. Implementing this is expensive for a more than a few
rules. This patch starts by doing this for a single rule when we
already have it, also reducing the lookup cost by finishing the lookup
earlier than before.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Jarno Rajahalme [Thu, 26 Jun 2014 14:41:25 +0000 (07:41 -0700)]
lib/pvector: Non-intrusive RCU priority vector.
Factor out the priority vector code from the classifier.
Making the classifier use RCU instead of locking requires parallel
access to the priority vector, pointing to subtables in descending
priority order. When a new subtable is added, a new copy of the
priority vector is allocated, while the current readers can keep on
using the old copy they started with. Adding and removing subtables
is usually less frequent than adding and removing rules, so this
should not have a visible performance implication. As an optimization
for the userspace datapath use, where all the subtables have the same
priority, new subtables can be added to the end of the vector without
reallocation and without disturbing readers.
cls_subtables_reset() is now removed, as it served its purpose in bug
hunting. Checks on the new pvector are now incorporated into
tests/test-classifier.c.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>