Ryan Wilson [Wed, 21 May 2014 04:50:19 +0000 (21:50 -0700)]
ofproto: Remove per-flow miss hash table from upcall handler.
The upcall handler keeps a hash table which hashes flow to a list
of corresponding packets. This used to be necessary as packets with
the same flow had similar actions and calculating actions used to be
a performance bottleneck. Now that userspace action calculation
performance has improved, there is no need for this hash map.
This patch removes this hash map and each packet has its own upcall.
Signed-off-by: Ryan Wilson <wryan@nicira.com> Acked-by: Alex Wang <alexw@nicira.com>
Simon Horman [Tue, 20 May 2014 23:31:47 +0000 (08:31 +0900)]
datapath: 16bit inner_network_header field in struct ovs_gso_cb
The motivation for this is to create a 16bit hole in struct ovs_gso_cb
which may be used for the inner_protocol field which is needed
for the proposed implementation of compatibility for MPLS GSO segmentation.
This should be safe as inner_network_header is now an offset to
the inner_mac_header rather than skb->head.
As pointed out by Thomas Graf, simply making both inner offsets 16bit is not
safe as there have been cases of overflow with "with collapsed TCP frames
on IB when the headroom grew beyond 64K. See commit 50bceae9bd ``tcp:
Reallocate headroom if it would overflow csum_start'' for additional
details."
This patch is based on suggestions by Thomas Graf and Jesse Gross.
Cc: Thomas Graf <tgraf@suug.ch> Cc: Jesse Gross <jesse@nicira.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Jesse Gross <jesse@nicira.com>
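A minimal sketch of why a header-relative offset is safe in 16 bits while an offset from skb->head is not. The headroom and header positions below are made-up numbers chosen for illustration, not values taken from the patch.

```python
U16_MAX = 0xFFFF


def fits_u16(value):
    """True if value can be stored in an unsigned 16-bit field."""
    return 0 <= value <= U16_MAX


# Hypothetical layout: headroom that has grown beyond 64K.
headroom = 70_000
inner_mac_header = headroom + 50              # offset from skb->head
inner_network_header = inner_mac_header + 14  # 14-byte Ethernet header

# Offset measured from skb->head versus from the inner MAC header.
absolute = inner_network_header
relative = inner_network_header - inner_mac_header
```

With these numbers the absolute offset overflows a 16-bit field, while the relative offset is just the length of the inner MAC header.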
Ben Pfaff [Tue, 20 May 2014 23:51:42 +0000 (16:51 -0700)]
cmap: New module for cuckoo hash table.
This implements an "optimistic concurrent cuckoo hash", a single-writer,
multiple-reader hash table data structure. The point of this data
structure is performance, so this commit message focuses on performance.
I tested the performance of cmap with the test-cmap utility included in
this commit. It takes three parameters for benchmarking:
- n, the number of elements to insert.
- n_threads, the number of threads to use for searching and
mutating the hash table.
- mutations, the percentage of operations that should modify the
hash table, from 0% to 100%.
e.g. "test-cmap 1000000 16 1" inserts one million elements, uses 16
threads, and 1% of the operations modify the hash table.
Any given run does the following for both hmap and cmap
implementations:
- Inserts n elements into a hash table.
- Iterates over all of the elements.
- Spawns n_threads threads, each of which searches for each of the
elements in the hash table, once, and removes the specified
percentage of them.
- Removes each of the (remaining) elements and destroys the hash
table.
and reports the time taken by each step.
The tables below report results for various parameters with a draft version
of this library. The tests were not formally rerun for the final version,
but the intermediate changes should only have improved performance, and
this seemed to be the case in some informal testing.
n_threads=16 was used each time, on a 16-core x86-64 machine. The compiler
used was Clang 3.5. (GCC yields different numbers but similar relative
results.)
The results show:
- Insertion is generally 3x to 5x faster in an hmap.
- Iteration is generally about 3x faster in a cmap.
- Search and mutation is 4x faster in a cmap with .1% mutations and the
advantage grows as the fraction of mutations grows. This is
because a cmap does not require locking for read operations,
even in the presence of a writer.
With no mutations, however, no locking is required in the hmap
case, and the hmap is somewhat faster. This is because raw hmap
search is somewhat simpler and faster than raw cmap search.
- Destruction is faster, usually by less than 2x, in an hmap.
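For readers unfamiliar with cuckoo hashing, here is a single-threaded sketch of the basic idea only: every key has two candidate buckets, so a search looks at no more than two places. It is not the cmap implementation and omits everything that makes cmap safe for concurrent readers. The class name, the one-entry buckets and the grow-by-doubling policy are assumptions made for this illustration.

```python
class CuckooTable:
    """Toy cuckoo hash: each key lives in one of two candidate buckets."""

    def __init__(self, n_buckets=8, max_kicks=32):
        self.n = n_buckets
        self.max_kicks = max_kicks
        self.slots = [None] * n_buckets

    def _h1(self, key):
        return hash(("h1", key)) % self.n

    def _h2(self, key):
        return hash(("h2", key)) % self.n

    def find(self, key):
        # A search inspects at most two buckets.
        for i in (self._h1(key), self._h2(key)):
            entry = self.slots[i]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        # Overwrite the value if the key is already present.
        for i in (self._h1(key), self._h2(key)):
            entry = self.slots[i]
            if entry is not None and entry[0] == key:
                self.slots[i] = (key, value)
                return
        entry = (key, value)
        i = self._h1(key)
        for _ in range(self.max_kicks):
            if self.slots[i] is None:
                self.slots[i] = entry
                return
            # Evict the occupant and send it to its alternate bucket.
            entry, self.slots[i] = self.slots[i], entry
            h1, h2 = self._h1(entry[0]), self._h2(entry[0])
            i = h2 if i == h1 else h1
        self._grow(entry)

    def _grow(self, pending):
        # Too many displacements: double the table and reinsert everything.
        entries = [e for e in self.slots if e is not None] + [pending]
        self.n *= 2
        self.slots = [None] * self.n
        for k, v in entries:
            self.insert(k, v)
```

The displacement loop is what makes insertion more expensive than in a chained table, which is consistent with the slower insertion reported above.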
Alex Wang [Tue, 20 May 2014 21:16:54 +0000 (14:16 -0700)]
ovs-ctl: Raise the limit on the number of open file descriptors.
Since the removal of dispatcher thread, OVS creates 'n-handler-threads'
file descriptors for each bridge port. To allow more bridge ports
to be supported, this commit raises the limit on the number of open file
descriptors from 7500 to 65535.
Ben Pfaff [Tue, 20 May 2014 18:37:02 +0000 (11:37 -0700)]
dpif: Refactor flow dumping interface to make better sense for batching.
Commit a6ce4b9d251 (ofproto-dpif-upcall: Avoid use-after-free in
revalidate() corner case.) showed that it is somewhat tricky to correctly
use the existing dpif flow dumping interface to obtain batches of flows.
One has to be careful about calling dpif_flow_dump_next_may_destroy_keys()
before going on to the next flow.
A better interface is possible, one that is naturally oriented toward
retrieving batches when that is a useful optimization. This commit
replaces the dpif interface by such a design, and updates both the
implementations and the callers to adopt it.
This is a fairly large change, but I think that the code in
ofproto-dpif-upcall is easier to understand after the change.
Daniel Borkmann [Fri, 11 Apr 2014 21:34:06 +0000 (18:34 -0300)]
netinet: Add IPPROTO_IGMP definition
Add the definition of Internet Group Management Protocol.
Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Thomas Graf <tgraf@redhat.com> Signed-off-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
Thomas Graf [Thu, 10 Apr 2014 10:50:11 +0000 (12:50 +0200)]
ovs-vsctl: Add error column to show command
a425a102-c317-4743-b0ba-79d59ff04a74
Bridge "br0"
[...]
Port test
Interface test
type: vxlan
options: {unknown="1"}
error: "test: could not set configuration (Invalid argument)"
ovs_version: "2.1.90"
Signed-off-by: Thomas Graf <tgraf@redhat.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
Thomas Graf [Thu, 10 Apr 2014 10:50:10 +0000 (12:50 +0200)]
vswitchd: Add error column to Interface table to store error condition
Store the error condition of a failed port configuration in a new
column 'error' in the Interface table.
Example:
$ ovs-vsctl add-port br0 test -- \
set Interface test type=vxlan options:unknown=1
ovs-vsctl: Error detected while setting up 'test'. [...]
$ ovs-vsctl list Interface test | grep error
error : "test: could not set configuration (Invalid argument)"
Fixing the error will clear the error column:
$ ovs-vsctl set Interface test options:remote_ip=1.1.1.1
$ ovs-vsctl list Interface test | grep error
error : []
$
For now, the high level error messages when opening and configuring
the netdev are used. Further patches can extend passing the error
pointer into the individual netdev implementations to allow for more
fine grained error messages to be stored.
Signed-off-by: Thomas Graf <tgraf@redhat.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Mon, 19 May 2014 14:52:21 +0000 (07:52 -0700)]
acinclude.m4: Fix "sparse", via detection of GNU make "if" directive.
Make treats tabs very differently from spaces at the beginning of a line,
so this test must use a tab instead of a space. This partially reverts
commit a0903134d2d60 (acinclude.m4: Expand tabs).
Without this commit, the build system never enables checking with sparse
because it never detects that GNU make "if" works.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Joe Stringer [Thu, 8 May 2014 00:37:52 +0000 (12:37 +1200)]
tests: Check dpif-netdev odp_actions consistency.
Ensure that upcall key matches flow install and flow_dump for userspace
datapath. This was previously assumed, but not tested. This makes the
assumption more explicit.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
Joe Stringer [Fri, 9 May 2014 01:58:32 +0000 (13:58 +1200)]
odp-util: Always serialise recirculation in upcall key.
The userspace and kernel datapaths previously differed on their
treatment of the recirc_id and dp_hash fields when sending upcalls.
While the kernel datapath would always serialise these fields, the
userspace would not. When using the userspace datapath, this would cause
a mismatch between the odp flow key in an upcall compared to the one
that is serialised upon flow_dump.
This patch brings the userspace datapath behaviour back in line with the
kernel datapath by always serialising recirc_id and dp_hash to odp.
Signed-off-by: Joe Stringer <joestringer@nicira.com> Acked-by: Andy Zhou <azhou@nicira.com>
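The mismatch can be pictured with a toy serialiser. This is only a sketch: the field list, the key layout, and the assumption that the old userspace behaviour skipped the two fields when they were zero are all illustrative and not taken from the patch.

```python
def serialise_key(flow, always_include=True):
    """Return an ordered list of (attribute, value) pairs for a flow."""
    key = []
    for field in ("recirc_id", "dp_hash"):
        value = flow.get(field, 0)
        # Assumed old behaviour: omit the field when it is zero.
        if always_include or value != 0:
            key.append((field, value))
    key.append(("in_port", flow["in_port"]))
    return key


flow = {"in_port": 1, "recirc_id": 0, "dp_hash": 0}
old_upcall_key = serialise_key(flow, always_include=False)
new_upcall_key = serialise_key(flow)
dump_key = serialise_key(flow)
```

When both paths always emit the two fields, the key seen in an upcall and the key seen in a flow dump compare equal.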
Jarno Rajahalme [Fri, 16 May 2014 19:51:11 +0000 (12:51 -0700)]
Use prefix trie lookup for IPv4 by default.
Unless otherwise configured, the prefix trie lookup is enabled for
IPv4 destination and source address fields. A new keyword "none" is
accepted as the value of "prefixes" in the OVSDB Flow_Table column.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ethan Jackson <ethan@nicira.com>
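As background, a prefix trie stores address prefixes bit by bit so that the longest prefix covering an address can be found in a single walk. The sketch below shows only that data structure using Python's ipaddress module. The class and method names are hypothetical, and it does not reproduce how the classifier uses the trie.

```python
import ipaddress


class PrefixTrie:
    """Binary trie keyed on the leading bits of IPv4 prefixes."""

    def __init__(self):
        self.root = {}

    def insert(self, cidr):
        net = ipaddress.ip_network(cidr)
        bits = format(int(net.network_address), "032b")[:net.prefixlen]
        node = self.root
        for bit in bits:
            node = node.setdefault(bit, {})
        node["prefix"] = str(net)

    def longest_match(self, addr):
        bits = format(int(ipaddress.ip_address(addr)), "032b")
        node = self.root
        best = node.get("prefix")
        for bit in bits:
            node = node.get(bit)
            if node is None:
                break
            # Remember the deepest prefix seen so far.
            best = node.get("prefix", best)
        return best


trie = PrefixTrie()
trie.insert("10.0.0.0/8")
trie.insert("10.1.0.0/16")
```

Looking up 10.1.2.3 walks past the /8 node and stops at the /16 node, which is the longest match.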
Ryan Wilson [Fri, 16 May 2014 09:17:58 +0000 (02:17 -0700)]
netdev: Remove netdev from global shash when the user is changing interface configuration.
When the user changes port type (e.g. changing p0 from type 'internal' to
'gre'), the netdev must first be deleted, then re-created with the new type.
Deleting the netdev requires there exist no more references to the netdev.
However, the xlate cache holds references to netdevs and the cache is only
invalidated by revalidator threads. Thus, if cache is not invalidated prior to
the netdev being re-created, the netdev will not be able to be re-created and
the configuration change will fail.
This patch always removes the netdev from the global netdev shash when the
user changes port type. This ensures that the new netdev can always be created
while handler and revalidator threads can retain references to the old netdev
until they are finished.
Signed-off-by: Ryan Wilson <wryan@nicira.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
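The idea of unlinking an object from a name registry while outstanding references keep it alive can be sketched as follows. The dictionary stands in for the global netdev shash, and the function names are hypothetical, not the netdev API.

```python
class Netdev:
    def __init__(self, name, type_):
        self.name = name
        self.type = type_
        self.ref_cnt = 1


registry = {}  # stands in for the global name-to-netdev map


def open_dev(name, type_):
    """Return the registered device, or create and register a new one."""
    dev = registry.get(name)
    if dev is not None:
        if dev.type != type_:
            raise ValueError(f"{name}: already exists with type {dev.type}")
        dev.ref_cnt += 1
        return dev
    dev = Netdev(name, type_)
    registry[name] = dev
    return dev


def unlink_dev(dev):
    """Remove the device from the registry; existing holders keep it."""
    if registry.get(dev.name) is dev:
        del registry[dev.name]


old = open_dev("p0", "internal")
cached = open_dev("p0", "internal")  # e.g. a reference held by a cache
unlink_dev(old)                      # the user changes the port type
new = open_dev("p0", "gre")          # succeeds despite the cached reference
```

Without the unlink step, the last call would fail for as long as the cached reference to the old device existed.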
Andy Zhou [Sat, 10 May 2014 02:13:47 +0000 (19:13 -0700)]
ofproto-dpif: Install internal rule should not change the match content.
Without this patch, the match passed into
ofproto_dpif_add_internal_flow() is modified. The mask of dl_type will
always be converted from wildcarded match into exact match due to
calling rule_dpif_lookup_in_table(). The fix makes sure
ofproto_dpif_add_internal_flow() does not change the original match,
and makes the match passed as const in the
ofproto_dpif_add_internal_flow() API.
This bug prevents bond module from properly tracking the post
recirculation rules installed in the internal table. The existing rule
is always deleted followed by reinstalling of the same rule.
The observable behavior of the bug is that the bond module loses track
of the slave's stats after the slave is rebalanced, although traffic
flows through the slave just fine.
Simon Horman [Wed, 14 May 2014 07:19:35 +0000 (16:19 +0900)]
ovs-atomic: Remove atomic_uint64_t and atomic_int64_t.
Some concern has been raised by Ben Pfaff that atomic_uint64_t may not
be portable. In particular on 32bit platforms that do not have atomic
64bit integers.
Now that there are no longer any users of atomic_uint64_t remove it
entirely. Also remove atomic_int64_t which has no users.
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Ben Pfaff <blp@nicira.com>
Simon Horman [Wed, 14 May 2014 07:19:34 +0000 (16:19 +0900)]
ofproto-dpif-upcall: Use atomic_long in struct udpif
Some concern has been raised by Ben Pfaff that atomic_uint64_t may not
be portable. Accordingly, use atomic_ulong instead of atomic_uint64_t
in struct ofproto.
This is in preparation for removing atomic_uint64_t entirely.
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Thu, 15 May 2014 22:52:17 +0000 (15:52 -0700)]
ofproto-dpif-upcall: Avoid use-after-free in revalidate() corner cases.
The loop in revalidate() needs to ensure that any data obtained from
dpif_flow_dump_next() is used before it is destroyed, as indicated by
dpif_flow_dump_next_may_destroy_keys(). In the common case, where
processing reaches the end of the main "while" loop, it does this, but
in two corner cases the code in the loop executes "continue;", which skipped
the check. This commit fixes the problem.
Bug #1249988. Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Joe Stringer <joestringer@nicira.com>
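The general hazard is an end-of-iteration step that an early "continue" bypasses. The sketch below illustrates that pattern in Python, where a try/finally guarantees the step runs on every path. It is an analogy with hypothetical names, not the C fix applied in revalidate().

```python
def process(items, skip):
    """Handle items, running the end-of-iteration check on every path."""
    handled, checked = [], []
    for item in items:
        try:
            if item in skip:
                continue  # corner case: leave the iteration early
            handled.append(item)
        finally:
            # Runs even when 'continue' is taken above.
            checked.append(item)
    return handled, checked


handled, checked = process([1, 2, 3], skip={2})
```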
Simon Horman [Thu, 15 May 2014 00:05:03 +0000 (09:05 +0900)]
datapath: sample action without side effects
The sample action is rather generic, allowing arbitrary actions to be
executed based on a probability. However its use, within the Open vSwitch
code-base is limited: only a single user-space action is ever nested.
A consequence of the current implementation of sample actions is that
depending on whether the sample action executed (due to its probability)
any side-effects of nested actions may or may not be present before
executing subsequent actions. This has the potential to complicate
verification of valid actions by the (kernel) datapath. And indeed adding
support for push and pop MPLS actions inside sample actions is one
such case.
In order to allow all supported actions to continue to be nested inside
sample actions without the potential need for complex verification code
this patch changes the implementation of the sample action in the kernel
datapath so that sample actions are more like a function call and any side
effects of nested actions are not present when executing subsequent
actions.
With the above in mind the motivation for this change is twofold:
* To contain the side-effects of the sample action in the hope of making it
easier to deal with in the future and;
* To avoid some rather complex verification code introduced in the MPLS
datapath patch.
Some notes about the implementation:
* This patch silently changes the behaviour of sample actions whose nested
actions have side-effects. There are no known users of such sample
actions.
* sample() does not clone the skb for the only known use-case of the sample
action: a single nested userspace action. In such a case a clone is not
needed as the userspace action has no side effects.
Given that there are no known users of other nested actions and in order
to avoid the complexity of predicting if other sequences of actions have
side-effects in such cases the skb is cloned.
* As sample() provides a cloned skb in the unlikely case where there are
nested actions other than a single userspace action it is no longer
necessary to clone the skb in do_execute_actions() when executing a
recirculation action just because the keep_skb parameter is set: this
parameter was only set when processing the nested actions of a sample
action. Moreover it is possible to remove the keep_skb parameter of
do_execute_actions entirely.
* As sample() provides either a cloned skb or one that has had a
reference taken (using keep_skb) to do_execute_actions()
the original skb passed to sample() is never consumed. Thus the
caller of sample() (also do_execute_actions()) can use its generic
error handling to free the skb on error.
Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Jesse Gross <jesse@nicira.com>
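The "function call" semantics can be sketched with a toy action interpreter. The packet is modelled as a dictionary and the action tuples are invented for this illustration. Only the rule that nested actions run on a copy, except for a lone userspace action, follows the description above.

```python
import copy
import random


def execute(packet, actions, upcalls, rng=random.random):
    """Apply actions to packet; side effects of sampled actions stay local."""
    for action in actions:
        kind = action[0]
        if kind == "set":
            _, field, value = action
            packet[field] = value
        elif kind == "userspace":
            upcalls.append(dict(packet))  # hand a copy to userspace
        elif kind == "sample":
            _, probability, nested = action
            if rng() < probability:
                lone_userspace = (len(nested) == 1
                                  and nested[0][0] == "userspace")
                # A single userspace action has no side effects: no clone.
                target = packet if lone_userspace else copy.deepcopy(packet)
                execute(target, nested, upcalls, rng)
    return packet


pkt = {"ttl": 64}
upcalls = []
execute(pkt,
        [("sample", 1.0, [("set", "ttl", 1), ("userspace",)]),
         ("set", "mark", 7)],
        upcalls, rng=lambda: 0.0)
```

The nested "set" changes only the clone, so the action after the sample still sees the original ttl.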
tests: Change to parse dynamically allocated ports on windows.
In Windows, we use kernel assigned TCP port for ssl/tcp and
unixctl. In tests, we parse the log files of ovsdb-server.log,
test-sflow.log and test-netflow.log to get this port. In all
the above cases, tcp port is allocated first and then the unixctl port.
So a 'head -1' on the result should be safe.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Jarno Rajahalme [Thu, 15 May 2014 02:53:51 +0000 (19:53 -0700)]
lib/classifier: Simplify array ordering.
The terminology we used for subtable ordering ('splice', 'right
before') was inherited from an earlier use of a linked list, and
turned out to be confusing when applied to an array. Also, we only
ever move one subtable earlier or later within the array, so we can
simplify the code as well.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Some ovsdb-tool related unit tests fail with bad checksum errors
while reading transactions from database. It is most likely because
of the CR at the end of line. Using binary mode solves the problem.
Signed-off-by: Gurucharan Shetty <gshetty@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
Ben Pfaff [Sat, 10 May 2014 01:16:38 +0000 (18:16 -0700)]
nx-match: Refactor nxm_put_ip() to handle all IPv4 and IPv6 fields.
Until now, some fields have been handled in the caller, and the caller has
been responsible for distinguishing ICMPv4 from ICMPv6. This
implementation seems to make the code a little easier to understand.
Ben Pfaff [Thu, 8 May 2014 06:18:46 +0000 (23:18 -0700)]
Implement OpenFlow 1.5 port desc stats request.
OpenFlow 1.4 and earlier always send the description of every port in
response to an OFPMP_PORT_DESC request. OpenFlow 1.5 proposes allowing
the controller to request a description of a single port. This commit
implements a prototype.
Ben Pfaff [Thu, 8 May 2014 06:49:00 +0000 (23:49 -0700)]
Implement OpenFlow 1.5 group desc stats request.
OpenFlow 1.4 and earlier always send the description of every group in
response to an OFPMP_GROUP_DESC request. OpenFlow 1.5 proposes allowing
the controller to request a description of a single group. This commit
implements a prototype.
Ben Pfaff [Fri, 9 May 2014 21:12:06 +0000 (14:12 -0700)]
ofp-util: Remove ofputil_get_phy_port_size().
The size is not fixed for OpenFlow 1.4 and later, so it's a little
deceptive to return any particular value. This function was only used in
one place, so move it inline there.
Ben Pfaff [Thu, 8 May 2014 06:35:35 +0000 (23:35 -0700)]
ofp-util: Reduce duplicate code.
ofputil_put_phy_port() and ofputil_append_port_desc_stats_reply() had a
lot of code duplication. This reduces it: it deletes some specialized
code from ofputil_put_phy_port(), moving it into its caller
ofputil_put_switch_features_port() that actually needed it. That change
then allows ofputil_append_port_desc_stats_reply() to become a simple
wrapper around ofputil_put_phy_port().
Ben Pfaff [Sat, 10 May 2014 02:29:56 +0000 (19:29 -0700)]
ofp-util: Generalize functions for parsing OF1.3+ properties.
The main effect is to move these functions a little earlier in the file.
Also, OpenFlow 1.4 changed the table-features specific error codes to new
values that apply to all property sets, so this commit updates the error
code names and adds the appropriate OpenFlow 1.4+ codes.
Ben Pfaff [Thu, 8 May 2014 04:39:00 +0000 (21:39 -0700)]
ofp-util: Remove ofputil_count_phy_ports().
It's harder to calculate the number of ports in a given amount of space in
OpenFlow 1.4 and later, because the ofp_port structure becomes variable
length in those versions. This commit removes the one caller, replacing
it by a version that doesn't need to know the number of ports in advance.
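When records are variable length, a caller can walk the buffer record by record instead of dividing the buffer size by a fixed record size. The sketch below assumes a made-up layout (a 2-byte big-endian total length followed by the payload), which is not the ofp_port wire format.

```python
import struct


def iter_records(buf):
    """Yield payloads from records of the form: u16 total length + payload."""
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < 2:
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from("!H", buf, offset)
        if length < 2 or offset + length > len(buf):
            raise ValueError("bad record length")
        yield buf[offset + 2:offset + length]
        offset += length


buf = struct.pack("!H", 5) + b"abc" + struct.pack("!H", 3) + b"z"
records = list(iter_records(buf))
```

The number of records is known only once the walk finishes, which is why a caller that needs the count up front does not fit this kind of format.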
Ben Pfaff [Fri, 9 May 2014 04:20:22 +0000 (21:20 -0700)]
ovs-ofctl: Fix port lookup and "ovs-ofctl" behavior for OpenFlow 1.3+.
ovs-ofctl supports using port names in commands that operate on ports. It
does this by connecting to the switch, listing the ports, and picking out
the one with the specified name. However, this didn't work properly for
OpenFlow 1.3+, because it always used an OFPT_FEATURES_REQUEST to list the
ports, and in OpenFlow 1.3+ the reply to this request does not include a
list of ports. This commit fixes the problem (using code that previously
was just a fallback when there were too many ports to fit in an
OFPT_FEATURES_REPLY).
For similar reasons, "ovs-ofctl show" wasn't listing the switch's ports
when it connected to a switch over OpenFlow 1.3 or later. This commit
fixes that bug also.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Conflicts:
utilities/ovs-ofctl.c
Anoob Soman [Wed, 14 May 2014 13:32:16 +0000 (14:32 +0100)]
ofproto-dpif-xlate: Fix null pointer dereference
actions (in xlate_actions__) would be NULL when xlate_actions()
is called from packet_out()->ofproto_dpif_execute_actions().
This causes a NULL pointer to be dereferenced when
ctx.xbridge->netflow is set.
Signed-off-by: Anoob Soman <anoob.soman@citrix.com> Signed-off-by: Ben Pfaff <blp@nicira.com>
Simon Horman [Tue, 13 May 2014 05:46:18 +0000 (14:46 +0900)]
datapath: Free skb(s) on recirculation error
This patch attempts to ensure that skb(s) are always freed (once)
if an error occurs in execute_recirc(). It does so in two ways:
1. Ensure that execute_recirc() always consumes the skb passed to it.
Specifically, free the skb if the call to ovs_flow_extract() fails.
2. Return from the recirc case in execute_recirc() whenever
the skb has not been cloned immediately above.
This is only the case if the action is both the last action and the
keep_skb parameter of execute_recirc is not true. As it is the last
action and the skb is consumed one way or another by execute_recirc() it
is correct to return here. In particular this avoids the possibility of
the skb, which has been consumed by execute_recirc(), being freed a second time.
Conversely if this is not the case then the skb has been cloned
and the clone has been consumed by execute_recirc().
This leads to three sub-cases:
* If execute_recirc() returned an error code then the original skb
will be freed by the error handling code below the case statement in
do_execute_actions().
* If this is not the last action then action processing will continue,
using the original skb.
* If this is the last action then it must also be the case that keep_skb
is true (otherwise the skb would not have been cloned). Thus
do_execute_actions() will return without freeing the original skb.
Signed-off-by: Simon Horman <horms@verge.net.au>
[jesse: use kfree_skb instead of consume_skb on error path] Signed-off-by: Jesse Gross <jesse@nicira.com>
Jarno Rajahalme [Mon, 12 May 2014 06:38:44 +0000 (23:38 -0700)]
lib/classifier: Fix array splicing.
Array splicing was broken when multiple elements were being moved,
resulting in the priority order being mixed. This came up when the
highest priority rule from a subtable was removed and the subtable
needed to be moved down the priority list by more than one position.
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com> Acked-by: Ben Pfaff <blp@nicira.com>
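A sketch of the operation that has to work here: one element of a priority-ordered array moves by several positions while the relative order of the others is preserved. The list of (name, priority) pairs and the function name are hypothetical and do not reflect the classifier's own data structures.

```python
def reposition(subtables, index):
    """Move subtables[index] to where its priority belongs.

    subtables is a list of (name, max_priority) pairs kept in descending
    priority order; only the element at index may be out of place.
    """
    item = subtables.pop(index)
    pos = 0
    while pos < len(subtables) and subtables[pos][1] >= item[1]:
        pos += 1
    subtables.insert(pos, item)
    return subtables


tables = [("a", 100), ("b", 80), ("c", 60), ("d", 40)]
# The highest-priority rule of "a" is removed; its maximum drops to 50.
tables[0] = ("a", 50)
reposition(tables, 0)
```

Here "a" moves down two positions, and "b", "c" and "d" keep their relative order.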
Thomas Graf [Thu, 8 May 2014 18:45:25 +0000 (20:45 +0200)]
ovs-ctl: Don't decrease max open fds if already set higher
A user may set LimitNOFILE through systemd or other means to set
the maximum number of open file descriptors. Only modify the ulimit
if not already set to a higher value by the user.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Andy Gospodarek <gospo@redhat.com>
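The rule "raise the limit, but never lower it" can be sketched with Python's resource module. This illustrates the logic only and is not the ovs-ctl shell code. The function names are hypothetical, and the default of 65535 is the value mentioned in the earlier ovs-ctl commit.

```python
import resource


def desired_soft_limit(current_soft, wanted, hard):
    """Return the soft limit to set: never below current, never above hard."""
    if current_soft == resource.RLIM_INFINITY:
        return current_soft
    target = max(current_soft, wanted)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    return target


def raise_nofile_limit(wanted=65535):
    """Apply the rule above to this process's open file descriptor limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = desired_soft_limit(soft, wanted, hard)
    if new_soft != soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft
```

A process whose limit was already raised to, say, 100000 keeps that value, while one at 1024 is raised to 65535 if the hard limit allows it.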
Andy Zhou [Wed, 7 May 2014 05:31:00 +0000 (22:31 -0700)]
bond: raise minimal bond distribution per interface
Raise the minimal per interface packet distribution from 7 to 24.
With 256 packets distributed to 3 interfaces, the expected packets per
interface should be 256/3 = 85.3.
Tested with 200 runs, the average number of packets sent to a single
interface is 85.9, close to the expected number. The standard deviation
within the 200 runs is 24.4. Tested with 2x standard deviation with
10K test runs, got around 0.1% failure rate. 2.5x standard deviation
passes 100K test runs without failure.
Using 2.5x for the unit test, 85.3 - 2.5 * 24.4 = 24.3, rounded down to the
whole number 24.
Signed-off-by: Andy Zhou <azhou@nicira.com> Reviewed-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
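The threshold arithmetic can be checked in a few lines. The standard deviation is the measured figure quoted above. The variable names are only for illustration.

```python
import math

packets = 256
interfaces = 3
std_dev = 24.4  # measured over 200 runs, per the commit message

expected = packets / interfaces                   # about 85.3
threshold = math.floor(expected - 2.5 * std_dev)  # minimum packets per interface
```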
Joe Stringer [Sun, 4 May 2014 22:14:18 +0000 (10:14 +1200)]
tunnel: Fix bug where misconfiguration persists.
Previously, misconfiguring a tunnel port to use the exact same settings
would cause the corresponding netdev to never be destroyed. When
attempting to re-use the port as a different type, this would fail and
result in a discrepancy between reported port type and actual netdev in
use.
An example configuration that would previously give unexpected behaviour:
ovs-vsctl add-port br0 p0 -- set int p0 type=gre options:remote_ip=1.2.3.4
ovs-vsctl add-port br0 p1 -- set int p1 type=internal
ovs-vsctl set int p1 type=gre options:remote_ip=1.2.3.4
ovs-vsctl set int p1 type=internal
The final command would report in the ovs-vswitchd logs that it is
attempting to configure the port with the same gre settings as p0,
despite the command specifying the type as internal. Even after
deleting and re-adding the port, the message would reappear.
This patch fixes the bug by dereferencing the netdev in the failure
case of tnl_port_add__(), and ensures that the tnl_port structure is
freed in that case as well.