Ben Pfaff [Mon, 21 Mar 2011 19:49:44 +0000 (12:49 -0700)]
bridge: Drop LACP configuration members from struct iface and struct port.
There's no reason that I can see to maintain this information in struct
port and struct iface. It's redundant, since the lacp implementation
maintains the same information.
Ben Pfaff [Mon, 21 Mar 2011 20:15:31 +0000 (13:15 -0700)]
lacp: Fix misleading prototype for lacp_configure().
Only the first 6 bytes (ETH_ADDR_LEN) of the 'sys_id' argument are used,
but the prototype declared it as an array of 8 bytes. This has no effect
on the generated code--the declared size of an array parameter is
irrelevant--but it is misleading.
Also, add 'const' since the array is not modified.
Ben Pfaff [Thu, 24 Mar 2011 20:35:15 +0000 (13:35 -0700)]
packets: Fix potential use-after-free in compose_benign_packet().
The second call to ofpbuf_put_zeros() could cause the 'eth' pointer to
be invalidated.
It appears that this does not fix a real bug because the existing callers
all preallocate 128 bytes of tailroom, but the interface doesn't document
that requirement.
Ben Pfaff [Fri, 1 Apr 2011 22:46:22 +0000 (15:46 -0700)]
ovsdb-server: Avoid intermittent test failures due to lockfile log message.
Sometimes lockfile will emit a message saying that it took a little while
to get the lock, which caused spurious test failures. This commit
suppresses the message. With this change, I was able to run these tests
continuously for some time without failures.
This was a bug in the testsuite, not in the code under test.
Ethan Jackson [Fri, 1 Apr 2011 20:10:49 +0000 (13:10 -0700)]
cfm: Allow time for CCM reception after cfm_configure();
Before this (and the previous) patch, whenever cfm_configure was
called it would set the fault_timer to expired. Thus, the next
call to cfm_run would notice a lack of CCM reception and trigger a
faulted status. This is a bug in and of itself, but normally would
not be a big deal because cfm_configure should only be called
infrequently (when the database changes). However due to an
unrelated bug, cfm_configure() was getting called approximately once
per second. This resulted in all monitors showing faults all of
the time.
This patch fixes the problem by not expiring the timer at
cfm_configure(). Instead it gives it the appropriate
fault_interval amount of time to miss heartbeats.
Ethan Jackson [Fri, 1 Apr 2011 20:22:44 +0000 (13:22 -0700)]
cfm: cfm_configure() only update when necessary.
Calling cfm_configure often could cause timers to be reset
resulting in unexpected behavior. This commit only updates when
cfm configuration actually changed.
Ben Pfaff [Mon, 28 Mar 2011 20:05:40 +0000 (13:05 -0700)]
ovsdb: Truncate bad transactions from database log.
When ovsdb-server reads a database file that is corrupted at the
transaction level (that is, the transaction is valid JSON and has the
correct SHA-1 hash, but it does not describe a valid database transaction),
then ovsdb-server should truncate it and overwrite it by valid
transactions. However, until now, it didn't. Instead, it would keep the
invalid transaction and possibly every transaction in the database file
(depending on in what way the transaction was invalid), which would just
cause the same trouble again the next time the database was read.
This fixes the problem. An invalid transaction will be deleted from the
database file at the first write to the database.
Ben Pfaff [Mon, 28 Mar 2011 19:57:20 +0000 (12:57 -0700)]
ovsdb: Check that ovsdb-server truncates corrupted database logs.
When ovsdb-server reads a database that is corrupted at the log level
(that is, when ovsdb_log detects the corruption by checking the SHA-1 hash
of the record or JSON parser error reporting), then writing to the database
should discard the corrupted data and thereby fix the problem for future
ovsdb-server runs.
This already worked OK. This just adds an extra test.
Ben Pfaff [Mon, 28 Mar 2011 19:59:18 +0000 (12:59 -0700)]
ovsdb: Raise database corruption log level from warning to error.
If there's database corruption then it indicates that something went wrong,
e.g. the machine was powered-off by power failure. It's definitely
something that the admin should know about. This sounds like an error to
me, so use that log level.
Ben Pfaff [Thu, 31 Mar 2011 23:43:43 +0000 (16:43 -0700)]
ovsdb: Force strong references to non-root tables to be persistent.
When a strong reference to a non-root table is ephemeral, the database log
can contain inconsistencies. In particular, if the column in question is
the only reference to a row, then the row will be created in one logged
transaction but the reference to it will not be logged (because it is
ephemeral). Thus, any later occurrence of the row later in the log (to
modify it, to delete it, or just to reference it) will yield a transaction
error and reading the database will abort at that point.
This commit fixes the problem by forcing any column with a strong reference
to a non-root table to be persistent.
The change to ovsdb_schema_from_json() looks bigger than it really is: it
just swaps the order of two operations on the schema and updates their
comments. Similarly for the update to ovs.db.DbSchema.__init__().
Ben Pfaff [Mon, 28 Mar 2011 17:48:36 +0000 (10:48 -0700)]
ovsdb-types: Fix bug in ovsdb_base_type_is_ref().
This function only worked properly inside OVSDB itself, because that is
the only place where the 'refTable' member of ovsdb_base_type is set.
Both inside and outside OVSDB, 'refTableName' is set for reference types,
so it's better to check for that.
This doesn't fix any existing bug because this function was only used
inside OVSDB until now.
Ben Pfaff [Fri, 25 Mar 2011 22:26:30 +0000 (15:26 -0700)]
Convert shash users that don't use the 'data' value to sset instead.
In each of the cases converted here, an shash was used simply to maintain
a set of strings, with the shash_nodes' 'data' values set to NULL. This
commit converts them to use sset instead.
Ben Pfaff [Wed, 30 Mar 2011 20:44:10 +0000 (13:44 -0700)]
sset: New data type for a set of strings.
Many uses of "shash" or "svec" data structures really call for a "set of
strings" data type. This commit introduces such a data structure. Later
commits convert inappropriate uses of shash and svec to use sset instead.
Ethan Jackson [Thu, 31 Mar 2011 20:46:04 +0000 (13:46 -0700)]
lib: Create new timer library.
Scattered throughout the code base we use long integers to
implement timers. When the result of timer_msec() is greater than
the time stored, we preform some action.
This commit creates a new timer library intended to replace these
manually managed timers. Code using the timer library will be more
obviously correct, and more consistent with other code using the
library.
Ben Pfaff [Thu, 31 Mar 2011 21:11:57 +0000 (14:11 -0700)]
ofproto: Fix order of destruction in ofproto_destroy().
ofproto_flush_flows() calls into the connmgr (via connmgr_flushed()) so
it must be called before destroying the connmgr to avoid a use-after-free
error.
Ben Pfaff [Wed, 2 Mar 2011 21:39:59 +0000 (13:39 -0800)]
ofpbuf: Make ofpbufs initialized with ofpbuf_use_stack() not expandable.
My original intent for ofpbufs initialized with ofpbuf_use_stack() was that
the caller was providing enough space on the stack for the common case,
with dynamic allocation as a fallback. But in practice, none of the
clients actually do this. Instead, all of them actually know that the
stack-allocated buffer is big enough and, since they don't want to bother
with having to call ofpbuf_delete(), they instead assert that the buffer
wasn't reallocated.
Since this is a bit of a pain, this commit changes the semantics of
ofpbuf_use_stack() to be that the stack-allocated buffer cannot be
reallocated at all. This is more convenient for the existing clients.
Ben Pfaff [Wed, 30 Mar 2011 21:54:26 +0000 (14:54 -0700)]
datapath: Fix mysterious GRE-over-IPSEC problems.
We've noticed that packets that go up to userspace and then back down to
the kernel and then enter an GRE tunnel that is then ESP encapsulated
by IPSEC end up with a bad ESP "next header" value: it ends up as zero
instead of 0x2f (IPPROTO_GRE). Just putting packets from userspace into
a freshly allocated skb fixes the problem.
The underlying problem that this works around is still unknown.
Signed-off-by: Ben Pfaff <blp@nicira.com> Acked-by: Jesse Gross <jesse@nicira.com>
Bug #4769.
Ben Pfaff [Mon, 28 Mar 2011 20:43:32 +0000 (13:43 -0700)]
timeval: Only log poll intervals longer than 50 ms.
When poll interval-based logging was introduced a long time, we were
actively interested in looking at almost every long poll interval. But
these days, with OVS working rather well, with pretty good latency, most
of the messages are red herrings that bother some administrators and
provoke false reports. So this commit suppresses all but the most
egregious long poll intervals that may in fact be worth looking at.
Ben Pfaff [Tue, 29 Mar 2011 17:08:16 +0000 (10:08 -0700)]
bridge: Always wait for MAC learning table and ports.
The test ofproto_has_primary_controller() is meaningless, since OFPP_NORMAL
can cause the MAC learning table and port bonding to be in use even when
there is a controller.
I see that this bug has been here since early 2009, when the OFPP_NORMAL
feature was introduced in the bridge. (Obviously it's not a severe
problem.)
Ben Pfaff [Mon, 28 Mar 2011 23:22:59 +0000 (16:22 -0700)]
xenserver: Wait for ovs-xapi-sync to exit in "stop" command.
It seems possible that "restart" or a quick application of "stop" then
"start" could kill ovs-xapi-sync without starting it again, if
ovs-xapi-sync takes a little while to die, long enough for the next
instance of it to see that its pidfile is still open and locked.
I hope that this fixes some odd races that we've noticed in the "restart"
command.
Ethan Jackson [Mon, 28 Mar 2011 20:10:12 +0000 (13:10 -0700)]
cfm: No longer keep track of bad remote MPs and MAIDS.
Ben pointed out that an attacker could cause OVS to use infinite
memory by sending a series of CCMs with different MAIDs. Each
message would cause a remote_maid to be allocated and stored for
several seconds.
Since Commit 1c2e2d2fc8 (cfm: Don't report unexpected remote
endpoints) no longer reports unexpected remote MAIDS and MPs in the
database, the only reason to keep track of this information is for
debugging purposes. In my judgment, it provides negligible useful
debugging information at the expense of significantly increased
code complexity. This commit rips it out entirely.
Ethan Jackson [Fri, 25 Mar 2011 21:26:53 +0000 (14:26 -0700)]
cfm: Reduce missed CCM detection time.
The specification says that a fault should be signaled when 3.5 *
ccm_interval milliseconds have passed. This commit respects that
requirement, possibly increasing the responsiveness of fault
detection slightly.
Ethan Jackson [Fri, 25 Mar 2011 01:36:56 +0000 (18:36 -0700)]
cfm: Don't report unexpected remote endpoints.
Before this patch, CFM would report unexpected remote maintenance
points in the database. This commit no longer exposes this
information.
Information about precisely why a link is faulty is more interesting
to a system administrator debugging a problem than a controller
which will generally only care about whether or not a link is
faulty. For simplicity sake, this commit removes this information
from the database where it was somewhat awkwardly placed. In the
future it may be valuable to report the information through
ovs-appctl commands for debugging purposes.
Ethan Jackson [Fri, 25 Mar 2011 02:04:21 +0000 (19:04 -0700)]
cfm: Immediately signal fault on bad CCM reception.
Commit af5739857a (cfm: Immediately signal a fault upon receiving
an unexpected MPID.) caused the CFM library to immediately signal a
fault upon reception of an unexpected remote MPID. This commit
does the same for MAIDs, and remote maintenance points with invalid
CCM intervals.
Ethan Jackson [Tue, 22 Mar 2011 23:57:43 +0000 (16:57 -0700)]
cfm: cfm_run() return ccm instead of packet.
It doesn't really make sense for the CFM code to be composing
packets. Its caller is better placed to compose the appropriate
L2 header. This commit pulls that logic out of the CFM library.
Ethan Jackson [Wed, 23 Mar 2011 19:59:40 +0000 (12:59 -0700)]
packets: Create new compose_packet() function.
This commit generalizes compose_lacp_packet() into new
compose_packet() function. This new function will be used to send
CCM messages in future patches.
Ben Pfaff [Wed, 23 Mar 2011 17:34:37 +0000 (10:34 -0700)]
datapath: Add compatibility with sk_buff's vlan_tci before 2.6.33.
Between 2.6.27 and 2.6.32, the vlan_tci member of struct sk_buff was the
raw value of the 802.1Q header's TCI field, without the CFI bit being set.
In 2.6.33 and later, the CFI bit is always set if an 802.1Q header is
present, correcting a corner case.
Until now, OVS has not consistently dealt with this. If a packet arrived
at a datapath from a network device directly, or if it was set with an
ODP_ACTION_ATTR_SET_DL_TCI action, then the CFI bit would not be set in
vlan_tci. In flow_extract(), OVS copies vlan_tci directly to dl_tci in the
flow structure (via vlan_get_tci()), so the CFI bit would also not be set
in dl_tci. But if OVS had to send a packet up to userspace (converting the
vlan_tci back to an 802.1Q header along the way) and got it back, then it
would set the VLAN CFI bit in dl_tci when it parsed the 802.1Q header in
parse_vlan(). This had the effect that a flow set up by userspace (with
the CFI bit set) would never be matched by a packet arriving from a network
device, because they would have different dl_tci values.
This fixes the problem, by making the vlan_get_tci() and vlan_set_tci()
interface consistent across kernel versions. Now, they always accept or
return a value where the VLAN CFI bit is set if an 802.1Q header is
present.
Build-tested only.
Problem isolated by Ethan Jackson <ethan@nicira.com>
Ethan Jackson [Tue, 22 Mar 2011 20:30:15 +0000 (13:30 -0700)]
vswitchd: Destroy lacp in port_destroy().
Port destruction could cause dangling lacp objects to live in the
lacp module's 'all_lacps' list. This could cause bogus output for
the lacp/show appctl command.
Ben Pfaff [Mon, 21 Mar 2011 17:07:39 +0000 (10:07 -0700)]
bridge: Change bridge's 'ports' member from array to hmap.
In my opinion, this makes the code more obviously correct in many places,
because there are generally fewer variables. One must generally keep two
variables in sync for iterating through an array: the array index and the
contents of the array at that index. For iterating through an hmap, only
the map element is necessary.
A linked list would also be a reasonable choice for the bridge's collection
of ports. I chose to use an hmap because we already had an index by name
and it seemed OK to use only one index. I decided not to keep the shash
because they are less convenient for iteration than an hmap.
Ben Pfaff [Fri, 18 Mar 2011 22:42:29 +0000 (15:42 -0700)]
bridge: Avoid flushing entire MAC learning table for common operations.
Adding and removing ports is fairly common in a virtual environment,
because it happens whenever a VM boots or shuts down. It is best to
avoid flushing the whole MAC learning table when that happens, because
that means that, briefly, every packet will get flooded, wasting CPU cycles
and network bandwidth.
This commit breaks flushing the MAC table out of bridge_flush(). Instead,
each caller is now responsible for flushing the MAC table if it is
necessary. In a few cases, no flushing was necessary, so those callers
were not modified. In the case of removing a port or modifying its VLAN
assignments, it is necessary to expire all of the MAC learning entries
associated with that port, so this commit does that. Finally, some
operations do require a MAC learning flush but they are rare enough that
in my opinion it's not worth taking care to avoid a MAC table flush.
Ben Pfaff [Tue, 22 Mar 2011 16:47:02 +0000 (09:47 -0700)]
mac-learning: Refactor to increase generality.
In an upcoming commit I want to store a pointer in MAC learning entries
in the bridge, instead of an integer port number. The MAC learning library
has other clients, and the others do not gracefully fit this new model, so
in fact the data will have to become a union. However, this does not fit
well with the current mac_learning API, since mac_learning_learn()
currently initializes and compares the data. It seems better to break up
the API so that only the client has to know the data's format and how to
initialize it or compare it. This commit makes this possible.
This commit doesn't change the type of the data stored in a MAC learning
entry yet.
As a side effect this commit has the benefit that clients that don't need
gratuitous ARP locking don't have to specify any policy for it at all.
Ben Pfaff [Mon, 21 Mar 2011 17:24:32 +0000 (10:24 -0700)]
bridge: Change port's 'ifaces' member from array to list.
In my opinion, this makes the code more obviously correct in many places,
because there are generally fewer variables. One must generally keep two
variables in sync for iterating through an array: the array index and the
contents of the array at that index. For iterating through a list, only
the list element is necessary.
Ben Pfaff [Mon, 21 Mar 2011 20:25:02 +0000 (13:25 -0700)]
bridge: Get rid of "port_ifidx" in struct iface. Fix bonding hash.
This is a first step toward changing the array of ifaces in struct port
to a more suitable data structure.
As a side effect this fixes a bonding problem that I noticed via code
inspection. Before this commit, each bond_entry specified an interface
via index. If an iface was deleted, however, this shifted all of the
iface indexes, and the code didn't compensate for that. This commit fixes
the problem by using pointers to ifaces instead, which don't shift around.
Ben Pfaff [Fri, 18 Mar 2011 00:12:40 +0000 (17:12 -0700)]
bridge: Expire bond slave assignments when their loads decay to 0.
Until now, if a given MAC ever transmitted, then it would always show up
in bond information output. There's no benefit to that if the MAC has
gone away permanently. This commit causes them to be deleted when their
load has gone to 0. This takes a fairly long time: if a MAC has sent, say,
one million bytes and then stops transmitting entirely, then it will take
about 20 rebalancing intervals (200 seconds) before it decays to 0 and
gets deleted.
Ben Pfaff [Thu, 17 Mar 2011 23:29:55 +0000 (16:29 -0700)]
bridge: Improve name and comments for bond_entry's "iface_tag" member.
The iface_tag name and comment implied that it was really just a copy of
the 'tag' member of struct iface, but in fact it has a completely different
purpose: it represents the binding of a bond_entry to a particular iface.
It is invalidated when the bond_entry has to be redirected to a different
iface, not when the iface itself changes. I hope that this commit helps
to clarify.