git.proxmox.com Git - ovs.git/log

connmgr: Fix a typo.

Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>

tests: Add another test for extended registers.

ONF-JIRA: EXT-244
Suggested-by: Jean Tourrilhes <jt@hpl.hp.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jean Tourrilhes <jt@hpl.hp.com>

Add more files to the openvswitch library on MSVC

Add netlink related files to the windows build.

Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

netlink-socket: Adapt to Windows and MSVC.

Add two functions set_sock_pid_in_kernel and portid_next. This will allow
the channel identification for the kernel extension to send back messages.

Replace send with WriteFile equivalent and ignore nl_sock_drain for the moment
under MSVC.

Replace sendmsg and recvmsg with ReadFile and WriteFile equivalents.

On MSVC put in handle instead of fd(sock->fd becomes sock->handle).

Creation of the netlink socket will be replaced by CreateFile equivalent.

Add MAX_STACK_LENGTH for MSVC. This will be our maximum size for on-stack
copy buffer.

Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

netlink-socket: Allow compiling on MSVC even without HAVE_NETLINK.

Bypass the error compilation when compiling under MSVC.

Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

netlink-protocol: Add more definitions.

Add MAX_LINKS define needed for nl_pool.

Add NLM_F_CREATE and NLM_F_EXCL defines also.

Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

Add defines, enums and headers for MSVC

Add defines needed to compile netlink-socket.c and netlink.c.

Add a wrapper and the functionality behind it for syconf.

Add the newly created files to the noinst_HEADERS in windows/automake.mk

Signed-off-by: Alin Gabriel Serdean <aserdean@cloudbasesolutions.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

cmap: Merge CMAP_FOR_EACH_SAFE into CMAP_FOR_EACH.

There isn't any significant downside to making cmap iteration "safe" all
the time, so this drops the _SAFE variant.

Similar changes to CMAP_CURSOR_FOR_EACH and CMAP_CURSOR_FOR_EACH_CONTINUE.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

netlink-socket: Add conceptual documentation.

Based on a conversation with the VMware Hyper-V team earlier today.

This commit also changes a couple of functions that were only used with
netlink-socket.c into static functions. I couldn't think of a reason for
code outside that file to use them.

Signed-off-by: Ben Pfaff <blp@nicira.com>

datapath: Use currect rcu API in exact match flow lookup function.

exact match cache lookup is always done under ovs lock. So
use ovsl_dereference() API for rcu access.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>

connmgr: Only send role status messages to OpenFlow 1.4+ controllers.

Only OpenFlow 1.4 and later support role status messages, but this code
tried to send them to all controllers, which caused an assertion failure.

Also, add tests to check that role status messages work, and that they
don't cause trouble with OF1.2.

Reported-by: Anup Khadka <khadka.py@gmail.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

connmgr: Demote service controllers too in ofconn_set_role().

Service controllers can set their roles, so it's necessary to demote them
to slaves if another controller becomes master. Unfortunately the
'controllers' hmap only contains primary controllers, so this was omitted.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

ovs-ofctl: Add --unixctl command line option.

This matches the option offered by some other Open vSwitch daemons. I
intend to use it in tests in an upcoming commit.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

datapath-windows: Kernel module for HyperV.

The kernel switch extension has support for bridged back forwarding & tunneling
over VXLAN. There is no Netlink integration as it is still being worked out.

Co-Authored-By: Ankur Sharma <ankursharma@vmware.com>
Signed-off-by: Ankur Sharma <ankursharma@vmware.com>
Co-Authored-By: Eitan Eliahu <eliahue@vmware.com>
Signed-off-by: Eitan Eliahu <eliahue@vmware.com>
Co-Authored-By: Guolin Yang <gyang@vmware.com>
Signed-off-by: Guolin Yang <gyang@vmware.com>
Co-Authored-By: Linda Sun <lsun@vmware.com>
Signed-off-by: Linda Sun <lsun@vmware.com>
Co-Authored-By: Nithin Raju <nithin@vmware.com>
Signed-off-by: Nithin Raju <nithin@vmware.com>
Signed-off-by: Saurabh Shah <ssaurabh@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

meta-flow: Add 64-bit registers.

These 64-bit registers are intended to conform with the OpenFlow 1.5
draft specification.

EXT-244.
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

ofproto-dpif-upcall: Fix sparse warnings.

Fixes these warnings from "sparse":

../ofproto/ofproto-dpif-upcall.c:761:1: warning: symbol 'free_upcall' was
not declared. Should it be static?
../ofproto/ofproto-dpif-upcall.c:849:1: warning: symbol 'convert_upcall'
was not declared. Should it be static?

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

Fix two memory leaks.

Found by coverity.

Signed-off-by: yinpeijun <yinpeijun@huawei.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

Remove assumption that there are 64 or fewer fields.

An upcoming commit will increase the number of fields beyond 64.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

bitmap: Add new functions.

These will be used in an upcoming commit.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

meta-flow: Simplify handling of a variable number of registers.

At the time that Open vSwitch implemented registers, there was a high cost
to adding additional fields, so I wrote the code so that the number of
registers could be reduced at compile time.  Now, fields are cheaper
(though not free) and in the meantime I have never heard of anyone reducing
the number of registers.  Since I intend to add more code that would
require awkward "#if"s like this, I think that this is a good time to
simplify it by requiring FLOW_N_REGS to be fixed.  This commit does that.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

datapath: Fix buffer overrun in mask array realloc.

mask realloc copies elements from old array to new array. When
shrinking array it can go beyond allocated memory.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>

dpif-netdev: Polling threads directly call ofproto upcall functions.

Typically, kernel datapath threads send upcalls to userspace where
handler threads process the upcalls. For TAP and DPDK devices, the
datapath threads operate in userspace, so there is no need for
separate handler threads.

This patch allows userspace datapath threads to directly call the
ofproto upcall functions, eliminating the need for handler threads
for datapaths of type 'netdev'.

Signed-off-by: Ryan Wilson <wryan@nicira.com>
Signed-off-by: Ethan Jackson <ethan@nicira.com>
Acked-by: Ethan Jackson <ethan@nicira.com>

datapath: Add NULL check for mask pointer.

There is race in datapath when last mask in mask array deleted can
result in NULL pointer dereference in datapath.
datapath lookup does not check mask pointer if its index is less than
mask-array count. That is safe because delete operation moves last valid
pointer to the deleted element. But this does not work if we are
deleting last element in array. Following patch adds NULL check for the
mask pointer.
This patch also avoids accessing ma->count without any locks.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>

cfm: Reduce "long delay" message from WARN to INFO, to match BFD behavior.

These messages can cause the testsuite to fail on a busy build machine
since the testsuite treats WARN or ERR log messages as failures. BFD
uses an INFO message instead of WARN, so this just changes CFM to match.

Alternatively, the testsuite could ignore "long delay" messages (it ignores
some other categories of messages). In that case I'd expect that we'd
want to change BFD to match CFM since I don't know of a reason why they
should log differently.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

datapath: reorder action netlink attribute definition for upstreaming

Keeping the order of netlink attribute definition in the order of
upstreaming is the best way to keep all released user space program
forward compatible with upstreamed kernel modules.

Adjust action netlink attribute order to match with the current
upstreaming plan.

Recirc and hash actions are introduced in branch 2.3, which will be
fixed by the patch. The MPLS actions have been released since
branch-2.1 but there is no kernel implementation of them prior to
branch 2.3. Thus the ordering change should not affect them.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

netdev-provider: Clarify comment where 'get_next_hop' function looks.

Some readers thought it was looking in an ofproto- or netdev-specific
table.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Flavio Leitner <fbl@redhat.com>

datapath/flow_netlink: Avoid wildcarding tunnel key with disabled megaflows

If the userspace wants to match on a flow with some tunnel attributesset to 0,
it simply omits them in the netlink attributes stream.
Since our wildcarding logic (when megaflows are disabled) is based on the
attributes in the netlink stream, we set our mask incorrectly.

This commit adds a check to detect if the userspace wants to match on a tunnel,
in which case we simply unwildcard the whole tun_key

Reported-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>

datapath: flow_netlink: Fix a bug.

Commit 62974663fe (datapath/flow_netlink: Create right mask with
disabled megaflows) introduced the bug which caused
ovs_nla_get_match() returns immediately after parsing the flow
mask for OVS_KEY_ATTR_ENCAP. Consequently, when vlan encapsulated
packets are present, the corresponding datapath flows will have
incorrect mask like below. And the incorrect flows could affect
other non-vlan packets.

~/ovs# ovs-dpctl dump-flows
in_port(3/0xffff0000),eth_type(0x8100),encap(), packets:0,
bytes:0, used:never, actions:2

This commit fixes the bug by checking and handling the return
value of the parsing function correctly.

Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

dpif-netdev: Initialize upcall->packet when queuing to userspace.

Commit db73f7166a6 (netdev-dpdk: Fix race condition with DPDK mempools in
non pmd threads) switched to a new way of setting up 'upcall->packet', but
only initialized two of the fields in the packet. This could cause
core dumps and other strange behavior. In particular it caused failures in
several unit tests on XenServer.

This commit fixes the problem by initializing the entire ofpbuf.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

dpif-netdev: Reduce netdev_flow_key size

struct 'miniflow' already contains MINI_N_INLINE values, therefore
we can save few bytes in netdev_flow_key.

Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

netdev-dpdk: Increase tx queue size and rx batch size

These values has been found to give the best throughput
in simple cases (1 flow 64 bytes UDP packets).

Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

stp: Add more logging points for debug.

This commit adds more logging points in stp module for debugging.
Also, it makes the log print out the port name.

Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

netlink-notifier: Exit loop if nl_sock_recv() returns an error

An error from nl_sock_recv() could mean that there are some issues with the
netlink socket (EBADF, ENOTSOCK, ...). Calling nl_sock_recv() in this case is
harmful: nln_run() will never return and since we are calling it from the main
thread, vswitchd becomes unresponsive.
Also, with this commit we avoid calling the notifier callback in case of error
(except for ENOBUFS, which means that there could be too many notifications)

Suggested-by: Alex Wang <alexw@nicira.com>
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Acked-by: Alex Wang <alexw@nicira.com>

configure: Don't check for malloc hooks that we no longer use.

Commit 825da1c6d1c7 (leak-checker: Remove because it cannot be made
thread-safe.) removed the only uses of these hooks but neglected to remove
the test for them.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

ovsdb-server: Document RFC 7047 extensions to ovsdb <error>s.

Reported-by: Madhu Venugopal <vmadhu@redhat.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

ovs-ctl: Add comment to explain why we only save ofports for pre-1.10.

We've had a couple of questions about this lately, including one suggestion
that we should always save the OpenFlow port numbers. This explains why
the behavior is as it is.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

debian: Avoid -Wformat-zero-length warnings.

Debian puts an extra "-Wformat" in the CFLAGS following OVS's own
"-Wformat -Wno-format-zero-length", which therefore overrides
-Wno-format-zero-length, so this commit adds an extra
-Wno-format-zero-length to avoid those false positives.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

Use xstrdup() instead of strdup(), xmalloc() instead of malloc().

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Justin Pettit <jpettit@nicira.com>

datapath: Refactor get_dp() function into multiple access APIs.

Avoid recursive read_rcu_lock() by using the lighter weight get_dp_rcu()
API. Add proper locking assertions to get_dp().

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

tests: Wait for ofctl_monitor.log in ofproto-dpif controller.

Fixes a race in the test case.

Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

classifier: Refactor cls_cursor_advance() to make it easier to follow.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

cmap, classifier: Avoid unsafe aliasing in iterators.

CMAP_FOR_EACH and CLS_FOR_EACH and their variants tried to use void ** as
a "pointer to any kind of pointer".  That is a violation of the aliasing
rules in ISO C which technically yields undefined behavior.  With GCC 4.1,
it causes both warnings and actual misbehavior.  One option would to add
-fno-strict-aliasing to the compiler flags, but that would only help with
GCC; who knows whether this can be worked around with other compilers.

Instead, this commit rewrites the iterators to avoid disallowed pointer
aliasing.

VMware-BZ: #1287651
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

vswitchd/bridge: Fix setting default prefix fields.

The check for the need of default values was in the wrong place,
causing no prefix tracking to be used when database had no
configuration for a flow table. Missing configuration means that
defaults should be used.

To limit clutter on the log, we now log the prefix tracking
configuration when it is explicitly set in the database.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

ovs-atomic: Avoid evaluating arguments multiple times.

ovs-atomic-gcc4+ did expand arguments again, if locked set/get was
called.

Also fix atomic_is_lock_free().

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

lib/ovs-rcu: evaluate argument of ovsrcu_get only once.

As ovsrcu_get() looks like a function call, it is reasonable for the
callers to expect that the arguments are evaluated only once.
CONST_CAST expands its 'POINTER' argument multiple times, and the
exact effect of this turned out to be compiler dependent. Fix this by
expanding the macro argument before CONST_CAST, and removing
unnecessary CONST_CASTs.

VMware-BZ: #1287651

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

netdev-dpdk: Fix race condition with DPDK mempools in non pmd threads

DPDK mempools rely on rte_lcore_id() to implement a thread-local cache.
Our non pmd threads had rte_lcore_id() == 0. This allowed concurrent access to
the "thread-local" cache, causing crashes.

This commit resolves the issue with the following changes:

- Every non pmd thread has the same lcore_id (0, for management reasons), which
  is not shared with any pmd thread (lcore_id for pmd threads now start from 1)
- DPDK mbufs must be allocated/freed in pmd threads. When there is the need to
  use mempools in non pmd threads, like in dpdk_do_tx_copy(), a mutex must be
  held.
- The previous change does not allow us anymore to pass DPDK mbufs to handler
  threads: therefore this commit partially revert 143859ec63d45e. Now packets
  are copied for upcall processing. We can remove the extra memcpy by
  processing upcalls in the pmd thread itself.

With the introduction of the extra locking, the packet throughput will be lower
in the following cases:

- When using internal (tap) devices with DPDK devices on the same datapath.
  Anyway, to support internal devices efficiently, we needed DPDK KNI devices,
  which will be proper pmd devices and will not need this locking.
- When packets are processed in the slow path by non pmd threads. This overhead
  can be avoided by handling the upcalls directly in pmd threads (a change that
  has already been proposed by Ryan Wilson)

Also, the following two fixes have been introduced:
- In dpdk_free_buf() use rte_pktmbuf_free_seg() instead of rte_mempool_put().
  This allows OVS to run properly with CONFIG_RTE_LIBRTE_MBUF_DEBUG DPDK option
- Do not bulk free mbufs in a transmission queue. They may belong to different
  mempools

Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

netlink-socket: Do not make flow_dump block on netlink socket.

Commit 93295354 (netlink-socket: Simplify multithreaded dumping
to match Linux reality.) makes the call to recvmsg() block if no
messages are available. This can cause revalidator threads hanging
for long time or even deadlock when main thread tries to stop the
revalidator threads.

This commit fixes the issue by enabling the MSG_DONTWAIT flag in
the call to recvmsg().

Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

ovsdb: Don't add ovsdb-server.c to libovsdb.

Without this change, with shared libraries, VLOG
constructor for ovsdb-server would get called twice corrupting
the 'vlog_modules' list causing an infinite loop.

Signed-off-by: Gurucharan Shetty <gshetty@nicira.com>
Reported-by: Gur Stavi <gstavi@mrv.com>
Acked-by: Ben Pfaff <blp@nicira.com>

lib/classifier: Clarify subtable skipping.

Clarify comments for trie-based subtable skipping.

Perform the cheaper check first.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

lib/classifier: Return all matching prefix lengths from trie lookup.

Previously we only returned the last matching prefix length
encountered during a trie lookup, and skipped subtables that had
prefixes longer than that. This patch changes the trie lookup
functions to return all matching prefix lengths seen, so that all
non-matching prefix lengths can be skipped.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

lib/classifier: Change local variable names.

These stylistic changes makes the following patch a bit simpler.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

lib/classifier: Unify struct classifier and cls_classifier.

Now that it is clear that struct cls_classifier itself does not
need RCU indirection and pvector is defined in its own header, it
is possible get rid of the indirection from struct classifier to
struct cls_classifier.

Suggested-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

netdev-dpdk: Refactor dpdk_class_init()

The following changes were made:

- Since we have two dpdk classes, we should split the initial operations needed
  by both classes from the initialization needed by each class.
- The dpdk_ring class does not need an initialization function: it has been
  removed. This also prevents many testcase from failing, because
  dpdk_ring_class_init() was printing an unexpected log message
  (OVS_VSWITCHD_START at tests/ofproto-macros.at:54 check for a specific set of
  startup log messages)
- If the user doesn't pass the --dpdk option we do not register the dpdk*
  classes
- Do not call VLOG_ERR if there are 0 dpdk ethernet device. OVS can now be used
  with dpdk_ring devices.

Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

ofproto: Report controller rate limiting statistics in database.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Gurucharan Shetty <gshetty@nicira.com>

smap: New function smap_add_nocopy().

An upcoming commit will add a caller that needs to format both the key and
the value. That isn't cleanly possible with the current interface.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Gurucharan Shetty <gshetty@nicira.com>

Simplify ofproto_controller_info by using a struct smap in place of array.

The only client for ofproto_controller_info was transforming the array of
pairs into an smap anyway. It's easy for the code that fills in the array
to generate it as an smap directly, and it's also easier to extend later.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Gurucharan Shetty <gshetty@nicira.com>

pinsched: Report queued packet count correctly.

'n_txq' was initialized to 0 and never modified, so pinsched_count_txqlen()
always returned 0. Instead, return the correct number.

This only affected the results of "ovs-appctl memory/show", and only if
controller rate limiting was turned on, so it is not a serious bug.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Gurucharan Shetty <gshetty@nicira.com>

vswitch.xml: Fix typo in documentation.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Gurucharan Shetty <gshetty@nicira.com>

stp: Make stp-disabled port forward stp bpdu packets.

Commit 0d1cee123a84 (stp: Fix bpdu tx problem in listening state)
makes ovs drop the stp bpdu packets if stp is not enabled on the
input port.  However, when pif bridge is used and stp is enabled
on the integration bridge.  The flow translation of stp bpdu
packets will go through a level of resubmission which changes
the input port to the corresponding peer port.  Since, the
patch port on the pif bridge does not have stp enabled, the
flow translation will drop the bpdu packets.

This commit fixes the issue by making ovs forward stp bpdu packets
on stp-disabled port.

VMware-BZ: #1284695

Signed-off-by: Alex Wang <alexw@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Acked-by: Joe Stringer <joestringer@nicira.com>

cmap: Fix cmap_next_position()

cmap_next_position() didn't update the node pointer while iterating through a
list of nodes with the same hash.
This commit fixes the bug and improve test-cmap to detect it.

Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

netdev-dpdk: add dpdk rings to netdev-dpdk

Shared memory ring patch

This patch enables the client dpdk rings within the netdev-dpdk.  It adds
a new dpdk device called dpdkr (other naming suggestions?).  This allows
for the use of shared memory to communicate with other dpdk applications,
on the host or within a virtual machine.  Instructions for use are in
INSTALL.DPDK.

This has been tested on Intel multi-core platforms and with the client
application within the host.

Signed-off-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Maryam Tahhan <maryam.tahhan@intel.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

netlink-socket: Simplify multithreaded dumping to match Linux reality.

Commit 0791315e4d (netlink-socket: Work around kernel Netlink dump thread
races.) introduced a simple workaround for Linux kernel races in Netlink
dumps.  However, the code remained more complicated than needed.  This
commit simplifies it.

The main reason for complication in the code was 'status_seq' in nl_dump.
This member was there to allow a thread to wait for some other thread to
refill the socket buffer with another dump message (although we did not
understand the reason at the time it was introduced).  Now that we know
that Netlink dumps properly need to be serialized to work in existing
Linux kernels, there's no additional value in having 'status_seq',
because serialized recvmsg() calls always refill the socket buffer
properly.

This commit updates nl_msg_next() to clear its buffer argument on error.
This is a more convenient interface for the new version of the Netlink
dump code.  nl_msg_next() doesn't have any other callers.

Signed-off-by: Ben Pfaff <blp@nicira.com>

Factor the ovsdb-server main loop into a new function

This refactors the ovsdb-server main loop into a new function, which allows
to call it from multiple places.

Signed-off-by: Eric Sesterhenn <eric.sesterhenn@lsexperts.de>
Signed-off-by: Ben Pfaff <blp@nicira.com>

CodingStyle: Add suggested GNU indent options.

Suggested-by: Gerald Rogers <gerald.rogers@intel.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>

dpif: Update documentation for RCU-protected actions.

The userspace datapath returns RCU-protected actions from flow_get() and
flow_dump_next(). This doesn't cause any trouble for current users of
these functions, but it imposes additional constraints on their use.
This patch makes the dpif documentation more explicit about how the
results of these functions can be used.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

tests: Use the 'LARGE_MSECS' variation of time/warp at more places.

commit 8661af798(timeval: Provide a variation for time/warp command)
provided a variation of time/warp command to prevent multiple
invocations of time/warp. That commit changed only the users in bfd.at
and cfm.at as they used it the most. Since we haven't had any negative
confequences because of the change, add the remaining users.

Signed-off-by: Gurucharan Shetty <gshetty@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

debian: Automatically start openvswitch before first invocation of ovs-vsctl.

In the 'interfaces' file, if an admin adds the openvswitch interface
in 'auto', ifupdown will try to create OVS interfaces even before
openvswitch has started. In a case like that, assume that the admin
knows what he is doing and try to start openvswitch.

The negatives I see are
1. /usr is NFS mounted, in which case expect the admin
to mount it through initramfs.
2. syslog through network may not be accessible.

This is similar to what rhel does with commit 602453000e28ec10
(rhel: Automatically start openvswitch service before bringing an ovs
interface up or down)

Signed-off-by: Gurucharan Shetty <gshetty@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

debian: Option to create patch ports through 'interfaces'

This is a port of commit d7aab661 ( rhel: Add Patch Port support to
initscripts.) from rhel to debian's ifupdown script.

Signed-off-by: Gurucharan Shetty <gshetty@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

lib/coverage: Removed set but not used variables

This removes a GCC 4.9 warning (unused-but-set-variable)

Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

datapath: Check for NULL upcall_portids.

Following patch adds NULL check for memory allocated
by kmalloc.

Reported-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>

netlink-socket: Fix handling socket allocation failure in nl_dump_start().

If nl_pool_alloc() failed, then 'dump' was not initialized at all and
further use of the dump would access uninitialized data, probably causing
a crash.

Found by inspection.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Joe Stringer <joestringer@nicira.com>

netlink-socket: Refill comment to fit within 79 columns.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Joe Stringer <joestringer@nicira.com>

dpif-linux: Avoid null dereference if all ports disappear.

When dpif_linux_refresh_channels() refreshes the set of channels when
the number of handlers changes, it destroys all the dpif's channels and
sets dpif->uc_array_size to 0.  If the port dump later in the function
turns up no ports (which generally indicates a bug), then no channels will
be allocated and thus dpif->uc_array_size will remain 0 and 'channels' will
be null in each handler.  This is self-consistent, at least, but
dpif_linux_port_get_pid__() was still willing in this situation to
try to access element 0 of the set of channels, dereferencing a null
pointer.

This fixes the problem.

I encountered this while looking at a bug that I had introduced during
development that caused the port dump to always be empty.  It would be
difficult to encounter in normal use.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Joe Stringer <joestringer@nicira.com>

ofp-msgs: Correct code for queue configuration messages in OpenFlow 1.0.

Reported-by: Simon Jouet <simon.jouet@gmail.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

Fix documentation error that led user to wrong file to install dependency packages.

Signed-off-by: Kirkland Spector <kspector@salesforce.com>
Acked-by: Andrey Falko <afalko@salesforce.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>

Drop assignments whose values are never used.

Found by clang-analyzer.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

coverage: Move m_idx, h_idx to an inner scope in coverage_run().

These variables were initialized in an outer scope and then immediately
changed in an inner one, so they might as well be farther in.

Found by clang-analyzer.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

ofproto: Avoid theoretical double free of large rule collections.

collect_rules_strict() and collect_rules_loose() destroy the rule
collections that they create if they return an error, and some of their
callers then go on to destroy them again.  This could cause a double-free
in the case where rule_collection_destroy() actually calls free().  That
never happens in the current tree, because free() is only necessary if
malloc() was called and there's a 64-entry stub that none of the current
code in collect_rules_*() can fill up in their error cases.  Still, it
seems better to fix the problem.

Found by clang-analyzer.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

ofp-util: Fix null pointer dereference in ofputil_pull_buckets().

Found by clang-analyzer.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Jarno Rajahalme <jrajahalme@nicira.com>

revalidator: Revalidate missed flows.

If the datapath doesn't dump a flow for some reason, and the current
dump is expected to revalidate all flows in the datapath, then perform
revalidation for those flows by fetching them during the sweep phase.
If revalidation is not required, then leave the flow in the datapath and
don't revalidate it.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

dpif: Support fetching flow mask via dpif_flow_get().

Change the interface to allow implementations to pass back a buffer, and
allow callers to specify which of actions, mask, and stats they wish to
receive. This will be used in the next commit.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

ovsdb-server.at: Skip tests that use ovsdb-server's "--run" on Windows.

ovsdb-server's port on Windows does not support the "--run" option.
The two tests skipped in this commit make use of "--run" option to
test ovsdb-server's truncating of corrupt log or bad transaction.
It looks a little tricky to get this test running on Windows without
the "--run" option implemented.

Signed-off-by: Gurucharan Shetty <gshetty@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

datapath: Refactor ovs_flow_cmd_fill_info().

Split up ovs_flow_cmd_fill_info() to make it easier to cache parts of a
dump reply. This will be used to streamline flow_dump in a future patch.

Signed-off-by: Joe Stringer <joestringer@nicira.com>
Acked-by: Thomas Graf <tgraf@noironetworks.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

tests: Disable glibc memory checking under glibc <= 2.11.

We noticed that the unit tests sometimes fail on XenServer inside glibc's
memory checker, in the free_check() function. It turns out that the
glibc memory checker in glibc 2.11 and earlier had an internal race that
caused false positives in multithreaded programs.

This commit avoids the problem by disabling the glibc memory checker in
glibc 2.11 and earlier.

VMware-BZ: #1267127
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Gurucharan Shetty <gshetty@nicira.com>

datapath/flow_netlink: Create right mask with disabled megaflows

If megaflows are disabled, the userspace does not send the netlink attribute
OVS_FLOW_ATTR_MASK, and the kernel must create an exact match mask.

sw_flow_mask_set() sets every bytes (in 'range') of the mask to 0xff, even the
bytes that represent padding for struct sw_flow, or the bytes that represent
fields that may not be set during ovs_flow_extract().
This is a problem, because when we extract a flow from a packet,
we do not memset() anymore the struct sw_flow to 0 (since commit 9cef26ac6a71).

This commit gets rid of sw_flow_mask_set() and introduces mask_set_nlattr(),
which operates on the netlink attributes rather than on the mask key. Using
this approach we are sure that only the bytes that the user provided in the
flow are matched.

Also, if the parse_flow_mask_nlattrs() for the mask ENCAP attribute fails, we
now return with an error.

Reported-by: Alex Wang <alexw@nicira.com>
Suggested-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

datapath: Enable tunnel GSO features.

Following patch enables all available tunnel GSO features for OVS
bridge device so that ovs can use hardware offloads available to
underling device.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Andy Zhou <azhou@nicira.com>

lib/hash: Use CRC32 for hashing.

Use CRC32 intrinsics for hash computations when building for
X86_64 with SSE4_2.

Add a new hash_words64() and change hash_words() to be inlined when
'n_words' is a compile-time constant.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>

lib/classifier: Lockless lookups.

Now that all the relevant classifier structures use RCU and internal
mutual exclusion for modifications, we can remove the fat-rwlock and
thus make the classifier lookups lockless.

As the readers are operating concurrently with the writers, a
concurrent reader may or may not see a new rule being added by a
writer, depending on how the concurrent events overlap with each
other.  Overall, this is no different from the former locked behavior,
but there the visibility of the new rule only depended on the timing
of the locking functions.

A new rule is first added to the segment indices, so the readers may
find the rule in the indices before the rule is visible in the
subtables 'rules' map.  This may result in us losing the opportunity
to quit lookups earlier, resulting in sub-optimal wildcarding.  This
will be fixed by forthcoming revalidation always scheduled after flow
table changes.

Similar behavior may happen due to us removing the overlapping rule
(if any) from the indices only after the corresponding new rule has
been added.

The subtable's max priority is updated only after a rule is inserted
to the maps, so the concurrent readers may not see the rule, as the
updated priority ordered subtable list will only be visible after the
subtable's max priority is updated.

Similarly, the classifier's partitions are updated by the caller after
the rule is inserted to the maps, so the readers may keep skipping the
subtable until they see the updated partitions.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

lib/classifier: RCUify prefix trie code.

cls_set_prefix_fields() now synchronizes explicitly with the readers,
waiting them to finish using the old configuration before changing to
the new configuration.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

lib/classifier: Use internal mutex.

Add an internal mutex to struct cls_classifier, and reorganize
classifier internal structures according to the user of each field,
marking the fields that need to be protected by the mutex. This makes
locking requirements easier to track, and may make lookup more memory
efficient.

After this patch there is some double locking, as callers are taking
the fat-rwlock, and we take the mutex internally. A following patch
will remove the classifier fat-rwlock, removing the (double) locking
overhead.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

lib/classifier: Stylistic change.

Rename 'nbits' as 'n_bits' to be more consistent with other count-like
fields.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

lib/ovs-rcu: Export ovsrcu_synchronize().

A following patch will add the first external user.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

lib/classifier: Simplify iteration with C99 declaration.

Hide the cursor from the classifier iteration users and move locking to
the iterators. This will make following RCU changes simpler, as the call
sites of the iterators need not be changed at that point.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

lib/classifier: Use cmap.

Use cmap instead of hmap & hindex in classifier.

Signed-off-by: Jarno Rajahalme <jrajahalme@nicira.com>
Acked-by: Ben Pfaff <blp@nicira.com>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>

netlink-socket: Work around kernel Netlink dump thread races.

The Linux kernel Netlink implementation has two races that cause problems
for processes that attempt to dump a table in a multithreaded manner.

The first race is in the structure of the kernel netlink_recv() function.
This function pulls a message from the socket queue and, if there is none,
reports EAGAIN:
skb = skb_recv_datagram(sk, flags, noblock, &err);
if (skb == NULL)
goto out;
Only if a message is successfully read from the socket queue does the
function, toward the end, try to queue up a new message to be dumped:
if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
ret = netlink_dump(sk);
if (ret) {
sk->sk_err = ret;
sk->sk_error_report(sk);
}
}
This means that if thread A reads a message from a dump, then thread B
attempts to read one before A queues up the next, B will get EAGAIN.  This
means that, following EAGAIN, B needs to wait until A returns to userspace
before it tries to read the socket again.  nl_dump_next() already does
this, using 'dump->status_seq' (although the need for it has never been
explained clearly, to my knowledge).

The second race is more serious.  Suppose thread X and thread Y both
simultaneously attempt to queue up a new message to be dumped, using the
call to netlink_dump() quoted above.  netlink_dump() begins with:
mutex_lock(nlk->cb_mutex);

cb = nlk->cb;
if (cb == NULL) {
err = -EINVAL;
goto errout_skb;
}
Suppose that X gets cb_mutex first and finds that the dump is complete.  It
will therefore, toward the end of netlink_dump(), clear nlk->cb to NULL to
indicate that no dump is in progress and release the mutex:
nlk->cb = NULL;
mutex_unlock(nlk->cb_mutex);
When Y grabs cb_mutex afterward, it will see that nlk->cb is NULL and
return -EINVAL as quoted above.  netlink_recv() stuffs -EINVAL in sk_err,
but that error is not reported immediately; instead, it is saved for the
next read from the socket.  Since Open vSwitch maintains a pool of Netlink
sockets, that next failure can crop up pretty much anywhere.  One of the
worst places for it to crop up is in the execution of a later transaction
(e.g. in nl_sock_transact_multiple__()), because userspace treats Netlink
transactions as idempotent and will re-execute them when socket errors
occur.  For a transaction that sends a packet, this causes packet
duplication, which we actually observed in practice.  (ENOBUFS should
actually cause transactions to be re-executed in many cases, but EINVAL
should not; this is a separate bug in the userspace netlink code.)

VMware-BZ: #1283188
Reported-and-tested-by: Alex Wang <alexw@nicira.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Alex Wang <alexw@nicira.com>

netlink-socket: Fix sign of error code.

Commit 8f20fd98db (netlink-socket: Work around upstream kernel Netlink
bug.) got the sign of the error code wrong, so that it reported e.g. -22
for EINVAL to nl_sock_recv__()'s caller, instead of 22.

Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Alex Wang <alexw@nicira.com>

datapath: refactor do_output() to move skb_clone NULL check out of fast path

skb_clone() NULL check is implemented in do_output(), as past of the
common (fast) path. Refactoring so that NULL check is done in the
slow path, immediately after skb_clone() is called.

Besides optimization, this change also improves code readability by
making the skb_clone() NULL check consistent within OVS datapath
module.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

datapath: add skb_clone NULL check for the sampling action.

Fix a bug where skb_clone() NULL check is missing in sample action
implementation.

Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

datapath: add skb_clone NULL check in the recirc action.

Refactoring recirc action implementation.

The main change is to fix a bug where the NULL check after skb clone()
call is missing. The fix is to return -ENOMEM whenever skb_clone()
failed to create a clone.

Also rearranged adjacent code to improve readability.

Reported-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>

netlink-protocol: Remove definition of struct sockaddr_nl.

This struct is used only in netlink-socket.c, which is only used on Linux,
which in turn gets the definition from <linux/netlink.h>. On Windows the
definition actually causes a small amount of trouble because Windows does
not define sa_family_t (despite POSIX), so it's better to remove it.

Even if other platforms adopt Netlink, I have no reason to believe that
they will use the same sockaddr format as Linux.

CC: Saurabh Shah <ssaurabh@vmware.com>
Signed-off-by: Ben Pfaff <blp@nicira.com>
Acked-by: Gurucharan Shetty <gshetty@nicira.com>