doc: Populate 'topics' section

author Stephen Finucane <stephen@that.guru>

Thu, 8 Dec 2016 12:55:26 +0000 (12:55 +0000)

committer Ben Pfaff <blp@ovn.org>

Mon, 12 Dec 2016 16:57:06 +0000 (08:57 -0800)
diff --git a/DESIGN.rst b/DESIGN.rst

deleted file mode 100644 (file)

index adc407a..0000000
--- a/DESIGN.rst
+++ /dev/null
@@ -1,1163 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-================================
-Design Decisions In Open vSwitch
-================================
-
-This document describes design decisions that went into implementing Open
-vSwitch.  While we believe these to be reasonable decisions, it is impossible
-to predict how Open vSwitch will be used in all environments.  Understanding
-assumptions made by Open vSwitch is critical to a successful deployment.  The
-end of this document contains contact information that can be used to let us
-know how we can make Open vSwitch more generally useful.
-
-Asynchronous Messages
----------------------
-
-Over time, Open vSwitch has added many knobs that control whether a given
-controller receives OpenFlow asynchronous messages.  This section describes how
-all of these features interact.
-
-First, a service controller never receives any asynchronous messages unless it
-changes its ``miss_send_len`` from the service controller default of zero in one
-of the following ways:
-
-- Sending an ``OFPT_SET_CONFIG`` message with nonzero ``miss_send_len``.
-
-- Sending any ``NXT_SET_ASYNC_CONFIG`` message: as a side effect, this message
-  changes the ``miss_send_len`` to ``OFP_DEFAULT_MISS_SEND_LEN`` (128) for
-  service controllers.
-
-Second, ``OFPT_FLOW_REMOVED`` and ``NXT_FLOW_REMOVED`` messages are generated
-only if the flow that was removed had the ``OFPFF_SEND_FLOW_REM`` flag set.
-
-Third, ``OFPT_PACKET_IN`` and ``NXT_PACKET_IN`` messages are sent only to
-OpenFlow controller connections that have the correct connection ID (see
-``struct nx_controller_id`` and ``struct nx_action_controller``):
-
-- For packet-in messages generated by an ``NXAST_CONTROLLER`` action, the
-  controller ID specified in the action.
-
-- For other packet-in messages, controller ID zero.  (This is the default ID
-  when an OpenFlow controller does not configure one.)
-
-Finally, Open vSwitch consults a per-connection table indexed by the message
-type, reason code, and current role.  The following table shows how this table
-is initialized by default when an OpenFlow connection is made.  An entry
-labeled ``yes`` means that the message is sent; an entry labeled ``---`` means
-that the message is suppressed.
-
-.. table:: ``OFPT_PACKET_IN`` / ``NXT_PACKET_IN``
-
-  =========================================== ======= =====
-                                              master/
-           message and reason code            other   slave
-  =========================================== ======= =====
-  ``OFPR_NO_MATCH``                             yes    ---
-  ``OFPR_ACTION``                               yes    ---
-  ``OFPR_INVALID_TTL``                          ---    ---
-  ``OFPR_ACTION_SET`` (OF1.4+)                  yes    ---
-  ``OFPR_GROUP`` (OF1.4+)                       yes    ---
-  =========================================== ======= =====
-
-.. table:: ``OFPT_FLOW_REMOVED`` / ``NXT_FLOW_REMOVED``
-
-  =========================================== ======= =====
-                                              master/
-           message and reason code            other   slave
-  =========================================== ======= =====
-  ``OFPRR_IDLE_TIMEOUT``                        yes    ---
-  ``OFPRR_HARD_TIMEOUT``                        yes    ---
-  ``OFPRR_DELETE``                              yes    ---
-  ``OFPRR_GROUP_DELETE`` (OF1.4+)               yes    ---
-  ``OFPRR_METER_DELETE`` (OF1.4+)               yes    ---
-  ``OFPRR_EVICTION`` (OF1.4+)                   yes    ---
-  =========================================== ======= =====
-
-.. table:: ``OFPT_PORT_STATUS``
-
-  =========================================== ======= =====
-                                              master/
-           message and reason code            other   slave
-  =========================================== ======= =====
-  ``OFPPR_ADD``                                 yes    yes
-  ``OFPPR_DELETE``                              yes    yes
-  ``OFPPR_MODIFY``                              yes    yes
-  =========================================== ======= =====
-
-.. table:: ``OFPT_ROLE_REQUEST`` / ``OFPT_ROLE_REPLY`` (OF1.4+)
-
-  =========================================== ======= =====
-                                              master/
-           message and reason code            other   slave
-  =========================================== ======= =====
-  ``OFPCRR_MASTER_REQUEST``                     ---    ---
-  ``OFPCRR_CONFIG``                             ---    ---
-  ``OFPCRR_EXPERIMENTER``                       ---    ---
-  =========================================== ======= =====
-
-.. table:: ``OFPT_TABLE_STATUS`` (OF1.4+)
-
-  =========================================== ======= =====
-                                              master/
-           message and reason code            other   slave
-  =========================================== ======= =====
-  ``OFPTR_VACANCY_DOWN``                        ---    ---
-  ``OFPTR_VACANCY_UP``                          ---    ---
-  =========================================== ======= =====
-
-
-.. table:: ``OFPT_REQUESTFORWARD`` (OF1.4+)
-
-  =========================================== ======= =====
-                                              master/
-           message and reason code            other   slave
-  =========================================== ======= =====
-  ``OFPRFR_GROUP_MOD``                          ---    ---
-  ``OFPRFR_METER_MOD``                          ---    ---
-  =========================================== ======= =====
-
-The ``NXT_SET_ASYNC_CONFIG`` message directly sets all of the values in this
-table for the current connection.  The ``OFPC_INVALID_TTL_TO_CONTROLLER`` bit
-in the ``OFPT_SET_CONFIG`` message controls the setting for
-``OFPR_INVALID_TTL`` for the "master" role.
-
-``OFPAT_ENQUEUE``
------------------
-
-The OpenFlow 1.0 specification requires the output port of the
-``OFPAT_ENQUEUE`` action to "refer to a valid physical port (i.e. <
-``OFPP_MAX``) or ``OFPP_IN_PORT``".  Although ``OFPP_LOCAL`` is not less than
-``OFPP_MAX``, it is an 'internal' port which can have QoS applied to it in
-Linux.  Since we allow the ``OFPAT_ENQUEUE`` to apply to 'internal' ports whose
-port numbers are less than ``OFPP_MAX``, we interpret ``OFPP_LOCAL`` as a
-physical port and support ``OFPAT_ENQUEUE`` on it as well.
-
-``OFPT_FLOW_MOD``
------------------
-
-The OpenFlow specification for the behavior of ``OFPT_FLOW_MOD`` is confusing.
-The following tables summarize the Open vSwitch implementation of its behavior
-in the following categories:
-
-"match on priority"
-  Whether the ``flow_mod`` acts only on flows whose priority matches that
-  included in the ``flow_mod`` message.
-
-"match on out_port"
-  Whether the ``flow_mod`` acts only on flows that output to the ``out_port``
-  included in the ``flow_mod`` message (if ``out_port`` is not ``OFPP_NONE``).
-  OpenFlow 1.1 and later have a similar feature (not listed separately here)
-  for ``out_group``.
-
-"match on flow_cookie":
-  Whether the ``flow_mod`` acts only on flows whose ``flow_cookie`` matches an
-  optional controller-specified value and mask.
-
-"updates flow_cookie":
-  Whether the ``flow_mod`` changes the ``flow_cookie`` of the flow or flows
-  that it matches to the ``flow_cookie`` included in the ``flow_mod`` message.
-
-"updates ``OFPFF_`` flags":
-  Whether the ``flow_mod`` changes the ``OFPFF_SEND_FLOW_REM`` flag of the flow
-  or flows that it matches to the setting included in the flags of the
-  ``flow_mod`` message.
-
-"honors ``OFPFF_CHECK_OVERLAP``":
-  Whether the ``OFPFF_CHECK_OVERLAP`` flag in the ``flow_mod`` is significant.
-
-"updates ``idle_timeout``" and "updates ``hard_timeout``":
-  Whether the ``idle_timeout`` and ``hard_timeout`` in the ``flow_mod``,
-  respectively, have an effect on the flow or flows matched by the
-  ``flow_mod``.
-
-"updates idle timer":
-  Whether the ``flow_mod`` resets the per-flow timer that measures how long a
-  flow has been idle.
-
-"updates hard timer":
-  Whether the ``flow_mod`` resets the per-flow timer that measures how long it
-  has been since a flow was modified.
-
-"zeros counters":
-  Whether the ``flow_mod`` resets per-flow packet and byte counters to zero.
-
-"may add a new flow":
-  Whether the ``flow_mod`` may add a new flow to the flow table.  (Obviously
-  this is always true for "add" commands but in some OpenFlow versions "modify"
-  and "modify-strict" can also add new flows.)
-
-"sends ``flow_removed`` message":
-  Whether the ``flow_mod`` generates a ``flow_removed`` message for the flow or
-  flows that it affects.
-
-An entry labeled ``yes`` means that the flow mod type does have the indicated
-behavior; ``---`` means that it does not; an empty cell means that the property
-is not applicable; other values are explained below the table.
-
-OpenFlow 1.0
-~~~~~~~~~~~~
-
-================================ === ====== ====== ====== ======
-                                            MODIFY        DELETE
-RULE                             ADD MODIFY STRICT DELETE STRICT
-================================ === ====== ====== ====== ======
-match on ``priority``            yes  ---    yes    ---    yes
-match on ``out_port``            ---  ---    ---    yes    yes
-match on ``flow_cookie``         ---  ---    ---    ---    ---
-match on ``table_id``            ---  ---    ---    ---    ---
-controller chooses ``table_id``  ---  ---    ---
-updates ``flow_cookie``          yes  yes    yes
-updates ``OFPFF_SEND_FLOW_REM``  yes   +      +
-honors ``OFPFF_CHECK_OVERLAP``   yes   +      +
-updates ``idle_timeout``         yes   +      +
-updates ``hard_timeout``         yes   +      +
-resets idle timer                yes   +      +
-resets hard timer                yes  yes    yes
-zeros counters                   yes   +      +
-may add a new flow               yes  yes    yes
-sends ``flow_removed`` message   ---  ---    ---     %      %
-================================ === ====== ====== ====== ======
-
-where:
-
-``+``
-  "modify" and "modify-strict" only take these actions when they create a new
-  flow, not when they update an existing flow.
-
-``%``
-  "delete" and "delete-strict" generate a ``flow_removed`` message if the deleted
-  flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set.  (Each controller
-  can separately control whether it wants to receive the generated messages.)
-
-OpenFlow 1.1
-~~~~~~~~~~~~
-
-OpenFlow 1.1 makes these changes:
-
-- The controller must now specify the ``table_id`` of the table that is
-  searched for matching flows and into which a flow may be inserted.  Behavior
-  for a ``table_id`` of 255 is undefined.
-
-- A ``flow_mod``, except an "add", can now match on the ``flow_cookie``.
-
-- When a ``flow_mod`` matches on the ``flow_cookie``, "modify" and
-  "modify-strict" never insert a new flow.
-
-================================ === ====== ====== ====== ======
-                                            MODIFY        DELETE
-RULE                             ADD MODIFY STRICT DELETE STRICT
-================================ === ====== ====== ====== ======
-match on ``priority``            yes  ---    yes    ---    yes
-match on ``out_port``            ---  ---    ---    yes    yes
-match on ``flow_cookie``         ---  yes    yes    yes    yes
-match on ``table_id``            yes  yes    yes    yes    yes
-controller chooses ``table_id``  yes  yes    yes
-updates ``flow_cookie``          yes  ---    ---
-updates ``OFPFF_SEND_FLOW_REM``  yes   +      +
-honors ``OFPFF_CHECK_OVERLAP``   yes   +      +
-updates ``idle_timeout``         yes   +      +
-updates ``hard_timeout``         yes   +      +
-resets idle timer                yes   +      +
-resets hard timer                yes  yes    yes
-zeros counters                   yes   +      +
-may add a new flow               yes   #      #
-sends ``flow_removed`` message   ---  ---    ---     %      %
-================================ === ====== ====== ====== ======
-
-where:
-
-``+``
-  "modify" and "modify-strict" only take these actions when they create a new
-  flow, not when they update an existing flow.
-
-``%``
-  "delete" and "delete-strict" generate a ``flow_removed`` message if the deleted
-  flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set.  (Each controller
-  can separately control whether it wants to receive the generated messages.)
-
-``#``
-  "modify" and "modify-strict" only add a new flow if the ``flow_mod`` does not
-  match on any bits of the flow cookie.
-
-OpenFlow 1.2
-~~~~~~~~~~~~
-
-OpenFlow 1.2 makes these changes:
-
-- Only "add" commands ever add flows; "modify" and "modify-strict" never do.
-
-- A new flag ``OFPFF_RESET_COUNTS`` now controls whether "modify" and
-  "modify-strict" reset counters, whereas previously they never reset counters
-  (except when they inserted a new flow).
-
-================================ === ====== ====== ====== ======
-                                            MODIFY        DELETE
-RULE                             ADD MODIFY STRICT DELETE STRICT
-================================ === ====== ====== ====== ======
-match on ``priority``            yes  ---    yes    ---    yes
-match on ``out_port``            ---  ---    ---    yes    yes
-match on ``flow_cookie``         ---  yes    yes    yes    yes
-match on ``table_id``            yes  yes    yes    yes    yes
-controller chooses ``table_id``  yes  yes    yes
-updates ``flow_cookie``          yes  ---    ---
-updates ``OFPFF_SEND_FLOW_REM``  yes  ---    ---
-honors ``OFPFF_CHECK_OVERLAP``   yes  ---    ---
-updates ``idle_timeout``         yes  ---    ---
-updates ``hard_timeout``         yes  ---    ---
-resets idle timer                yes  ---    ---
-resets hard timer                yes  yes    yes
-zeros counters                   yes   &      &
-may add a new flow               yes  ---    ---
-sends ``flow_removed`` message   ---  ---    ---     %      %
-================================ === ====== ====== ====== ======
-
-``%``
-  "delete" and "delete-strict" generate a ``flow_removed`` message if the deleted
-  flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set.  (Each controller
-  can separately control whether it wants to receive the generated messages.)
-
-``&``
-  "modify" and "modify-strict" reset counters if the ``OFPFF_RESET_COUNTS``
-  flag is specified.
-
-OpenFlow 1.3
-~~~~~~~~~~~~
-
-OpenFlow 1.3 makes these changes:
-
-- Behavior for a ``table_id`` of 255 is now defined, for "delete" and
-  "delete-strict" commands, as meaning to delete from all tables.  A
-  ``table_id`` of 255 is now explicitly invalid for other commands.
-
-- New flags ``OFPFF_NO_PKT_COUNTS`` and ``OFPFF_NO_BYT_COUNTS`` for "add"
-  operations.
-
-The table for 1.3 is the same as the one shown above for 1.2.
-
-OpenFlow 1.4
-~~~~~~~~~~~~
-
-OpenFlow 1.4 makes these changes:
-
-- Adds the "importance" field to ``flow_mods``, but it does not explicitly
-  specify which kinds of ``flow_mods`` set the importance.  For consistency,
-  Open vSwitch uses the same rule for importance as for ``idle_timeout`` and
-  ``hard_timeout``, that is, only an "ADD" flow_mod sets the importance.  (This
-  issue has been filed with the ONF as EXT-496.)
-
-.. TODO(stephenfin) Link to EXT-496
-
-- Adds an eviction mechanism that automatically deletes entries of lower
-  importance to make space for newer entries.
-
-OpenFlow 1.4 Bundles
---------------------
-
-Open vSwitch makes all flow table modifications atomically, i.e., any datapath
-packet only sees flow table configurations either before or after any change
-made by any ``flow_mod``.  For example, if a controller removes all flows with
-a single OpenFlow ``flow_mod``, no packet sees an intermediate version of the
-OpenFlow pipeline where only some of the flows have been deleted.
-
-It should be noted that Open vSwitch caches datapath flows, and that the cached
-flows are *NOT* flushed immediately when a flow table changes.  Instead, the
-datapath flows are revalidated against the new flow table as soon as possible,
-and usually within one second of the modification.  This design amortizes the
-cost of datapath cache flushing across multiple flow table changes, and has a
-significant performance effect during simultaneous heavy flow table churn and
-high traffic load.  This means that different cached datapath flows may have
-been computed based on different flow table configurations, but each of the
-datapath flows is guaranteed to have been computed over a coherent view of the
-flow tables, as described above.
-
-With OpenFlow 1.4 bundles this atomicity can be extended across an arbitrary
-set of ``flow_mod`` messages.  Bundles are supported for ``flow_mod`` and
-``port_mod`` messages only.  For ``flow_mod``, both ``atomic`` and ``ordered``
-bundle flags are trivially supported, as all bundled messages are executed in
-the order they were added and all flow table modifications are now atomic to
-the datapath.
-Port mods may not appear in atomic bundles, as port status modifications are
-not atomic.
-
-To support bundles, ovs-ofctl has a ``--bundle`` option that makes the
-flow mod commands (``add-flow``, ``add-flows``, ``mod-flows``, ``del-flows``,
-and ``replace-flows``) use an OpenFlow 1.4 bundle to apply the
-modifications as a single atomic transaction.  If any of the flow mods
-in a transaction fails, none of them is executed.  All flow mods in a
-bundle appear to datapath lookups simultaneously.
-
-Furthermore, ovs-ofctl ``add-flow`` and ``add-flows`` commands now accept
-arbitrary flow mods as an input by allowing the flow specification to
-start with an explicit ``add``, ``modify``, ``modify_strict``, ``delete``, or
-``delete_strict`` keyword.  A missing keyword is treated as ``add``, so
-this is fully backwards compatible.  With the new ``--bundle`` option
-all the flow mods are executed as a single atomic transaction using an
-OpenFlow 1.4 bundle.  Without the ``--bundle`` option the flow mods are
-executed in order up to the first failing ``flow_mod``, and in case of an
-error the earlier successful ``flow_mod`` calls are not rolled back.
-
-``OFPT_PACKET_IN``
-------------------
-
-The OpenFlow 1.1 specification for ``OFPT_PACKET_IN`` is confusing.  The
-definition in OF1.1 ``openflow.h`` is[*]:
-
-::
-
-    /* Packet received on port (datapath -> controller). */
-    struct ofp_packet_in {
-        struct ofp_header header;
-        uint32_t buffer_id;     /* ID assigned by datapath. */
-        uint32_t in_port;       /* Port on which frame was received. */
-        uint32_t in_phy_port;   /* Physical Port on which frame was received. */
-        uint16_t total_len;     /* Full length of frame. */
-        uint8_t reason;         /* Reason packet is being sent (one of OFPR_*) */
-        uint8_t table_id;       /* ID of the table that was looked up */
-        uint8_t data[0];        /* Ethernet frame, halfway through 32-bit word,
-                                   so the IP header is 32-bit aligned.  The
-                                   amount of data is inferred from the length
-                                   field in the header.  Because of padding,
-                                   offsetof(struct ofp_packet_in, data) ==
-                                   sizeof(struct ofp_packet_in) - 2. */
-    };
-    OFP_ASSERT(sizeof(struct ofp_packet_in) == 24);
-
-The confusing part is the comment on the ``data[]`` member.  This comment is a
-leftover from OF1.0 ``openflow.h``, in which the comment was correct:
-``sizeof(struct ofp_packet_in)`` is 20 in OF1.0 and ``offsetof(struct
-ofp_packet_in, data)`` is 18.  When OF1.1 was written, the structure members
-were changed but the comment was carelessly not updated, and the comment became
-wrong: ``sizeof(struct ofp_packet_in)`` and ``offsetof(struct ofp_packet_in,
-data)`` are both 24 in OF1.1.
-
-That leaves the question of how to implement ``ofp_packet_in`` in OF1.1.  The
-OpenFlow reference implementation for OF1.1 does not include any padding, that
-is, the first byte of the encapsulated frame immediately follows the
-``table_id`` member without a gap.  Open vSwitch therefore implements it the
-same way for compatibility.
-
-For an earlier discussion, please see the thread archived at:
-https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html
-
-[*] The quoted definition is directly from OF1.1.  Definitions used inside OVS
-omit the 8-byte ``ofp_header`` members, so the sizes in this discussion are
-8 bytes larger than those declared in OVS header files.
-
-VLAN Matching
--------------
-
-The 802.1Q VLAN header causes more trouble than any other 4 bytes in
-networking.  More specifically, three versions of OpenFlow and Open vSwitch
-have among them four different ways to match the contents and presence of the
-VLAN header.  The following table describes how each version works.
-
-======== ============= =============== =============== ================
- Match        NXM          OF1.0            OF1.1           OF1.2
-======== ============= =============== =============== ================
- ``[1]`` ``0000/0000`` ``????/1,??/?`` ``????/1,??/?`` ``0000/0000,--``
- ``[2]`` ``0000/ffff`` ``ffff/0,??/?`` ``ffff/0,??/?`` ``0000/ffff,--``
- ``[3]`` ``1xxx/1fff`` ``0xxx/0,??/1`` ``0xxx/0,??/1`` ``1xxx/ffff,--``
- ``[4]`` ``z000/f000`` ``????/1,0y/0`` ``fffe/0,0y/0`` ``1000/1000,0y``
- ``[5]`` ``zxxx/ffff`` ``0xxx/0,0y/0`` ``0xxx/0,0y/0`` ``1xxx/ffff,0y``
- ``[6]`` ``0000/0fff`` ``<none>``      ``<none>``      ``<none>``
- ``[7]`` ``0000/f000`` ``<none>``      ``<none>``      ``<none>``
- ``[8]`` ``0000/efff`` ``<none>``      ``<none>``      ``<none>``
- ``[9]`` ``1001/1001`` ``<none>``      ``<none>``      ``1001/1001,--``
-``[10]`` ``3000/3000`` ``<none>``      ``<none>``      ``<none>``
-``[11]`` ``1000/1000`` ``<none>``      ``fffe/0,??/1`` ``1000/1000,--``
-======== ============= =============== =============== ================
-
-where:
-
-Match:
-  See the list below.
-
-NXM:
-  ``xxxx/yyyy`` means ``NXM_OF_VLAN_TCI_W`` with value ``xxxx`` and mask
-  ``yyyy``.  A mask of ``0000`` is equivalent to omitting
-  ``NXM_OF_VLAN_TCI(_W)``, a mask of ``ffff`` is equivalent to
-  ``NXM_OF_VLAN_TCI``.
-
-OF1.0, OF1.1:
-  ``wwww/x,yy/z`` means ``dl_vlan`` ``wwww``, ``OFPFW_DL_VLAN`` ``x``,
-  ``dl_vlan_pcp`` ``yy``, and ``OFPFW_DL_VLAN_PCP`` ``z``.  If
-  ``OFPFW_DL_VLAN`` or ``OFPFW_DL_VLAN_PCP`` is 1, the corresponding field
-  value is wildcarded, otherwise it is matched.  ``?`` means that the given
-  bits are ignored (their conventional values are ``0000/x,00/0`` in OF1.0,
-  ``0000/x,00/1`` in OF1.1; ``x`` is never ignored).  ``<none>`` means that the
-  given match is not supported.
-
-OF1.2:
-  ``xxxx/yyyy,zz`` means ``OXM_OF_VLAN_VID_W`` with value ``xxxx`` and mask
-  ``yyyy``, and ``OXM_OF_VLAN_PCP`` (which is not maskable) with value ``zz``.
-  A mask of ``0000`` is equivalent to omitting ``OXM_OF_VLAN_VID(_W)``, a mask
-  of ``ffff`` is equivalent to ``OXM_OF_VLAN_VID``.  ``--`` means that
-  ``OXM_OF_VLAN_PCP`` is omitted.  ``<none>`` means that the given match is not
-  supported.
-
-The matches are:
-
-``[1]``
-  Matches any packet, that is, one without an 802.1Q header or with an 802.1Q
-  header with any TCI value.
-
-``[2]``
-  Matches only packets without an 802.1Q header.
-
-  NXM:
-    Any match with ``vlan_tci == 0`` and ``(vlan_tci_mask & 0x1000) != 0`` is
-    equivalent to the one listed in the table.
-
-  OF1.0:
-    The spec doesn't define behavior if ``dl_vlan`` is set to ``0xffff`` and
-    ``OFPFW_DL_VLAN_PCP`` is not set.
-
-  OF1.1:
-    The spec says explicitly to ignore ``dl_vlan_pcp`` when ``dl_vlan`` is set
-    to ``0xffff``.
-
-  OF1.2:
-    The spec doesn't say what should happen if ``vlan_vid == 0`` and
-    ``(vlan_vid_mask & 0x1000) != 0`` but ``vlan_vid_mask != 0x1000``, but it
-    would be straightforward to also interpret as ``[2]``.
-
-``[3]``
-  Matches only packets that have an 802.1Q header with VID ``xxx`` (and any
-  PCP).
-
-``[4]``
-  Matches only packets that have an 802.1Q header with PCP ``y`` (and any VID).
-
-  NXM:
-    ``z`` is ``(y << 1) | 1``.
-
-  OF1.0:
-    The spec isn't very clear, but OVS implements it this way.
-
-  OF1.2:
-    Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1000``
-    would also work, but the spec doesn't define their behavior.
-
-``[5]``
-  Matches only packets that have an 802.1Q header with VID ``xxx`` and PCP
-  ``y``.
-
-  NXM:
-    ``z`` is ``(y << 1) | 1``.
-
-  OF1.2:
-    Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1fff``
-    would also work.
-
-``[6]``
-  Matches packets with no 802.1Q header or with an 802.1Q header with a VID of
-  0.  Only possible with NXM.
-
-``[7]``
-  Matches packets with no 802.1Q header or with an 802.1Q header with a PCP of
-  0.  Only possible with NXM.
-
-``[8]``
-  Matches packets with no 802.1Q header or with an 802.1Q header with both VID
-  and PCP of 0.  Only possible with NXM.
-
-``[9]``
-  Matches only packets that have an 802.1Q header with an odd-numbered VID (and
-  any PCP).  Only possible with NXM and OF1.2.  (This is just an example; one
-  can match on any desired VID bit pattern.)
-
-``[10]``
-  Matches only packets that have an 802.1Q header with an odd-numbered PCP (and
-  any VID).  Only possible with NXM.  (This is just an example; one can match
-  on any desired PCP bit pattern.)
-
-``[11]``
-  Matches any packet with an 802.1Q header, regardless of VID or PCP.
-
-Additional notes:
-
-OF1.2:
-  The top three bits of ``OXM_OF_VLAN_VID`` are fixed to zero, so bits 13, 14,
-  and 15 in the masks listed in the table may be set to arbitrary values, as
-  long as the corresponding value bits are also zero.  The suggested ``ffff``
-  mask for [2], [3], and [5] allows a shorter OXM representation (the mask is
-  omitted) than the minimal ``1fff`` mask.
-
-Flow Cookies
-------------
-
-OpenFlow 1.0 and later versions have the concept of a "flow cookie", which is a
-64-bit integer value attached to each flow.  The treatment of the flow cookie
-has varied greatly across OpenFlow versions, however.
-
-In OpenFlow 1.0:
-
-- ``OFPFC_ADD`` set the cookie in the flow that it added.
-
-- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` updated the cookie for the flow
-  or flows that they modified.
-
-- ``OFPST_FLOW`` messages included the flow cookie.
-
-- ``OFPT_FLOW_REMOVED`` messages reported the cookie of the flow that was
-  removed.
-
-OpenFlow 1.1 made the following changes:
-
-- Flow mod operations ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``,
-  ``OFPFC_DELETE``, and ``OFPFC_DELETE_STRICT``, plus flow stats requests and
-  aggregate stats requests, gained the ability to match on flow cookies with an
-  arbitrary mask.
-
-- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to add a new flow,
-  in the case of no match, only if the flow table modification operation did
-  not match on the cookie field.  (In OpenFlow 1.0, modify operations always
-  added a new flow when there was no match.)
-
-- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` no longer updated flow cookies.
-
-OpenFlow 1.2 made the following changes:
-
-- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to never add a new
-  flow, regardless of whether the flow cookie was used for matching.
-
-Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 behavior with
-the following extensions:
-
-- An NXM extension field ``NXM_NX_COOKIE(_W)`` allows the NXM versions of
-  ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``, ``OFPFC_DELETE``, and
-  ``OFPFC_DELETE_STRICT`` ``flow_mod`` calls, plus flow stats requests and
-  aggregate stats requests, to match on flow cookies with arbitrary masks.
-  This is much like the equivalent OpenFlow 1.1 feature.
-
-- Like OpenFlow 1.1, ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` add a new flow
-  if there is no match and the mask is zero (or not given).
-
-- The ``cookie`` field in ``OFPT_FLOW_MOD`` and ``NXT_FLOW_MOD`` messages is
-  used as the cookie value for ``OFPFC_ADD`` commands, as described in OpenFlow
-  1.0.  For ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` commands, the
-  ``cookie`` field is used as a new cookie for flows that match unless it is
-  ``UINT64_MAX``, in which case the flow's cookie is not updated.
-
-- ``NXT_PACKET_IN`` (the Nicira extended version of ``OFPT_PACKET_IN``) reports
-  the cookie of the rule that generated the packet, or all-1-bits if no rule
-  generated the packet.  (Older versions of OVS used all-0-bits instead of
-  all-1-bits.)
-
-The following table shows the handling of different protocols when receiving
-``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` messages.  A mask of 0 indicates
-either an explicit mask of zero or an implicit one by not specifying the
-``NXM_NX_COOKIE(_W)`` field.
-
-==============  ======  ======  =============  =============
-                Match   Update   Add on miss    Add on miss
-                cookie  cookie     mask!=0        mask==0
-==============  ======  ======  =============  =============
-OpenFlow 1.0      no     yes    (add on miss)  (add on miss)
-OpenFlow 1.1     yes      no         no             yes
-OpenFlow 1.2     yes      no         no             no
-NXM              yes     yes\*       no             yes
-==============  ======  ======  =============  =============
-
-\* Updates the flow's cookie unless the ``cookie`` field is ``UINT64_MAX``.
-
-Multiple Table Support
-----------------------
-
-OpenFlow 1.0 has only rudimentary support for multiple flow tables.  Notably,
-OpenFlow 1.0 does not allow the controller to specify the flow table to which a
-flow is to be added.  Open vSwitch adds an extension for this purpose, which is
-enabled on a per-OpenFlow connection basis using the ``NXT_FLOW_MOD_TABLE_ID``
-message.  When the extension is enabled, the upper 8 bits of the ``command``
-member in an ``OFPT_FLOW_MOD`` or ``NXT_FLOW_MOD`` message designate the table
-to which a flow is to be added.
-
-The Open vSwitch software switch implementation offers 255 flow tables.  On
-packet ingress, only the first flow table (table 0) is searched, and the
-contents of the remaining tables are not considered in any way.  Tables other
-than table 0 only come into play when an ``NXAST_RESUBMIT_TABLE`` action
-specifies another table to search.
-
-Tables 128 and above are reserved for use by the switch itself.  Controllers
-should use only tables 0 through 127.
-
-``OFPTC_*`` Table Configuration
--------------------------------
-
-This section covers the history of the ``OFPTC_*`` table configuration bits
-across OpenFlow versions.
-
-OpenFlow 1.0 flow tables had fixed configurations.
-
-OpenFlow 1.1 enabled controllers to configure behavior upon flow table miss and
-added the ``OFPTC_MISS_*`` constants for that purpose.  ``OFPTC_*`` did not
-control anything else but it was nevertheless conceptualized as a set of
-bit-fields instead of an enum.  OF1.1 added the ``OFPT_TABLE_MOD`` message to
-set ``OFPTC_MISS_*`` for a flow table and added the ``config`` field to the
-``OFPST_TABLE`` reply to report the current setting.
-
-OpenFlow 1.2 did not change anything in this regard.
-
-OpenFlow 1.3 switched to another means of changing flow table miss behavior and
-deprecated ``OFPTC_MISS_*`` without adding any more ``OFPTC_*`` constants.
-This meant that ``OFPT_TABLE_MOD`` now had no purpose at all, but OF1.3 kept it
-around "for backward compatibility with older and newer versions of the
-specification."  At the same time, OF1.3 introduced a new message
-``OFPMP_TABLE_FEATURES`` that included a field ``config`` documented as reporting
-the ``OFPTC_*`` values set with ``OFPT_TABLE_MOD``; of course this served no
-real purpose because no ``OFPTC_*`` values are defined.  OF1.3 did remove the
-``OFPTC_*`` field from ``OFPMP_TABLE`` (previously named ``OFPST_TABLE``).
-
-OpenFlow 1.4 defined two new ``OFPTC_*`` constants, ``OFPTC_EVICTION`` and
-``OFPTC_VACANCY_EVENTS``, using bits that did not overlap with ``OFPTC_MISS_*``
-even though those bits had not been defined since OF1.2.  ``OFPT_TABLE_MOD``
-still controlled these settings.  The field for ``OFPTC_*`` values in
-``OFPMP_TABLE_FEATURES`` was renamed from ``config`` to ``capabilities`` and
-documented as reporting the flags that are supported in an ``OFPT_TABLE_MOD``
-message.  The ``OFPMP_TABLE_DESC`` message newly added in OF1.4 reported the
-``OFPTC_*`` setting.
-
-OpenFlow 1.5 did not change anything in this regard.
-
-.. list-table:: Revisions
-   :header-rows: 1
-
-   * - OpenFlow
-     - ``OFPTC_*`` flags
-     - ``TABLE_MOD``
-     - Statistics
-     - ``TABLE_FEATURES``
-     - ``TABLE_DESC``
-   * - OF1.0
-     - none
-     - no (\*)(+)
-     - no (\*)
-     - nothing (\*)(+)
-     - no (\*)(+)
-   * - OF1.1/1.2
-     - ``MISS_*``
-     - yes
-     - yes
-     - nothing (+)
-     - no (+)
-   * - OF1.3
-     - none
-     - yes (\*)
-     - no (\*)
-     - config (\*)
-     - no (\*)(+)
-   * - OF1.4/1.5
-     - ``EVICTION``/``VACANCY_EVENTS``
-     - yes
-     - no
-     - capabilities
-     - yes
-
-where:
-
-OpenFlow:
-  The OpenFlow version(s).
-
-``OFPTC_*`` flags:
-  The ``OFPTC_*`` flags defined in those versions.
-
-``TABLE_MOD``:
-  Whether ``OFPT_TABLE_MOD`` can modify ``OFPTC_*`` flags.
-
-Statistics:
-  Whether ``OFPST_TABLE/OFPMP_TABLE`` reports the ``OFPTC_*`` flags.
-
-``TABLE_FEATURES``:
-  What ``OFPMP_TABLE_FEATURES`` reports (if it exists): either the current
-  configuration or the switch's capabilities.
-
-``TABLE_DESC``:
-  Whether ``OFPMP_TABLE_DESC`` reports the current configuration.
-
-(\*): Nothing to report/change anyway.
-
-(+): No such message.
-
-IPv6
-----
-
-Open vSwitch supports stateless handling of IPv6 packets.  Flows can be written
-to support matching TCP, UDP, and ICMPv6 headers within an IPv6 packet.  Deeper
-matching of some Neighbor Discovery messages is also supported.
-
-IPv6 was not designed to interact well with middle-boxes.  This, combined with
-Open vSwitch's stateless nature, has affected the processing of IPv6 traffic,
-which is detailed below.
-
-Extension Headers
-~~~~~~~~~~~~~~~~~
-
-The base IPv6 header is incredibly simple with the intention of only containing
-information relevant for routing packets between two endpoints.  IPv6 relies
-heavily on the use of extension headers to provide any other functionality.
-Unfortunately, the extension headers were designed in such a way that it is
-impossible to move to the next header (including the layer-4 payload) unless
-the current header is understood.
-
-Open vSwitch will process the following extension headers and continue to the
-next header:
-
-- Fragment (see the next section)
-- AH (Authentication Header)
-- Hop-by-Hop Options
-- Routing
-- Destination Options
-
-When a header is encountered that is not in that list, it is considered
-"terminal".  A terminal header's IPv6 protocol value is stored in ``nw_proto``
-for matching purposes.  If a terminal header is TCP, UDP, or ICMPv6, the packet
-will be further processed in an attempt to extract layer-4 information.
-
-Fragments
-~~~~~~~~~
-
-IPv6 requires that every link in the internet have an MTU of 1280 octets or
-greater (RFC 2460).  As such, a terminal header (as described above in
-"Extension Headers") in the first fragment should generally be reachable.  In
-this case, the terminal header's IPv6 protocol type is stored in the
-``nw_proto`` field for matching purposes.  If a terminal header cannot be found
-in the first fragment (one with a fragment offset of zero), the ``nw_proto``
-field is set to 0.  Subsequent fragments (those with a non-zero fragment
-offset) have the ``nw_proto`` field set to the IPv6 protocol type for fragments
-(44).
-
-Jumbograms
-~~~~~~~~~~
-
-An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer than
-65,535 octets.  Jumbograms are only relevant in subnets with a link MTU greater
-than 65,575 octets, and are not required to be supported on nodes that do not
-connect to links with such large MTUs.  Currently, Open vSwitch doesn't process
-jumbograms.
-
-In-Band Control
----------------
-
-Motivation
-~~~~~~~~~~
-
-An OpenFlow switch must establish and maintain a TCP network connection to its
-controller.  There are two basic ways to categorize the network that this
-connection traverses: either it is completely separate from the one that the
-switch is otherwise controlling, or its path may overlap the network that the
-switch controls.  We call the former case "out-of-band control", the latter
-case "in-band control".
-
-Out-of-band control has the following benefits:
-
-- Simplicity: Out-of-band control slightly simplifies the switch
-  implementation.
-
-- Reliability: Excessive switch traffic volume cannot interfere with control
-  traffic.
-
-- Integrity: Machines not on the control network cannot impersonate a switch or
-  a controller.
-
-- Confidentiality: Machines not on the control network cannot snoop on control
-  traffic.
-
-In-band control, on the other hand, has the following advantages:
-
-- No dedicated port: There is no need to dedicate a physical switch port to
-  control, which is important on switches that have few ports (e.g. wireless
-  routers, low-end embedded platforms).
-
-- No dedicated network: There is no need to build and maintain a separate
-  control network.  This is important in many environments because it reduces
-  proliferation of switches and wiring.
-
-Open vSwitch supports both out-of-band and in-band control.  This section
-describes the principles behind in-band control.  See the description of the
-Controller table in ovs-vswitchd.conf.db(5) to configure OVS for in-band
-control.
-
-Principles
-~~~~~~~~~~
-
-The fundamental principle of in-band control is that an OpenFlow switch must
-recognize and switch control traffic without involving the OpenFlow controller.
-All the details of implementing in-band control are special cases of this
-principle.
-
-The rationale for this principle is simple.  If the switch does not handle
-in-band control traffic itself, then it will be caught in a contradiction: it
-must contact the controller, but it cannot, because only the controller can set
-up the flows that are needed to contact the controller.
-
-The following points describe important special cases of this principle.
-
-- In-band control must be implemented regardless of whether the switch is
-  connected.
-
-  It is tempting to implement the in-band control rules only when the switch is
-  not connected to the controller, using the reasoning that the controller
-  should have complete control once it has established a connection with the
-  switch.
-
-  This does not work in practice.  Consider the case where the switch is
-  connected to the controller.  Occasionally it can happen that the controller
-  forgets or otherwise needs to obtain the MAC address of the switch.  To do
-  so, the controller sends a broadcast ARP request.  A switch that implements
-  the in-band control rules only when it is disconnected will then send an
-  ``OFPT_PACKET_IN`` message up to the controller.  The controller will be
-  unable to respond, because it does not know the MAC address of the switch.
-  This is a deadlock situation that can only be resolved by the switch noticing
-  that its connection to the controller has hung and reconnecting.
-
-- In-band control must override flows set up by the controller.
-
-  It is reasonable to assume that flows set up by the OpenFlow controller
-  should take precedence over in-band control, on the basis that the controller
-  should be in charge of the switch.
-
-  Again, this does not work in practice.  Reasonable controller implementations
-  may set up a "last resort" fallback rule that wildcards every field and,
-  e.g., sends it up to the controller or discards it.  If a controller does
-  that, then it will isolate itself from the switch.
-
-- The switch must recognize all control traffic.
-
-  The fundamental principle of in-band control states, in part, that a switch
-  must recognize control traffic without involving the OpenFlow controller.
-  More specifically, the switch must recognize *all* control traffic.  "False
-  negatives", that is, packets that constitute control traffic but that the
-  switch does not recognize as control traffic, lead to control traffic storms.
-
-  Consider an OpenFlow switch that only recognizes control packets sent to or
-  from that switch.  Now suppose that two switches of this type, named A and B,
-  are connected to ports on an Ethernet hub (not a switch) and that an OpenFlow
-  controller is connected to a third hub port.  In this setup, control traffic
-  sent by switch A will be seen by switch B, which will send it to the
-  controller as part of an ``OFPT_PACKET_IN`` message.  Switch A will then see
-  the ``OFPT_PACKET_IN`` message's packet, re-encapsulate it in another
-  ``OFPT_PACKET_IN``, and send it to the controller.  Switch B will then see
-  that ``OFPT_PACKET_IN``, and so on in an infinite loop.
-
-  Incidentally, the consequences of "false positives", where packets that are
-  not control traffic are nevertheless recognized as control traffic, are much
-  less severe.  The controller will not be able to control their behavior, but
-  the network will remain in working order.  False positives do constitute a
-  security problem.
-
-- The switch should use echo-requests to detect disconnection.
-
-  TCP will notice that a connection has hung, but this can take a considerable
-  amount of time.  For example, with default settings the Linux kernel TCP
-  implementation will retransmit for between 13 and 30 minutes, depending on
-  the connection's retransmission timeout, according to kernel documentation.
-  This is far too long for a switch to be disconnected, so an OpenFlow switch
-  should implement its own connection timeout.  OpenFlow ``OFPT_ECHO_REQUEST``
-  messages are the best way to do this, since they test the OpenFlow connection
-  itself.
-
-Implementation
-~~~~~~~~~~~~~~
-
-This section describes how Open vSwitch implements in-band control.  Correctly
-implementing in-band control has proven difficult due to its many subtleties,
-and has thus gone through many iterations.  Please read through and understand
-the reasoning behind the chosen rules before making modifications.
-
-Open vSwitch implements in-band control as "hidden" flows, that is, flows that
-are not visible through OpenFlow, and at a higher priority than wildcarded
-flows can be set up through OpenFlow.  This is done so that the OpenFlow
-controller cannot interfere with them and possibly break connectivity with its
-switches.  It is possible to see all flows, including in-band ones, with the
-ovs-appctl "bridge/dump-flows" command.
-
-The Open vSwitch implementation of in-band control can hide traffic to
-arbitrary "remotes", where each remote is one TCP port on one IP address.
-Currently the remotes are automatically configured as the in-band OpenFlow
-controllers plus the OVSDB managers, if any.  (The latter is a requirement
-because OVSDB managers are responsible for configuring OpenFlow controllers, so
-if the manager cannot be reached then OpenFlow cannot be reconfigured.)
-
-The following rules (with the OFPP_NORMAL action) are set up on any bridge that
-has any remotes:
-
-(a)
-  DHCP requests sent from the local port.
-(b)
-  ARP replies to the local port's MAC address.
-(c)
-  ARP requests from the local port's MAC address.
-
-In-band also sets up the following rules for each unique next-hop MAC address
-for the remotes' IPs (the "next hop" is either the remote itself, if it is on a
-local subnet, or the gateway to reach the remote):
-
-(d)
-  ARP replies to the next hop's MAC address.
-(e)
-  ARP requests from the next hop's MAC address.
-
-In-band also sets up the following rules for each unique remote IP address:
-
-(f)
-  ARP replies containing the remote's IP address as a target.
-(g)
-  ARP requests containing the remote's IP address as a source.
-
-In-band also sets up the following rules for each unique remote (IP,port) pair:
-
-(h)
-  TCP traffic to the remote's IP and port.
-(i)
-  TCP traffic from the remote's IP and port.
-
-The goal of these rules is to be as narrow as possible to allow a switch to
-join a network and be able to communicate with the remotes.  As mentioned
-earlier, these rules have higher priority than the controller's rules, so if
-they are too broad, they may prevent the controller from implementing its
-policy.  As such, in-band actively monitors some aspects of flow and packet
-processing so that the rules can be made more precise.
-
-In-band control monitors attempts to add flows into the datapath that could
-interfere with its duties.  The datapath only allows exact match entries, so
-in-band control is able to be very precise about the flows it prevents.  Flows
-that miss in the datapath are sent to userspace to be processed, so preventing
-these flows from being cached in the "fast path" does not affect correctness.
-The only type of flow that is currently prevented is one that would prevent
-DHCP replies from being seen by the local port.  For example, a rule that
-forwarded all DHCP traffic to the controller would not be allowed, but one that
-forwarded to all ports (including the local port) would.
-
-As mentioned earlier, packets that miss in the datapath are sent to the
-userspace for processing.  The userspace has its own flow table, the
-"classifier", so in-band checks whether any special processing is needed before
-the classifier is consulted.  If a packet is a DHCP response to a request from
-the local port, the packet is forwarded to the local port, regardless of the
-flow table.  Note that this requires L7 processing of DHCP replies to determine
-whether the 'chaddr' field matches the MAC address of the local port.
-
-It is interesting to note that for an L3-based in-band control mechanism, the
-majority of rules are devoted to ARP traffic.  At first glance, some of these
-rules appear redundant.  However, each serves an important role.  First, in
-order to determine the MAC address of the remote side (controller or gateway)
-for other ARP rules, we must allow ARP traffic for our local port with rules
-(b) and (c).  If we are between a switch and its connection to the remote, we
-have to allow the other switch's ARP traffic through.  This is done with
-rules (d) and (e), since we do not know the addresses of the other switches a
-priori, but do know the remote's or gateway's.  Finally, if the remote is
-running in a local guest VM that is not reached through the local port, the
-switch that is connected to the VM must allow ARP traffic based on the remote's
-IP address, since it will not know the MAC address of the local port that is
-sending the traffic or the MAC address of the remote in the guest VM.
-
-With a few notable exceptions below, in-band should work in most network
-setups.  The following are considered "supported" in the current
-implementation:
-
-- Locally Connected.  The switch and remote are on the same subnet.  This uses
-  rules (a), (b), (c), (h), and (i).
-
-- Reached through Gateway.  The switch and remote are on different subnets and
-  must go through a gateway.  This uses rules (a), (b), (c), (h), and (i).
-
-- Between Switch and Remote.  This switch is between another switch and the
-  remote, and we want to allow the other switch's traffic through.  This uses
-  rules (d), (e), (h), and (i).  It uses (b) and (c) indirectly in order to
-  know the MAC address for rules (d) and (e).  Note that DHCP for the other
-  switch will not work unless an OpenFlow controller explicitly lets this
-  switch pass the traffic.
-
-- Between Switch and Gateway.  This switch is between another switch and the
-  gateway, and we want to allow the other switch's traffic through.  This uses
-  the same rules and logic as the "Between Switch and Remote" configuration
-  described earlier.
-
-- Remote on Local VM.  The remote is a guest VM on the system running in-band
-  control.  This uses rules (a), (b), (c), (h), and (i).
-
-- Remote on Local VM with Different Networks.  The remote is a guest VM on the
-  system running in-band control, but the local port is not used to connect to
-  the remote.  For example, an IP address is configured on eth0 of the switch.
-  The remote's VM is connected through eth1 of the switch, but an IP address
-  has not been configured for that port on the switch.  As such, the switch
-  will use eth0 to connect to the remote, and eth1's rules about the local port
-  will not work.  In the example, the switch attached to eth0 would use rules
-  (a), (b), (c), (h), and (i) on eth0.  The switch attached to eth1 would use
-  rules (f), (g), (h), and (i).
-
-The following are explicitly *not* supported by in-band control:
-
-- Specify Remote by Name.  Currently, the remote must be identified by IP
-  address.  A naive approach would be to permit all DNS traffic.
-  Unfortunately, this would prevent the controller from defining any policy
-  over DNS.  Since switches that are located behind us need to connect to the
-  remote, in-band cannot simply add a rule that allows DNS traffic from the
-  local port.  The "correct" way to support this is to parse DNS requests to
-  allow all traffic related to a request for the remote's name through.  Due to
-  the potential security problems and amount of processing, we decided to hold
-  off for the time being.
-
-- Differing Remotes for Switches.  All switches must know the L3 addresses for
-  all the remotes that other switches may use, since rules need to be set up to
-  allow traffic related to those remotes through.  See rules (f), (g), (h), and
-  (i).
-
-- Differing Routes for Switches.  In order for the switch to allow other
-  switches to connect to a remote through a gateway, it allows the gateway's
-  traffic through with rules (d) and (e).  If the routes to the remote differ
-  for the two switches, we will not know the MAC address of the alternate
-  gateway.
-
-Action Reproduction
--------------------
-
-It seems likely that many controllers, at least at startup, use the OpenFlow
-"flow statistics" request to obtain existing flows, then compare the flows'
-actions against the actions that they expect to find.  Before version 1.8.0,
-Open vSwitch always returned exact, byte-for-byte copies of the actions that
-had been added to the flow table.  The current version of Open vSwitch does not
-always do this in some exceptional cases.  This section lists the exceptions
-that controller authors must keep in mind if they compare actual actions
-against desired actions in a bytewise fashion:
-
-- Open vSwitch zeros padding bytes in action structures, regardless of their
-  values when the flows were added.
-
-- Open vSwitch "normalizes" the instructions in OpenFlow 1.1 (and later) in the
-  following way:
-
-  * OVS sorts the instructions into the following order: Apply-Actions,
-    Clear-Actions, Write-Actions, Write-Metadata, Goto-Table.
-
-  * OVS drops Apply-Actions instructions that have empty action lists.
-
-  * OVS drops Write-Actions instructions that have empty action sets.
-
-Please report other discrepancies, if you notice any, so that we can fix or
-document them.
-
-Suggestions
------------
-
-Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org.
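The instruction normalization listed under "Action Reproduction" in the removed DESIGN.rst suggests that a controller should normalize both sides before comparing, rather than comparing bytes. The following is a minimal Python sketch of that idea. The tuple representation, the instruction names, and both function names are assumptions made for illustration; they are not part of Open vSwitch or of any OpenFlow library.

```python
# Illustrative sketch only; these names are hypothetical, not an OVS API.
# Order in which the text says OVS lists OpenFlow 1.1+ instructions.
INSTRUCTION_ORDER = [
    "apply_actions",
    "clear_actions",
    "write_actions",
    "write_metadata",
    "goto_table",
]


def normalize_instructions(instructions):
    """Return instructions in the normalized form described in the text.

    `instructions` is a list of (name, payload) tuples.  For apply_actions
    and write_actions the payload is a list of actions; for the other
    instructions it is an opaque value.
    """
    kept = [
        (name, payload)
        for name, payload in instructions
        # Drop Apply-Actions and Write-Actions that contain no actions.
        if not (name in ("apply_actions", "write_actions") and not payload)
    ]
    # Sort into the documented order.
    return sorted(kept, key=lambda item: INSTRUCTION_ORDER.index(item[0]))


def same_instructions(expected, actual):
    """Compare two instruction lists after normalizing both sides."""
    return normalize_instructions(expected) == normalize_instructions(actual)
```

A controller that stores the instructions it sent could pass them as `expected` and the parsed flow statistics reply as `actual`. Padding bytes, which the text says OVS zeros, are outside the scope of this sketch.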
diff --git a/Documentation/OVSDB-replication.rst b/Documentation/OVSDB-replication.rst

deleted file mode 100644 (file)

index fbf5a9b..0000000
--- a/Documentation/OVSDB-replication.rst
+++ /dev/null
@@ -1,172 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-================================
-OVSDB Replication Implementation
-================================
-
-Given two Open vSwitch databases with the same schema, OVSDB replication keeps
-these databases in the same state, i.e. each of the databases has the same
-contents at any given time even if they are not running on the same host.  This
-document elaborates on the implementation details to provide this
-functionality.
-
-Terminology
------------
-
-Source of truth database
-  database whose content will be replicated to another database.
-
-Active server
-  ovsdb-server providing an RPC interface to the source of truth database.
-
-Standby server
-  ovsdb-server providing an RPC interface to the database that is not the source
-  of truth.
-
-Design
-------
-
-The overall design of replication consists of one ovsdb-server (active server)
-communicating the state of its databases to another ovsdb-server (standby
-server) so that the latter keeps its own databases in that same state.  To
-achieve this, the standby server acts as a client of the active server, in the
-sense that it sends a monitor request to keep up to date with the changes in
-the active server databases. When a notification from the active server
-arrives, the standby server executes the necessary set of operations so its
-databases reach the same state as the active server databases.  Below is the
-design represented as a diagram::
-
-    +--------------+    replication     +--------------+
-    |    Active    |<-------------------|   Standby    |
-    | OVSDB-server |                    | OVSDB-server |
-    +--------------+                    +--------------+
-            |                                  |
-            |                                  |
-        +-------+                          +-------+
-        |  SoT  |                          |       |
-        | OVSDB |                          | OVSDB |
-        +-------+                          +-------+
-
-Setting Up The Replication
---------------------------
-
-To initiate the replication process, the standby server must be executed
-indicating the location of the active server via the command line option
-``--sync-from=server``, where server can take any form described in the
-ovsdb-client manpage and it must specify an active connection type (tcp, unix,
-ssl). This option will cause the standby server to attempt to send a monitor
-request to the active server in every main loop iteration, until the active
-server responds.
-
-When sending a monitor request the standby server is doing the following:
-
-1. Erase the content of the databases for which it is providing an RPC
-   interface.
-
-2. Open the jsonrpc channel to communicate with the active server.
-
-3. Fetch all the databases located in the active server.
-
-4. For each database with the same schema in both the active and standby
-   servers: construct and send a monitor request message specifying the tables
-   that will be monitored (i.e. all the tables on the database except the ones
-   blacklisted [*]).
-
-5. Set the standby database to the current state of the active database.
-
-Once the monitor request message is sent, the standby server will continuously
-receive notifications of changes occurring to the tables specified in the
-request. The process of handling these notifications is detailed in the next
-section.
-
-[*] A set of tables that will be excluded from replication can be configured as
-a blacklist of tables via the command line option
-``--sync-exclude-tables=db:table[,db:table]...``, where db corresponds to the
-database where the table resides.
-
-Replication Process
--------------------
-
-The replication process consists of handling the update notifications received
-in the standby server caused by the monitor request that was previously sent to
-the active server. In every loop iteration, the standby server attempts to
-receive a message from the active server which can be an error, an echo message
-(used to keep the connection alive) or an update notification. In case the
-message is a fatal error, the standby server will disconnect from the active
-without dropping the replicated data. If it is an echo message, the standby
-server will reply with an echo message as well. If the message is an update
-notification, the following process occurs:
-
-1. Create a new transaction.
-
-2. Get the ``<table-updates>`` object from the ``params`` member of the
-   notification.
-
-3. For each ``<table-update>`` in the ``<table-updates>`` object do:
-
-    1. For each ``<row-update>`` in ``<table-update>`` check what kind of
-       operation should be executed according to the following criteria
-       about the presence of the object members:
-
-       - If ``old`` member is not present, execute an insert operation using
-         ``<row>`` from the ``new`` member.
-
-       - If ``old`` member is present and ``new`` member is not present,
-         execute a delete operation using ``<row>`` from the ``old`` member.
-
-       - If both ``old`` and ``new`` members are present, execute an update
-         operation using ``<row>`` from the ``new`` member.
-
-4. Commit the transaction.
-
-   If an error occurs during the replication process, all replication is
-   restarted by resending a new monitor request as described in the section
-   "Setting up the replication".
-
-Runtime Management Commands
----------------------------
-
-Runtime management commands can be sent to a running standby server via
-ovs-appctl in order to configure the replication functionality. The available
-commands are the following.
-
-``ovsdb-server/set-remote-ovsdb-server {server}``
-  sets the name of the active server
-
-``ovsdb-server/get-remote-ovsdb-server``
-  gets the name of the active server
-
-``ovsdb-server/connect-remote-ovsdb-server``
-  causes the server to attempt to send a monitor request every main loop
-  iteration
-
-``ovsdb-server/disconnect-remote-ovsdb-server``
-  closes the jsonrpc channel to the active server and frees the memory
-  used for the replication configuration.
-
-``ovsdb-server/set-sync-exclude-tables {db:table,...}``
-  sets the tables list that will be excluded from being replicated
-
-``ovsdb-server/get-sync-excluded-tables``
-  gets the tables list that is currently excluded from replication
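The per-row decision in the "Replication Process" section of the removed OVSDB-replication.rst can be sketched in a few lines of Python. This is only an illustration of the `old`/`new` criteria, assuming the `<table-updates>` object has already been parsed from JSON into nested dicts (table name to row UUID to `<row-update>`). The function names and the tuple output format are hypothetical, and the real ovsdb-server is written in C.

```python
# Illustrative sketch only; function names and output format are hypothetical.
def classify_row_update(row_update):
    """Return (operation, row) for one <row-update> dict."""
    has_old = "old" in row_update
    has_new = "new" in row_update
    if not has_old:
        if not has_new:
            raise ValueError("row-update has neither 'old' nor 'new'")
        # No previous contents: the row was inserted.
        return "insert", row_update["new"]
    if not has_new:
        # Previous contents but no new contents: the row was deleted.
        return "delete", row_update["old"]
    # Both present: the row was modified.
    return "update", row_update["new"]


def plan_transaction(table_updates):
    """Flatten a <table-updates> dict into the operations for one transaction."""
    operations = []
    for table, table_update in table_updates.items():
        for uuid, row_update in table_update.items():
            operation, row = classify_row_update(row_update)
            operations.append((operation, table, uuid, row))
    return operations
```

In this sketch the returned list stands in for the single transaction that the standby server creates in step 1 and commits in step 4; an error raised while building it corresponds to the case where replication is restarted.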
diff --git a/Documentation/automake.mk b/Documentation/automake.mk

index 04afdcf9cf209adf0e9e27da95b6e12e1cb5d618..24ba1c9ea0c8eb918fdbfa68836535f9e79b8c16 100644 (file)
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -1,6 +1,5 @@
  docs += \
-       Documentation/group-selection-method-property.txt \
-       Documentation/OVSDB-replication.rst
+       Documentation/group-selection-method-property.txt
  
  EXTRA_DIST += \
         Documentation/_static/logo.png \
@@ -22,6 +21,16 @@ EXTRA_DIST += \
         Documentation/intro/install/xenserver.rst \
         Documentation/tutorials/index.rst \
         Documentation/topics/index.rst \
+       Documentation/topics/bonding.rst \
+       Documentation/topics/datapath.rst \
+       Documentation/topics/design.rst \
+       Documentation/topics/dpdk.rst \
+       Documentation/topics/high-availability.rst \
+       Documentation/topics/integration.rst \
+       Documentation/topics/openflow.rst \
+       Documentation/topics/ovsdb-replication.rst \
+       Documentation/topics/porting.rst \
+       Documentation/topics/windows.rst \
         Documentation/howto/index.rst \
         Documentation/howto/docker.rst \
         Documentation/howto/kvm.rst \
@@ -58,8 +67,7 @@ SPHINXBUILDDIR = $(srcdir)/Documentation/_build
  # Internal variables.
  PAPEROPT_a4 = -D latex_paper_size=a4
  PAPEROPT_letter = -D latex_paper_size=letter
-# TODO(stephenfin): Add '-W' flag here once we've integrated required docs
-ALLSPHINXOPTS = -d $(SPHINXBUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) $(SPHINXSRCDIR)
+ALLSPHINXOPTS = -W -d $(SPHINXBUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) $(SPHINXSRCDIR)
  
  .PHONY: htmldocs
  htmldocs:
diff --git a/Documentation/howto/openstack-containers.rst b/Documentation/howto/openstack-containers.rst

index f10f60e148b40e787bb923afd48ff6eea32e406a..692fe25e5649a6c1f59f9d2aee13ee2b91db25ae 100644 (file)
--- a/Documentation/howto/openstack-containers.rst
+++ b/Documentation/howto/openstack-containers.rst
@@ -45,10 +45,10 @@ example.
  
  * When VM-A is created on a hypervisor, its VIF gets added to the Open vSwitch
    integration bridge.  This creates a row in the Interface table of the
-  ``Open_vSwitch`` database.  As explained in the `integration guide`, the
-  vif-id associated with the VM network interface gets added in the
-  ``external_ids:iface-id`` column of the newly created row in the Interface
-  table.
+  ``Open_vSwitch`` database.  As explained in the :doc:`integration guide
+  </topics/integration>`, the vif-id associated with the VM network interface
+  gets added in the ``external_ids:iface-id`` column of the newly created row
+  in the Interface table.
  
  * Since VM-A belongs to a logical network, it gets an IP address.  This IP
    address is used to spawn containers (either manually or through container
diff --git a/Documentation/intro/install/netbsd.rst b/Documentation/intro/install/netbsd.rst

index 2b52eaf08a2f76196aa56d449d41c91cd644fe0a..b32da20081061ce31475b6cbf7e7aeaa6dfe1e63 100644 (file)
--- a/Documentation/intro/install/netbsd.rst
+++ b/Documentation/intro/install/netbsd.rst
@@ -58,4 +58,4 @@ As all executables installed with pkgsrc are placed in ``/usr/pkg/bin/``
  directory, it might be a good idea to add it to your PATH.
  
  Open vSwitch on NetBSD is currently "userspace switch" implementation in the
-sense described in :doc:`userspace` and the `porting guide`.
+sense described in :doc:`userspace` and :doc:`/topics/porting`.
diff --git a/Documentation/topics/bonding.rst b/Documentation/topics/bonding.rst
new file mode 100644
index 0000000..2f67cbb
--- /dev/null
+++ b/Documentation/topics/bonding.rst
@@ -0,0 +1,238 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+=======
+Bonding
+=======
+
+Bonding allows two or more interfaces (the "slaves") to share network traffic.
+From a high-level point of view, bonded interfaces act like a single port, but
+they have the bandwidth of multiple network devices, e.g. two 1 Gbps physical
+interfaces act like a single 2 Gbps interface.  Bonds also increase robustness:
+the bonded port does not go down as long as at least one of its slaves is up.
+
+In vswitchd, a bond always has at least two slaves (and may have more).  If a
+configuration error, etc. would cause a bond to have only one slave, the port
+becomes an ordinary port, not a bonded port, and none of the special features
+of bonded ports described in this section apply.
+
+There are many forms of bonding of which ovs-vswitchd implements only a few.
+The most complex bond ovs-vswitchd implements is called "source load balancing"
+or SLB bonding.  SLB bonding divides traffic among the slaves based on the
+Ethernet source address.  This is useful only if the traffic over the bond has
+multiple Ethernet source addresses, for example if network traffic from
+multiple VMs is multiplexed over the bond.
+
+.. note::
+
+   Most of the ovs-vswitchd implementation is in ``vswitchd/bridge.c``, so code
+   references below should be assumed to refer to that file except as otherwise
+   specified.
+
+
+Enabling and Disabling Slaves
+-----------------------------
+
+When a bond is created, a slave is initially enabled or disabled based on
+whether carrier is detected on the NIC (see ``iface_create()``).  After that, a
+slave is disabled if its carrier goes down for a period of time longer than the
+downdelay, and it is enabled if carrier comes up for longer than the updelay
+(see ``bond_link_status_update()``).  There is one exception where the updelay
+is skipped: if no slaves at all are currently enabled, then the first slave on
+which carrier comes up is enabled immediately.
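+
+The enable/disable rules above can be sketched as a small state machine.  This
+is a simplified illustration, not the code in ``vswitchd/bridge.c``: the
+``Slave`` class, the ``update()`` function and the millisecond timestamps are
+all assumptions made for the example.
+
+.. code-block:: python
+
+    from dataclasses import dataclass
+
+    @dataclass
+    class Slave:
+        enabled: bool
+        carrier: bool
+        changed_at: int = 0  # time (ms) at which carrier last changed
+
+    def update(slave, carrier, now, updelay, downdelay, any_enabled):
+        """Apply one carrier sample and return the new enabled state."""
+        if carrier != slave.carrier:
+            slave.carrier = carrier
+            slave.changed_at = now
+        stable_for = now - slave.changed_at
+        if slave.carrier and not slave.enabled:
+            # Skip the updelay when no slave at all is currently enabled.
+            if not any_enabled or stable_for > updelay:
+                slave.enabled = True
+        elif not slave.carrier and slave.enabled:
+            if stable_for > downdelay:
+                slave.enabled = False
+        return slave.enabled
+
+Here ``any_enabled`` says whether any slave in the bond is currently enabled;
+passing ``False`` models the exception in which the updelay is skipped.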
+
+The updelay should be set to a time longer than the STP forwarding delay of the
+physical switch to which the bond port is connected (if STP is enabled on that
+switch).  Otherwise, the slave will be enabled, and load may be shifted to it,
+before the physical switch starts forwarding packets on that port, which can
+cause some data to be "blackholed" for a time.  The exception for a single
+enabled slave does not cause any problem in this regard because when no slaves
+are enabled all output packets are blackholed anyway.
+
+When a slave becomes disabled, the vswitch immediately chooses a new output
+port for traffic that was destined for that slave (see
+``bond_enable_slave()``).  It also sends a "gratuitous learning packet",
+specifically a RARP, on the bond port (on the newly chosen slave) for each MAC
+address that the vswitch has learned on a port other than the bond (see
+``bond_send_learning_packets()``), to teach the physical switch that the new
+slave should be used in place of the one that is now disabled.  (This behavior
+probably makes sense only for a vswitch that has only one port (the bond)
+connected to a physical switch; vswitchd should probably provide a way to
+disable or configure it in other scenarios.)
+
+Bond Packet Input
+-----------------
+
+Bonding accepts unicast packets on any bond slave.  This can occasionally cause
+packet duplication for the first few packets sent to a given MAC, if the
+physical switch attached to the bond is flooding packets to that MAC because it
+has not yet learned the correct slave for that MAC.
+
+Bonding only accepts multicast (and broadcast) packets on a single bond slave
+(the "active slave") at any given time.  Multicast packets received on other
+slaves are dropped.  Otherwise, every multicast packet would be duplicated,
+once for every bond slave, because the physical switch attached to the bond
+will flood those packets.
+
+Bonding also drops received packets when the vswitch has learned that the
+packet's MAC is on a port other than the bond port itself.  This is because it
+is likely that the vswitch itself sent the packet out the bond port on a
+different slave and is now receiving the packet back.  This occurs when the
+packet is multicast or the physical switch has not yet learned the MAC and is
+flooding it.  However, the vswitch makes an exception to this rule for
+broadcast ARP replies, which indicate that the MAC has moved to another switch,
+probably due to VM migration.  (ARP replies are normally unicast, so this
+exception does not match normal ARP replies.  It will match the learning
+packets sent on bond fail-over.)
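+
+Taken together, the input rules above amount to a short admission check.  The
+sketch below is illustrative only: the function name, its boolean parameters
+and the order in which the checks run are assumptions, not a description of
+the vswitchd source.
+
+.. code-block:: python
+
+    def accept_on_bond(is_multicast, on_active_slave,
+                       src_learned_elsewhere, is_broadcast_arp_reply):
+        """Return True if a packet received on a bond slave is accepted."""
+        # Multicast and broadcast are accepted only on the active slave.
+        if is_multicast and not on_active_slave:
+            return False
+        # Drop probable reflections of our own traffic, except for
+        # broadcast ARP replies, which signal that a MAC has moved.
+        if src_learned_elsewhere and not is_broadcast_arp_reply:
+            return False
+        return True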
+
+The active slave is simply the first slave to be enabled after the bond is
+created (see ``bond_choose_active_iface()``).  If the active slave is disabled,
+then a new active slave is chosen among the slaves that remain enabled.
+Currently, due to the way that configuration works, this tends to be the
+remaining slave whose interface name is first alphabetically, but this is by no
+means guaranteed.
+
+Bond Packet Output
+------------------
+
+When a packet is sent out a bond port, the bond slave actually used is selected
+based on the packet's source MAC and VLAN tag (see ``choose_output_iface()``).
+In particular, the source MAC and VLAN tag are hashed into one of 256 values,
+and that value is looked up in a hash table (the "bond hash") kept in the
+``bond_hash`` member of struct port.  The hash table entry identifies a bond
+slave.  If no bond slave has yet been chosen for that hash table entry,
+vswitchd chooses one arbitrarily.
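+
+As a rough illustration of the lookup just described, the sketch below hashes
+a source MAC and VLAN into one of 256 buckets and assigns a slave to a bucket
+on first use.  The hash function (CRC-32 from Python's ``zlib``) and the
+"first slave in the list" policy are stand-ins chosen for the example;
+vswitchd uses its own hash and picks a slave arbitrarily.
+
+.. code-block:: python
+
+    import zlib
+
+    BOND_BUCKETS = 256
+
+    def bucket_for(src_mac: bytes, vlan: int) -> int:
+        """Map a (source MAC, VLAN) pair to a bucket index in 0..255."""
+        data = src_mac + vlan.to_bytes(2, "big")
+        return zlib.crc32(data) % BOND_BUCKETS
+
+    def choose_slave(bond_hash, slaves, src_mac, vlan):
+        """Return the slave for a packet, filling the bucket if empty."""
+        index = bucket_for(src_mac, vlan)
+        if bond_hash[index] is None:
+            bond_hash[index] = slaves[0]  # arbitrary choice for the sketch
+        return bond_hash[index]
+
+Because the bucket keeps its assignment once made, all later packets from the
+same source MAC and VLAN leave through the same slave until a rebalance moves
+the bucket.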
+
+Every 10 seconds, vswitchd rebalances the bond slaves (see
+``bond_rebalance_port()``).  To rebalance, vswitchd examines the statistics for
+the number of bytes transmitted by each slave over approximately the past
+minute, with data sent more recently weighted more heavily than data sent less
+recently.  It considers each of the slaves in order from most-loaded to
+least-loaded.  If highly loaded slave H is significantly more heavily loaded
+than the least-loaded slave L, and slave H carries at least two hashes, then
+vswitchd shifts one of H's hashes to L.  However, vswitchd will only shift a
+hash from H to L if it will decrease the ratio of the load between H and L by
+at least 0.1.
+
+Currently, "significantly more loaded" means that H must carry at least 1 Mbps
+more traffic, and that traffic must be at least 3% greater than L's.
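+
+The thresholds above can be read as the following predicate.  This is one
+plausible interpretation, not the vswitchd code: the function names are
+hypothetical, loads are assumed to be in bits per second, and "the ratio of
+the load between H and L" is assumed to mean H's load divided by L's.
+
+.. code-block:: python
+
+    MIN_DELTA_BPS = 1_000_000  # "at least 1 Mbps more traffic"
+    MIN_EXCESS = 0.03          # "at least 3% greater than L's"
+    MIN_RATIO_DROP = 0.1       # required decrease in the H/L load ratio
+
+    def significantly_more_loaded(h_bps, l_bps):
+        return (h_bps - l_bps >= MIN_DELTA_BPS
+                and h_bps >= l_bps * (1 + MIN_EXCESS))
+
+    def should_shift(h_bps, l_bps, hash_bps, h_hashes):
+        """Decide whether to move one hash carrying hash_bps from H to L."""
+        if h_hashes < 2 or not significantly_more_loaded(h_bps, l_bps):
+            return False
+        if l_bps == 0:
+            # Assumption: any non-empty hash helps a completely idle slave.
+            return hash_bps > 0
+        old_ratio = h_bps / l_bps
+        new_ratio = (h_bps - hash_bps) / (l_bps + hash_bps)
+        return old_ratio - new_ratio >= MIN_RATIO_DROP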
+
+Bond Balance Modes
+------------------
+
+Each bond balancing mode has different considerations, described below.
+
+LACP Bonding
+~~~~~~~~~~~~
+
+LACP bonding requires the remote switch to implement LACP, but it is otherwise
+very simple in that, after LACP negotiation is complete, there is no need for
+special handling of received packets.
+
+Several of the physical switches that support LACP block all traffic for ports
+that are configured to use LACP, until LACP is negotiated with the host. When
+configuring a LACP bond on an OVS host (e.g. XenServer), this means that there
+will be an interruption of the network connectivity between the time the ports
+on the physical switch and the bond on the OVS host are configured. The
+interruption may be relatively long, if different people are responsible for
+managing the switches and the OVS host.
+
+Such a connectivity failure can be avoided if LACP is configured on the OVS
+host before the physical switch, with the OVS host falling back to another
+bond mode (active-backup) until the physical switch LACP configuration is
+complete.  The ``lacp-fallback-ab`` option provides this behavior in Open
+vSwitch.
+
+Active Backup Bonding
+~~~~~~~~~~~~~~~~~~~~~
+
+Active Backup bonds send all traffic out one "active" slave until that slave
+becomes unavailable.  Since they are significantly less complicated than SLB
+bonds, they are preferred when LACP is not an option.  Additionally, they are
+the only bond mode which supports attaching each slave to a different upstream
+switch.
+
+SLB Bonding
+~~~~~~~~~~~
+
+SLB bonding allows a limited form of load balancing without the remote switch's
+knowledge or cooperation.  The basics of SLB are simple.  SLB assigns each
+source MAC+VLAN pair to a link and transmits all packets from that MAC+VLAN
+through that link.  Learning in the remote switch causes it to send packets to
+that MAC+VLAN through the same link.
+
+SLB bonding has the following complications:
+
+0. When the remote switch has not learned the MAC for the destination of a
+   unicast packet and hence floods the packet to all of the links on the SLB
+   bond, Open vSwitch will forward duplicate packets, one per link, to each
+   other switch port.
+
+   Open vSwitch does not solve this problem.
+
+1. When the remote switch receives a multicast or broadcast packet from a port
+   not on the SLB bond, it will forward it to all of the links in the SLB bond.
+   This would cause packet duplication if not handled specially.
+
+   Open vSwitch avoids packet duplication by accepting multicast and broadcast
+   packets on only the active slave, and dropping multicast and broadcast
+   packets on all other slaves.
+
+2. When Open vSwitch forwards a multicast or broadcast packet to a link in the
+   SLB bond other than the active slave, the remote switch will forward it to
+   all of the other links in the SLB bond, including the active slave.  Without
+   special handling, this would mean that Open vSwitch would forward a second
+   copy of the packet to each switch port (other than the bond), including the
+   port that originated the packet.
+
+   Open vSwitch deals with this case by dropping packets received on any SLB
+   bonded link that have a source MAC+VLAN that has been learned on any other
+   port.  (This means that SLB as implemented in Open vSwitch relies critically
+   on MAC learning.  Notably, SLB is incompatible with the "flood_vlans"
+   feature.)
+
+3. Suppose that a MAC+VLAN moves to an SLB bond from another port (e.g. when a
+   VM is migrated from this hypervisor to a different one).  Without additional
+   special handling, Open vSwitch will not notice until the MAC learning entry
+   expires, up to 60 seconds later as a consequence of rule #2.
+
+   Open vSwitch avoids a 60-second delay by listening for gratuitous ARPs,
+   which VMs commonly emit upon migration.  As an exception to rule #2, a
+   gratuitous ARP received on an SLB bond is not dropped and updates the MAC
+   learning table in the usual way.  (If a move does not trigger a gratuitous
+   ARP, or if the gratuitous ARP is lost in the network, then a 60-second delay
+   still occurs.)
+
+4. Suppose that a MAC+VLAN moves from an SLB bond to another port (e.g. when a
+   VM is migrated from a different hypervisor to this one), that the MAC+VLAN
+   emits a gratuitous ARP, and that Open vSwitch forwards that gratuitous ARP
+   to a link in the SLB bond other than the active slave.  The remote switch
+   will forward the gratuitous ARP to all of the other links in the SLB bond,
+   including the active slave.  Without additional special handling, this would
+   mean that Open vSwitch would learn that the MAC+VLAN was located on the SLB
+   bond, as a consequence of rule #3.
+
+   Open vSwitch avoids this problem by "locking" the MAC learning table entry
+   for a MAC+VLAN from which a gratuitous ARP was received from a non-SLB bond
+   port.  For 5 seconds, a locked MAC learning table entry will not be updated
+   based on a gratuitous ARP received on an SLB bond.
diff --git a/Documentation/topics/datapath.rst b/Documentation/topics/datapath.rst
new file mode 100644
index 0000000..47e0e23
--- /dev/null
+++ b/Documentation/topics/datapath.rst
@@ -0,0 +1,265 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+=======================================
+Open vSwitch Datapath Development Guide
+=======================================
+
+The Open vSwitch kernel module allows flexible userspace control over
+flow-level packet processing on selected network devices.  It can be used to
+implement a plain Ethernet switch, network device bonding, VLAN processing,
+network access control, flow-based network control, and so on.
+
+The kernel module implements multiple "datapaths" (analogous to bridges), each
+of which can have multiple "vports" (analogous to ports within a bridge).  Each
+datapath also has associated with it a "flow table" that userspace populates
+with "flows" that map from keys based on packet headers and metadata to sets of
+actions.  The most common action forwards the packet to another vport; other
+actions are also implemented.
+
+When a packet arrives on a vport, the kernel module processes it by extracting
+its flow key and looking it up in the flow table.  If there is a matching flow,
+it executes the associated actions.  If there is no match, it queues the packet
+to userspace for processing (as part of its processing, userspace will likely
+set up a flow to handle further packets of the same type entirely in-kernel).
+
+Flow Key Compatibility
+----------------------
+
+Network protocols evolve over time.  New protocols become important and
+existing protocols lose their prominence.  For the Open vSwitch kernel module
+to remain relevant, it must be possible for newer versions to parse additional
+protocols as part of the flow key.  It might even be desirable, someday, to
+drop support for parsing protocols that have become obsolete.  Therefore, the
+Netlink interface to Open vSwitch is designed to allow carefully written
+userspace applications to work with any version of the flow key, past or
+future.
+
+To support this forward and backward compatibility, whenever the kernel module
+passes a packet to userspace, it also passes along the flow key that it parsed
+from the packet.  Userspace then extracts its own notion of a flow key from the
+packet and compares it against the kernel-provided version:
+
+- If userspace's notion of the flow key for the packet matches the kernel's,
+  then nothing special is necessary.
+
+- If the kernel's flow key includes more fields than the userspace version of
+  the flow key, for example if the kernel decoded IPv6 headers but userspace
+  stopped at the Ethernet type (because it does not understand IPv6), then
+  again nothing special is necessary.  Userspace can still set up a flow in the
+  usual way, as long as it uses the kernel-provided flow key to do it.
+
+- If the userspace flow key includes more fields than the kernel's, for example
+  if userspace decoded an IPv6 header but the kernel stopped at the Ethernet
+  type, then userspace can forward the packet manually, without setting up a
+  flow in the kernel.  This case is bad for performance because every packet
+  that the kernel considers part of the flow must go to userspace, but the
+  forwarding behavior is correct.  (If userspace can determine that the values
+  of the extra fields would not affect forwarding behavior, then it could set
+  up a flow anyway.)
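+
+The three cases above can be summarized in a few lines.  In this sketch a
+flow key is modeled simply as the set of attribute names that each side
+parsed; the function name and the returned strings are hypothetical labels,
+not part of any Open vSwitch API.
+
+.. code-block:: python
+
+    def plan(kernel_key: set, user_key: set) -> str:
+        """Pick a strategy from the fields each side was able to parse."""
+        if user_key <= kernel_key:
+            # Same fields, or the kernel parsed more: install a flow,
+            # always using the kernel-provided key.
+            return "install-flow-with-kernel-key"
+        # Userspace parsed fields the kernel did not: forward by hand.
+        # (Keys that merely differ also land here, as the safe choice.)
+        return "forward-in-userspace"
+
+For example, ``plan({"eth", "eth_type"}, {"eth", "eth_type", "ipv6"})``
+selects manual forwarding, which is correct but slower.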
+
+How flow keys evolve over time is important to making this work, so
+the following sections go into detail.
+
+Flow Key Format
+---------------
+
+A flow key is passed over a Netlink socket as a sequence of Netlink attributes.
+Some attributes represent packet metadata, defined as any information about a
+packet that cannot be extracted from the packet itself, e.g. the vport on which
+the packet was received.  Most attributes, however, are extracted from headers
+within the packet, e.g. source and destination addresses from Ethernet, IP, or
+TCP headers.
+
+The ``<linux/openvswitch.h>`` header file defines the exact format of the flow
+key attributes.  For informal explanatory purposes here, we write them as
+comma-separated strings, with parentheses indicating arguments and nesting.
+For example, the following could represent a flow key corresponding to a TCP
+packet that arrived on vport 1::
+
+    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
+    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
+    frag=no), tcp(src=49163, dst=80)
+
+Often we ellipsize arguments not important to the discussion, e.g.::
+
+    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
+
+Wildcarded Flow Key Format
+--------------------------
+
+A wildcarded flow is described with two sequences of Netlink attributes passed
+over the Netlink socket: a flow key, exactly as described above, and an
+optional corresponding flow mask.
+
+A wildcarded flow can represent a group of exact match flows. Each ``1`` bit
+in the mask specifies an exact match with the corresponding bit in the flow key.
+A ``0`` bit specifies a don't care bit, which will match either a ``1`` or
+``0`` bit of an incoming packet. Using a wildcarded flow can improve the flow
+set up rate by reducing the number of new flows that need to be processed by
+the user space program.
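+
+The mask semantics can be shown on a single field treated as an integer.  The
+helper below is a minimal sketch with a hypothetical name; real flow keys and
+masks are sequences of Netlink attributes rather than one integer.
+
+.. code-block:: python
+
+    def matches(packet_bits: int, key: int, mask: int) -> bool:
+        """True if packet_bits agrees with key on every bit set in mask."""
+        return (packet_bits & mask) == (key & mask)
+
+    # Match any IPv4 destination in 172.18.0.0/16.
+    KEY = 0xAC120034   # 172.18.0.52
+    MASK = 0xFFFF0000  # upper 16 bits exact, lower 16 bits "don't care"
+
+With these values, ``matches(0xAC12FFFF, KEY, MASK)`` is true because only the
+masked upper bits are compared, while ``matches(0xAC130034, KEY, MASK)`` is
+false.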
+
+Support for the mask Netlink attribute is optional for both the kernel and user
+space program. The kernel can ignore the mask attribute, installing an exact
+match flow, or reduce the number of don't care bits in the kernel to less than
+what was specified by the user space program. In this case, variations in bits
+that the kernel does not implement will simply result in additional flow
+setups.  The kernel module will also work with user space programs that neither
+support nor supply flow mask attributes.
+
+Since the kernel may ignore or modify wildcard bits, it can be difficult for
+the userspace program to know exactly what matches are installed. There are two
+possible approaches: reactively install flows as they miss the kernel flow
+table (and therefore not attempt to determine wildcard changes at all) or use
+the kernel's response messages to determine the installed wildcards.
+
+When interacting with userspace, the kernel should maintain the match portion
+of the key exactly as originally installed. This provides a handle to
+identify the flow for all future operations. However, when reporting the mask
+of an installed flow, the mask should include any restrictions imposed by the
+kernel.
+
+The behavior when using overlapping wildcarded flows is undefined. It is the
+responsibility of the user space program to ensure that any incoming packet can
+match at most one flow, wildcarded or not. The current implementation performs
+best-effort detection of overlapping wildcarded flows and may reject some but
+not all of them. However, this behavior may change in future versions.
+
+Unique Flow Identifiers
+-----------------------
+
+An alternative to using the original match portion of a key as the handle for
+flow identification is a unique flow identifier, or "UFID". UFIDs are optional
+for both the kernel and user space program.
+
+User space programs that support UFID are expected to provide it during flow
+setup in addition to the flow, then refer to the flow using the UFID for all
+future operations. The kernel is not required to index flows by the original
+flow key if a UFID is specified.
+
+Basic Rule for Evolving Flow Keys
+---------------------------------
+
+Some care is needed to really maintain forward and backward compatibility for
+applications that follow the rules listed under "Flow Key Compatibility" above.
+
+The basic rule is obvious:
+
+    New network protocol support must only supplement existing flow key
+    attributes.  It must not change the meaning of already defined flow key
+    attributes.
+
+This rule does have less-obvious consequences so it is worth working through a
+few examples.  Suppose, for example, that the kernel module did not already
+implement VLAN parsing.  Instead, it just interpreted the 802.1Q TPID
+(``0x8100``) as the Ethertype then stopped parsing the packet.  The flow key
+for any packet with an 802.1Q header would look essentially like this, ignoring
+metadata::
+
+    eth(...), eth_type(0x8100)
+
+Naively, to add VLAN support, it makes sense to add a new "vlan" flow key
+attribute to contain the VLAN tag, then continue to decode the encapsulated
+headers beyond the VLAN tag using the existing field definitions.  With this
+change, a TCP packet in VLAN 10 would have a flow key much like this::
+
+    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
+
+But this change would negatively affect a userspace application that has not
+been updated to understand the new "vlan" flow key attribute.  The application
+could, following the flow compatibility rules above, ignore the "vlan"
+attribute that it does not understand and therefore assume that the flow
+contained IP packets.  This is a bad assumption (the flow only contains IP
+packets if one parses and skips over the 802.1Q header) and it could cause the
+application's behavior to change across kernel versions even though it follows
+the compatibility rules.
+
+The solution is to use a set of nested attributes.  This is, for example, why
+802.1Q support uses nested attributes.  A TCP packet in VLAN 10 is actually
+expressed as::
+
+    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
+    ip(proto=6, ...), tcp(...))
+
+Notice how the ``eth_type``, ``ip``, and ``tcp`` flow key attributes are nested
+inside the ``encap`` attribute.  Thus, an application that does not understand
+the ``vlan`` key will not see any of those attributes and therefore will not
+misinterpret them.  (Also, the outer ``eth_type`` is still ``0x8100``, not
+changed to ``0x0800``.)
+
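+The effect of nesting on an application that predates VLAN support can be
+sketched as follows.  The list-of-tuples encoding, the ``visible()`` helper
+and the ``OLD_APP`` set are assumptions made for illustration; they are not
+the Netlink representation.
+
+.. code-block:: python
+
+    FLAT = [("eth", {}), ("vlan", {"vid": 10, "pcp": 0}),
+            ("eth_type", 0x0800), ("ip", {"proto": 6}), ("tcp", {})]
+
+    NESTED = [("eth", {}), ("eth_type", 0x8100),
+              ("vlan", {"vid": 10, "pcp": 0}),
+              ("encap", [("eth_type", 0x0800), ("ip", {"proto": 6}),
+                         ("tcp", {})])]
+
+    OLD_APP = {"eth", "eth_type", "ip", "tcp"}  # predates vlan and encap
+
+    def visible(key, known):
+        """Return the top-level attributes that an application understands."""
+        return {name: value for name, value in key if name in known}
+
+With ``FLAT``, the old application sees ``eth_type`` 0x0800 and an ``ip``
+attribute and wrongly concludes that the flow carries plain IP.  With
+``NESTED``, it sees only ``eth`` and ``eth_type`` 0x8100.
+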
+Handling Malformed Packets
+--------------------------
+
+Don't drop packets in the kernel for malformed protocol headers, bad checksums,
+etc.  This would prevent userspace from implementing a simple Ethernet switch
+that forwards every packet.
+
+Instead, in such a case, include an attribute with "empty" content.  It doesn't
+matter if the empty content could be valid protocol values, as long as those
+values are rarely seen in practice, because userspace can always have all
+packets with those values sent to userspace and handle them individually.
+
+For example, consider a packet that contains an IP header that indicates
+protocol 6 for TCP, but which is truncated just after the IP header, so that
+the TCP header is missing.  The flow key for this packet would include a tcp
+attribute with all-zero ``src`` and ``dst``, like this::
+
+    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
+
+As another example, consider a packet with an Ethernet type of 0x8100,
+indicating that a VLAN TCI should follow, but which is truncated just after the
+Ethernet type.  The flow key for this packet would include an all-zero-bits
+vlan and an empty encap attribute, like this::
+
+    eth(...), eth_type(0x8100), vlan(0), encap()
+
+Unlike a TCP packet with source and destination ports 0, an all-zero-bits VLAN
+TCI is not that rare, so the CFI bit (aka VLAN_TAG_PRESENT inside the kernel)
+is ordinarily set in a vlan attribute expressly to allow this situation to be
+distinguished.  Thus, the flow key in this second example unambiguously
+indicates a missing or malformed VLAN TCI.
+
+Other Rules
+-----------
+
+The other rules for flow keys are much less subtle:
+
+- Duplicate attributes are not allowed at a given nesting level.
+
+- Ordering of attributes is not significant.
+
+- When the kernel sends a given flow key to userspace, it always composes it
+  the same way.  This allows userspace to hash and compare entire flow keys
+  that it may not be able to fully interpret.
+
+Coding Rules
+------------
+
+Implement the headers and code for compatibility with older kernels in the
+``linux/compat/`` directory.  All public functions should be exported using
+the ``EXPORT_SYMBOL`` macro.  A public function that replaces a same-named
+kernel function should be prefixed with ``rpl_``.  Otherwise, the function
+should be prefixed with ``ovs_``.  In special cases where it is not possible
+to follow this rule (e.g., the ``pskb_expand_head()`` function), the function
+name must be added to ``linux/compat/build-aux/export-check-whitelist``;
+otherwise, the compilation check ``check-export-symbol`` will fail.
diff --git a/Documentation/topics/design.rst b/Documentation/topics/design.rst
new file mode 100644
index 0000000..adc407a
--- /dev/null
+++ b/Documentation/topics/design.rst
@@ -0,0 +1,1163 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+================================
+Design Decisions In Open vSwitch
+================================
+
+This document describes design decisions that went into implementing Open
+vSwitch.  While we believe these to be reasonable decisions, it is impossible
+to predict how Open vSwitch will be used in all environments.  Understanding
+assumptions made by Open vSwitch is critical to a successful deployment.  The
+end of this document contains contact information that can be used to let us
+know how we can make Open vSwitch more generally useful.
+
+Asynchronous Messages
+---------------------
+
+Over time, Open vSwitch has added many knobs that control whether a given
+controller receives OpenFlow asynchronous messages.  This section describes how
+all of these features interact.
+
+First, a service controller never receives any asynchronous messages unless it
+changes its ``miss_send_len`` from the service controller default of zero in
+one of the following ways:
+
+- Sending an ``OFPT_SET_CONFIG`` message with nonzero ``miss_send_len``.
+
+- Sending any ``NXT_SET_ASYNC_CONFIG`` message: as a side effect, this message
+  changes the ``miss_send_len`` to ``OFP_DEFAULT_MISS_SEND_LEN`` (128) for
+  service controllers.
+
+Second, ``OFPT_FLOW_REMOVED`` and ``NXT_FLOW_REMOVED`` messages are generated
+only if the flow that was removed had the ``OFPFF_SEND_FLOW_REM`` flag set.
+
+Third, ``OFPT_PACKET_IN`` and ``NXT_PACKET_IN`` messages are sent only to
+OpenFlow controller connections that have the correct connection ID (see
+``struct nx_controller_id`` and ``struct nx_action_controller``):
+
+- For packet-in messages generated by a ``NXAST_CONTROLLER`` action, the
+  controller ID specified in the action.
+
+- For other packet-in messages, controller ID zero.  (This is the default ID
+  when an OpenFlow controller does not configure one.)
+
+Finally, Open vSwitch consults a per-connection table indexed by the message
+type, reason code, and current role.  The following table shows how this table
+is initialized by default when an OpenFlow connection is made.  An entry
+labeled ``yes`` means that the message is sent; an entry labeled ``---`` means
+that the message is suppressed.
+
+.. table:: ``OFPT_PACKET_IN`` / ``NXT_PACKET_IN``
+
+  =========================================== ======= =====
+                                              master/
+           message and reason code            other   slave
+  =========================================== ======= =====
+  ``OFPR_NO_MATCH``                             yes    ---
+  ``OFPR_ACTION``                               yes    ---
+  ``OFPR_INVALID_TTL``                          ---    ---
+  ``OFPR_ACTION_SET`` (OF1.4+)                  yes    ---
+  ``OFPR_GROUP`` (OF1.4+)                       yes    ---
+  =========================================== ======= =====
+
+.. table:: ``OFPT_FLOW_REMOVED`` / ``NXT_FLOW_REMOVED``
+
+  =========================================== ======= =====
+                                              master/
+           message and reason code            other   slave
+  =========================================== ======= =====
+  ``OFPRR_IDLE_TIMEOUT``                        yes    ---
+  ``OFPRR_HARD_TIMEOUT``                        yes    ---
+  ``OFPRR_DELETE``                              yes    ---
+  ``OFPRR_GROUP_DELETE`` (OF1.4+)               yes    ---
+  ``OFPRR_METER_DELETE`` (OF1.4+)               yes    ---
+  ``OFPRR_EVICTION`` (OF1.4+)                   yes    ---
+  =========================================== ======= =====
+
+.. table:: ``OFPT_PORT_STATUS``
+
+  =========================================== ======= =====
+                                              master/
+           message and reason code            other   slave
+  =========================================== ======= =====
+  ``OFPPR_ADD``                                 yes    yes
+  ``OFPPR_DELETE``                              yes    yes
+  ``OFPPR_MODIFY``                              yes    yes
+  =========================================== ======= =====
+
+.. table:: ``OFPT_ROLE_REQUEST`` / ``OFPT_ROLE_REPLY`` (OF1.4+)
+
+  =========================================== ======= =====
+                                              master/
+           message and reason code            other   slave
+  =========================================== ======= =====
+  ``OFPCRR_MASTER_REQUEST``                     ---    ---
+  ``OFPCRR_CONFIG``                             ---    ---
+  ``OFPCRR_EXPERIMENTER``                       ---    ---
+  =========================================== ======= =====
+
+.. table:: ``OFPT_TABLE_STATUS`` (OF1.4+)
+
+  =========================================== ======= =====
+                                              master/
+           message and reason code            other   slave
+  =========================================== ======= =====
+  ``OFPTR_VACANCY_DOWN``                        ---    ---
+  ``OFPTR_VACANCY_UP``                          ---    ---
+  =========================================== ======= =====
+
+
+.. table:: ``OFPT_REQUESTFORWARD`` (OF1.4+)
+
+  =========================================== ======= =====
+                                              master/
+           message and reason code            other   slave
+  =========================================== ======= =====
+  ``OFPRFR_GROUP_MOD``                          ---    ---
+  ``OFPRFR_METER_MOD``                          ---    ---
+  =========================================== ======= =====
+
+The ``NXT_SET_ASYNC_CONFIG`` message directly sets all of the values in this
+table for the current connection.  The ``OFPC_INVALID_TTL_TO_CONTROLLER`` bit
+in the ``OFPT_SET_CONFIG`` message controls the setting for
+``OFPR_INVALID_TTL`` for the "master" role.
+
+``OFPAT_ENQUEUE``
+-----------------
+
+The OpenFlow 1.0 specification requires the output port of the
+``OFPAT_ENQUEUE`` action to "refer to a valid physical port (i.e. <
+``OFPP_MAX``) or ``OFPP_IN_PORT``".  Although ``OFPP_LOCAL`` is not less than
+``OFPP_MAX``, it is an 'internal' port which can have QoS applied to it in
+Linux.  Since we allow the ``OFPAT_ENQUEUE`` to apply to 'internal' ports whose
+port numbers are less than ``OFPP_MAX``, we interpret ``OFPP_LOCAL`` as a
+physical port and support ``OFPAT_ENQUEUE`` on it as well.
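+
+The rule above can be sketched as a small predicate.  This is only an
+illustration, not the check that Open vSwitch performs internally: the helper
+name ``enqueue_port_is_valid`` is hypothetical, and the port constants are
+copied from the OpenFlow 1.0 ``openflow.h``.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reserved port numbers as defined by OpenFlow 1.0. */
#define OFPP_MAX     0xff00
#define OFPP_IN_PORT 0xfff8
#define OFPP_LOCAL   0xfffe

/* Hypothetical helper: returns true if 'port' is acceptable as the output
 * port of OFPAT_ENQUEUE under the interpretation described above, that is,
 * a physical port, OFPP_IN_PORT, or OFPP_LOCAL. */
bool
enqueue_port_is_valid(uint16_t port)
{
    return port < OFPP_MAX || port == OFPP_IN_PORT || port == OFPP_LOCAL;
}
```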
+
+``OFPT_FLOW_MOD``
+-----------------
+
+The OpenFlow specification for the behavior of ``OFPT_FLOW_MOD`` is confusing.
+The following tables summarize the Open vSwitch implementation of its behavior
+in the following categories:
+
+"match on priority"
+  Whether the ``flow_mod`` acts only on flows whose priority matches that
+  included in the ``flow_mod`` message.
+
+"match on out_port"
+  Whether the ``flow_mod`` acts only on flows that output to the out_port
+  included in the flow_mod message (if out_port is not ``OFPP_NONE``).
+  OpenFlow 1.1 and later have a similar feature (not listed separately here)
+  for ``out_group``.
+
+"match on flow_cookie":
+  Whether the ``flow_mod`` acts only on flows whose ``flow_cookie`` matches an
+  optional controller-specified value and mask.
+
+"updates flow_cookie":
+  Whether the ``flow_mod`` changes the ``flow_cookie`` of the flow or flows
+  that it matches to the ``flow_cookie`` included in the ``flow_mod`` message.
+
+"updates ``OFPFF_SEND_FLOW_REM``":
+  Whether the ``flow_mod`` changes the ``OFPFF_SEND_FLOW_REM`` flag of the flow
+  or flows that it matches to the setting included in the flags of the
+  ``flow_mod`` message.
+
+"honors ``OFPFF_CHECK_OVERLAP``":
+  Whether the ``OFPFF_CHECK_OVERLAP`` flag in the flow_mod is significant.
+
+"updates ``idle_timeout``" and "updates ``hard_timeout``":
+  Whether the ``idle_timeout`` and ``hard_timeout`` in the ``flow_mod``,
+  respectively, have an effect on the flow or flows matched by the
+  ``flow_mod``.
+
+"resets idle timer":
+  Whether the ``flow_mod`` resets the per-flow timer that measures how long a
+  flow has been idle.
+
+"resets hard timer":
+  Whether the ``flow_mod`` resets the per-flow timer that measures how long it
+  has been since a flow was modified.
+
+"zeros counters":
+  Whether the ``flow_mod`` resets per-flow packet and byte counters to zero.
+
+"may add a new flow":
+  Whether the ``flow_mod`` may add a new flow to the flow table.  (Obviously
+  this is always true for "add" commands but in some OpenFlow versions "modify"
+  and "modify-strict" can also add new flows.)
+
+"sends ``flow_removed`` message":
+  Whether the flow_mod generates a flow_removed message for the flow or flows
+  that it affects.
+
+An entry labeled ``yes`` means that the flow mod type does have the indicated
+behavior, ``---`` means that it does not, an empty cell means that the property
+is not applicable, and other values are explained below the table.
+
+OpenFlow 1.0
+~~~~~~~~~~~~
+
+================================ === ====== ====== ====== ======
+                                            MODIFY        DELETE
+RULE                             ADD MODIFY STRICT DELETE STRICT
+================================ === ====== ====== ====== ======
+match on ``priority``            yes  ---    yes    ---    yes
+match on ``out_port``            ---  ---    ---    yes    yes
+match on ``flow_cookie``         ---  ---    ---    ---    ---
+match on ``table_id``            ---  ---    ---    ---    ---
+controller chooses ``table_id``  ---  ---    ---
+updates ``flow_cookie``          yes  yes    yes
+updates ``OFPFF_SEND_FLOW_REM``  yes   +      +
+honors ``OFPFF_CHECK_OVERLAP``   yes   +      +
+updates ``idle_timeout``         yes   +      +
+updates ``hard_timeout``         yes   +      +
+resets idle timer                yes   +      +
+resets hard timer                yes  yes    yes
+zeros counters                   yes   +      +
+may add a new flow               yes  yes    yes
+sends ``flow_removed`` message   ---  ---    ---     %      %
+================================ === ====== ====== ====== ======
+
+where:
+
+``+``
+  "modify" and "modify-strict" only take these actions when they create a new
+  flow, not when they update an existing flow.
+
+``%``
+  "delete" and "delete_strict" generate a flow_removed message if the deleted
+  flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set.  (Each controller
+  can separately control whether it wants to receive the generated messages.)
+
+OpenFlow 1.1
+~~~~~~~~~~~~
+
+OpenFlow 1.1 makes these changes:
+
+- The controller now must specify the ``table_id`` of the flow match searched
+  and into which a flow may be inserted.  Behavior for a ``table_id`` of 255 is
+  undefined.
+
+- A ``flow_mod``, except an "add", can now match on the ``flow_cookie``.
+
+- When a ``flow_mod`` matches on the ``flow_cookie``, "modify" and
+  "modify-strict" never insert a new flow.
+
+================================ === ====== ====== ====== ======
+                                            MODIFY        DELETE
+RULE                             ADD MODIFY STRICT DELETE STRICT
+================================ === ====== ====== ====== ======
+match on ``priority``            yes  ---    yes    ---    yes
+match on ``out_port``            ---  ---    ---    yes    yes
+match on ``flow_cookie``         ---  yes    yes    yes    yes
+match on ``table_id``            yes  yes    yes    yes    yes
+controller chooses ``table_id``  yes  yes    yes
+updates ``flow_cookie``          yes  ---    ---
+updates ``OFPFF_SEND_FLOW_REM``  yes   +      +
+honors ``OFPFF_CHECK_OVERLAP``   yes   +      +
+updates ``idle_timeout``         yes   +      +
+updates ``hard_timeout``         yes   +      +
+resets idle timer                yes   +      +
+resets hard timer                yes  yes    yes
+zeros counters                   yes   +      +
+may add a new flow               yes   #      #
+sends ``flow_removed`` message   ---  ---    ---     %      %
+================================ === ====== ====== ====== ======
+
+where:
+
+``+``
+  "modify" and "modify-strict" only take these actions when they create a new
+  flow, not when they update an existing flow.
+
+``%``
+  "delete" and "delete_strict" generate a flow_removed message if the deleted
+  flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set.  (Each controller
+  can separately control whether it wants to receive the generated messages.)
+
+``#``
+  "modify" and "modify-strict" only add a new flow if the flow_mod does not
+  match on any bits of the flow cookie.
+
+OpenFlow 1.2
+~~~~~~~~~~~~
+
+OpenFlow 1.2 makes these changes:
+
+- Only "add" commands ever add flows; "modify" and "modify-strict" never do.
+
+- A new flag ``OFPFF_RESET_COUNTS`` now controls whether "modify" and
+  "modify-strict" reset counters, whereas previously they never reset counters
+  (except when they inserted a new flow).
+
+================================ === ====== ====== ====== ======
+                                            MODIFY        DELETE
+RULE                             ADD MODIFY STRICT DELETE STRICT
+================================ === ====== ====== ====== ======
+match on ``priority``            yes  ---    yes    ---    yes
+match on ``out_port``            ---  ---    ---    yes    yes
+match on ``flow_cookie``         ---  yes    yes    yes    yes
+match on ``table_id``            yes  yes    yes    yes    yes
+controller chooses ``table_id``  yes  yes    yes
+updates ``flow_cookie``          yes  ---    ---
+updates ``OFPFF_SEND_FLOW_REM``  yes  ---    ---
+honors ``OFPFF_CHECK_OVERLAP``   yes  ---    ---
+updates ``idle_timeout``         yes  ---    ---
+updates ``hard_timeout``         yes  ---    ---
+resets idle timer                yes  ---    ---
+resets hard timer                yes  yes    yes
+zeros counters                   yes   &      &
+may add a new flow               yes  ---    ---
+sends ``flow_removed`` message   ---  ---    ---     %      %
+================================ === ====== ====== ====== ======
+
+``%``
+  "delete" and "delete_strict" generate a flow_removed message if the deleted
+  flow or flows have the ``OFPFF_SEND_FLOW_REM`` flag set.  (Each controller
+  can separately control whether it wants to receive the generated messages.)
+
+``&``
+  "modify" and "modify-strict" reset counters if the ``OFPFF_RESET_COUNTS``
+  flag is specified.
+
+OpenFlow 1.3
+~~~~~~~~~~~~
+
+OpenFlow 1.3 makes these changes:
+
+- Behavior for a table_id of 255 is now defined, for "delete" and
+  "delete-strict" commands, as meaning to delete from all tables.  A table_id
+  of 255 is now explicitly invalid for other commands.
+
+- New flags ``OFPFF_NO_PKT_COUNTS`` and ``OFPFF_NO_BYT_COUNTS`` for "add"
+  operations.
+
+The table for 1.3 is the same as the one shown above for 1.2.
+
+OpenFlow 1.4
+~~~~~~~~~~~~
+
+OpenFlow 1.4 makes these changes:
+
+- Adds the "importance" field to ``flow_mods``, but it does not explicitly
+  specify which kinds of ``flow_mods`` set the importance.  For consistency,
+  Open vSwitch uses the same rule for importance as for ``idle_timeout`` and
+  ``hard_timeout``, that is, only an "ADD" flow_mod sets the importance.  (This
+  issue has been filed with the ONF as EXT-496.)
+
+.. TODO(stephenfin) Link to EXT-496
+
+- Adds an eviction mechanism that automatically deletes entries of lower
+  importance to make space for newer entries.
+
+OpenFlow 1.4 Bundles
+--------------------
+
+Open vSwitch makes all flow table modifications atomically, i.e., any datapath
+packet only sees flow table configurations either before or after any change
+made by any ``flow_mod``.  For example, if a controller removes all flows with
+a single OpenFlow ``flow_mod``, no packet sees an intermediate version of the
+OpenFlow pipeline where only some of the flows have been deleted.
+
+It should be noted that Open vSwitch caches datapath flows, and that the cached
+flows are *NOT* flushed immediately when a flow table changes.  Instead, the
+datapath flows are revalidated against the new flow table as soon as possible,
+and usually within one second of the modification.  This design amortizes the
+cost of datapath cache flushing across multiple flow table changes, and has a
+significant performance effect during simultaneous heavy flow table churn and
+high traffic load.  This means that different cached datapath flows may have
+been computed based on different flow table configurations, but each of the
+datapath flows is guaranteed to have been computed over a coherent view of the
+flow tables, as described above.
+
+With OpenFlow 1.4 bundles this atomicity can be extended across an arbitrary
+set of ``flow_mod`` messages.  Bundles are supported for ``flow_mod`` and
+``port_mod`` messages only.  For ``flow_mod``, both ``atomic`` and ``ordered``
+bundle flags are trivially supported, as all bundled messages are executed in
+the order they were added and all flow table modifications are now atomic to
+the datapath.  Port mods may not appear in atomic bundles, as port status
+modifications are not atomic.
+
+To support bundles, ovs-ofctl has a ``--bundle`` option that makes the
+flow mod commands (``add-flow``, ``add-flows``, ``mod-flows``, ``del-flows``,
+and ``replace-flows``) use an OpenFlow 1.4 bundle to apply the
+modifications as a single atomic transaction.  If any of the flow mods
+in a transaction fail, none of them are executed.  All flow mods in a
+bundle appear to datapath lookups simultaneously.
+
+Furthermore, ovs-ofctl ``add-flow`` and ``add-flows`` commands now accept
+arbitrary flow mods as an input by allowing the flow specification to
+start with an explicit ``add``, ``modify``, ``modify_strict``, ``delete``, or
+``delete_strict`` keyword.  A missing keyword is treated as ``add``, so
+this is fully backwards compatible.  With the new ``--bundle`` option
+all the flow mods are executed as a single atomic transaction using an
+OpenFlow 1.4 bundle.  Without the ``--bundle`` option the flow mods are
+executed in order up to the first failing ``flow_mod``, and in case of an
+error the earlier successful ``flow_mod`` calls are not rolled back.
+
+``OFPT_PACKET_IN``
+------------------
+
+The OpenFlow 1.1 specification for ``OFPT_PACKET_IN`` is confusing.  The
+definition in OF1.1 ``openflow.h`` is[*]:
+
+::
+
+    /* Packet received on port (datapath -> controller). */
+    struct ofp_packet_in {
+        struct ofp_header header;
+        uint32_t buffer_id;     /* ID assigned by datapath. */
+        uint32_t in_port;       /* Port on which frame was received. */
+        uint32_t in_phy_port;   /* Physical Port on which frame was received. */
+        uint16_t total_len;     /* Full length of frame. */
+        uint8_t reason;         /* Reason packet is being sent (one of OFPR_*) */
+        uint8_t table_id;       /* ID of the table that was looked up */
+        uint8_t data[0];        /* Ethernet frame, halfway through 32-bit word,
+                                   so the IP header is 32-bit aligned.  The
+                                   amount of data is inferred from the length
+                                   field in the header.  Because of padding,
+                                   offsetof(struct ofp_packet_in, data) ==
+                                   sizeof(struct ofp_packet_in) - 2. */
+    };
+    OFP_ASSERT(sizeof(struct ofp_packet_in) == 24);
+
+The confusing part is the comment on the ``data[]`` member.  This comment is a
+leftover from OF1.0 ``openflow.h``, in which the comment was correct:
+``sizeof(struct ofp_packet_in)`` is 20 in OF1.0 and ``offsetof(struct
+ofp_packet_in, data)`` is 18.  When OF1.1 was written, the structure members
+were changed but the comment was carelessly not updated, and the comment became
+wrong: ``sizeof(struct ofp_packet_in)`` and ``offsetof(struct ofp_packet_in,
+data)`` are both 24 in OF1.1.
+
+That leaves the question of how to implement ``ofp_packet_in`` in OF1.1.  The
+OpenFlow reference implementation for OF1.1 does not include any padding, that
+is, the first byte of the encapsulated frame immediately follows the
+``table_id`` member without a gap.  Open vSwitch therefore implements it the
+same way for compatibility.
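+
+The sizes and offsets discussed above can be checked at compile time.  The
+following is a sketch under stated assumptions: the struct names
+``ofp10_packet_in`` and ``ofp11_packet_in`` are hypothetical (both versions
+call the structure ``ofp_packet_in``), the OF1.0 member list is taken from the
+OF1.0 ``openflow.h``, and a C99 flexible array member stands in for
+``data[0]``.  It assumes a typical ABI in which ``uint32_t`` is 4-byte
+aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Common 8-byte OpenFlow header. */
struct ofp_header {
    uint8_t version;
    uint8_t type;
    uint16_t length;
    uint32_t xid;
};

/* OF1.0 layout (renamed for illustration). */
struct ofp10_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint16_t total_len;
    uint16_t in_port;
    uint8_t reason;
    uint8_t pad;
    uint8_t data[];         /* Starts at offset 18; struct is padded to 20. */
};

/* OF1.1 layout (renamed for illustration). */
struct ofp11_packet_in {
    struct ofp_header header;
    uint32_t buffer_id;
    uint32_t in_port;
    uint32_t in_phy_port;
    uint16_t total_len;
    uint8_t reason;
    uint8_t table_id;
    uint8_t data[];         /* Starts at offset 24; no trailing padding. */
};

/* In OF1.0 the comment on data[] holds: offsetof == sizeof - 2. */
static_assert(sizeof(struct ofp10_packet_in) == 20, "OF1.0 size");
static_assert(offsetof(struct ofp10_packet_in, data) == 18, "OF1.0 offset");

/* In OF1.1 it does not: both values are 24. */
static_assert(sizeof(struct ofp11_packet_in) == 24, "OF1.1 size");
static_assert(offsetof(struct ofp11_packet_in, data) == 24, "OF1.1 offset");
```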
+
+For an earlier discussion, please see the thread archived at:
+https://mailman.stanford.edu/pipermail/openflow-discuss/2011-August/002604.html
+
+[*] The quoted definition is directly from OF1.1.  Definitions used inside OVS
+omit the 8-byte ``ofp_header`` members, so the sizes in this discussion are
+8 bytes larger than those declared in OVS header files.
+
+VLAN Matching
+-------------
+
+The 802.1Q VLAN header causes more trouble than any other 4 bytes in
+networking.  More specifically, three versions of OpenFlow and Open vSwitch
+have among them four different ways to match the contents and presence of the
+VLAN header.  The following table describes how each version works.
+
+======== ============= =============== =============== ================
+ Match        NXM          OF1.0            OF1.1           OF1.2
+======== ============= =============== =============== ================
+ ``[1]`` ``0000/0000`` ``????/1,??/?`` ``????/1,??/?`` ``0000/0000,--``
+ ``[2]`` ``0000/ffff`` ``ffff/0,??/?`` ``ffff/0,??/?`` ``0000/ffff,--``
+ ``[3]`` ``1xxx/1fff`` ``0xxx/0,??/1`` ``0xxx/0,??/1`` ``1xxx/ffff,--``
+ ``[4]`` ``z000/f000`` ``????/1,0y/0`` ``fffe/0,0y/0`` ``1000/1000,0y``
+ ``[5]`` ``zxxx/ffff`` ``0xxx/0,0y/0`` ``0xxx/0,0y/0`` ``1xxx/ffff,0y``
+ ``[6]`` ``0000/0fff`` ``<none>``      ``<none>``      ``<none>``
+ ``[7]`` ``0000/f000`` ``<none>``      ``<none>``      ``<none>``
+ ``[8]`` ``0000/efff`` ``<none>``      ``<none>``      ``<none>``
+ ``[9]`` ``1001/1001`` ``<none>``      ``<none>``      ``1001/1001,--``
+``[10]`` ``3000/3000`` ``<none>``      ``<none>``      ``<none>``
+``[11]`` ``1000/1000`` ``<none>``      ``fffe/0,??/1`` ``1000/1000,--``
+======== ============= =============== =============== ================
+
+where:
+
+Match:
+  See the list below.
+
+NXM:
+  ``xxxx/yyyy`` means ``NXM_OF_VLAN_TCI_W`` with value ``xxxx`` and mask
+  ``yyyy``.  A mask of ``0000`` is equivalent to omitting
+  ``NXM_OF_VLAN_TCI(_W)``, a mask of ``ffff`` is equivalent to
+  ``NXM_OF_VLAN_TCI``.
+
+OF1.0, OF1.1:
+  ``wwww/x,yy/z`` means ``dl_vlan`` ``wwww``, ``OFPFW_DL_VLAN`` ``x``,
+  ``dl_vlan_pcp`` ``yy``, and ``OFPFW_DL_VLAN_PCP`` ``z``.  If
+  ``OFPFW_DL_VLAN`` or ``OFPFW_DL_VLAN_PCP`` is 1, the corresponding field
+  value is wildcarded, otherwise it is matched.  ``?`` means that the given
+  bits are ignored (their conventional values are ``0000/x,00/0`` in OF1.0,
+  ``0000/x,00/1`` in OF1.1; ``x`` is never ignored).  ``<none>`` means that the
+  given match is not supported.
+
+OF1.2:
+  ``xxxx/yyyy,zz`` means ``OXM_OF_VLAN_VID_W`` with value ``xxxx`` and mask
+  ``yyyy``, and ``OXM_OF_VLAN_PCP`` (which is not maskable) with value ``zz``.
+  A mask of ``0000`` is equivalent to omitting ``OXM_OF_VLAN_VID(_W)``, a mask
+  of ``ffff`` is equivalent to ``OXM_OF_VLAN_VID``.  ``--`` means that
+  ``OXM_OF_VLAN_PCP`` is omitted.  ``<none>`` means that the given match is not
+  supported.
+
+The matches are:
+
+``[1]``:
+  Matches any packet, that is, one without an 802.1Q header or with an 802.1Q
+  header with any TCI value.
+
+``[2]``
+  Matches only packets without an 802.1Q header.
+
+  NXM:
+    Any match with ``vlan_tci == 0`` and ``(vlan_tci_mask & 0x1000) != 0`` is
+    equivalent to the one listed in the table.
+
+  OF1.0:
+    The spec doesn't define behavior if ``dl_vlan`` is set to ``0xffff`` and
+    ``OFPFW_DL_VLAN_PCP`` is not set.
+
+  OF1.1:
+    The spec says explicitly to ignore ``dl_vlan_pcp`` when ``dl_vlan`` is set
+    to ``0xffff``.
+
+  OF1.2:
+    The spec doesn't say what should happen if ``vlan_vid == 0`` and
+    ``(vlan_vid_mask & 0x1000) != 0`` but ``vlan_vid_mask != 0x1000``, but it
+    would be straightforward to also interpret as ``[2]``.
+
+``[3]``
+  Matches only packets that have an 802.1Q header with VID ``xxx`` (and any
+  PCP).
+
+``[4]``
+  Matches only packets that have an 802.1Q header with PCP ``y`` (and any VID).
+
+  NXM:
+    ``z`` is ``(y << 1) | 1``.
+
+  OF1.0:
+    The spec isn't very clear, but OVS implements it this way.
+
+  OF1.2:
+    Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1000``
+    would also work, but the spec doesn't define their behavior.
+
+``[5]``
+  Matches only packets that have an 802.1Q header with VID ``xxx`` and PCP
+  ``y``.
+
+   NXM:
+     ``z`` is ``((y << 1) | 1)``.
+
+   OF1.2:
+     Presumably other masks such that ``(vlan_vid_mask & 0x1fff) == 0x1fff``
+     would also work.
+
+``[6]``
+  Matches packets with no 802.1Q header or with an 802.1Q header with a VID of
+  0.  Only possible with NXM.
+
+``[7]``
+  Matches packets with no 802.1Q header or with an 802.1Q header with a PCP of
+  0.  Only possible with NXM.
+
+``[8]``
+  Matches packets with no 802.1Q header or with an 802.1Q header with both VID
+  and PCP of 0.  Only possible with NXM.
+
+``[9]``
+  Matches only packets that have an 802.1Q header with an odd-numbered VID (and
+  any PCP).  Only possible with NXM and OF1.2.  (This is just an example; one
+  can match on any desired VID bit pattern.)
+
+``[10]``
+  Matches only packets that have an 802.1Q header with an odd-numbered PCP (and
+  any VID).  Only possible with NXM.  (This is just an example; one can match
+  on any desired PCP bit pattern.)
+
+``[11]``
+  Matches any packet with an 802.1Q header, regardless of VID or PCP.
+
+Additional notes:
+
+OF1.2:
+  The top three bits of ``OXM_OF_VLAN_VID`` are fixed to zero, so bits 13, 14,
+  and 15 in the masks listed in the table may be set to arbitrary values, as
+  long as the corresponding value bits are also zero.  The suggested ``ffff``
+  mask for [2], [3], and [5] allows a shorter OXM representation (the mask is
+  omitted) than the minimal ``1fff`` mask.
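+
+The NXM column can be read as an ordinary value/mask comparison.  The sketch
+below is illustrative only; the helper names ``nxm_tci`` and ``tci_matches``
+are hypothetical.  It assumes, consistent with the table, that the TCI seen by
+``NXM_OF_VLAN_TCI`` is 0 for a packet without an 802.1Q header and otherwise
+has bit ``0x1000`` set alongside the header's PCP and VID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit assumed to be set in the TCI whenever an 802.1Q header is present. */
#define VLAN_CFI 0x1000

/* Hypothetical helper: builds the TCI value that an NXM match is compared
 * against.  Returns 0 for packets without an 802.1Q header. */
uint16_t
nxm_tci(bool has_vlan, uint8_t pcp, uint16_t vid)
{
    if (!has_vlan) {
        return 0;
    }
    return (uint16_t) (((pcp & 0x7) << 13) | VLAN_CFI | (vid & 0xfff));
}

/* Hypothetical helper: value/mask comparison in the "xxxx/yyyy" notation
 * used in the NXM column of the table. */
bool
tci_matches(uint16_t tci, uint16_t value, uint16_t mask)
{
    return (tci & mask) == value;
}
```

+For example, with PCP 5 and VID ``0x123`` the TCI is ``0xb123``, which matches
+row [3] as ``1123/1fff`` and row [4] as ``b000/f000`` (``z`` is
+``(5 << 1) | 1``, that is, ``0xb``).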
+
+Flow Cookies
+------------
+
+OpenFlow 1.0 and later versions have the concept of a "flow cookie", which is a
+64-bit integer value attached to each flow.  The treatment of the flow cookie
+has varied greatly across OpenFlow versions, however.
+
+In OpenFlow 1.0:
+
+- ``OFPFC_ADD`` set the cookie in the flow that it added.
+
+- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` updated the cookie for the flow
+  or flows that it modified.
+
+- ``OFPST_FLOW`` messages included the flow cookie.
+
+- ``OFPT_FLOW_REMOVED`` messages reported the cookie of the flow that was
+  removed.
+
+OpenFlow 1.1 made the following changes:
+
+- Flow mod operations ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``,
+  ``OFPFC_DELETE``, and ``OFPFC_DELETE_STRICT``, plus flow stats requests and
+  aggregate stats requests, gained the ability to match on flow cookies with an
+  arbitrary mask.
+
+- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to add a new flow,
+  in the case of no match, only if the flow table modification operation did
+  not match on the cookie field.  (In OpenFlow 1.0, modify operations always
+  added a new flow when there was no match.)
+
+- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` no longer updated flow cookies.
+
+OpenFlow 1.2 made the following changes:
+
+- ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` were changed to never add a new
+  flow, regardless of whether the flow cookie was used for matching.
+
+Open vSwitch support for OpenFlow 1.0 implements the OpenFlow 1.0 behavior with
+the following extensions:
+
+- An NXM extension field ``NXM_NX_COOKIE(_W)`` allows the NXM versions of
+  ``OFPFC_MODIFY``, ``OFPFC_MODIFY_STRICT``, ``OFPFC_DELETE``, and
+  ``OFPFC_DELETE_STRICT`` ``flow_mod`` calls, plus flow stats requests and
+  aggregate stats requests, to match on flow cookies with arbitrary masks.
+  This is much like the equivalent OpenFlow 1.1 feature.
+
+- Like OpenFlow 1.1, ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` add a new
+  flow if there is no match and the mask is zero (or not given).
+
+- The ``cookie`` field in ``OFPT_FLOW_MOD`` and ``NXT_FLOW_MOD`` messages is
+  used as the cookie value for ``OFPFC_ADD`` commands, as described in OpenFlow
+  1.0.  For ``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` commands, the
+  ``cookie`` field is used as a new cookie for flows that match unless it is
+  ``UINT64_MAX``, in which case the flow's cookie is not updated.
+
+- ``NXT_PACKET_IN`` (the Nicira extended version of ``OFPT_PACKET_IN``) reports
+  the cookie of the rule that generated the packet, or all-1-bits if no rule
+  generated the packet.  (Older versions of OVS used all-0-bits instead of
+  all-1-bits.)
+
+The following table shows the handling of different protocols when receiving
+``OFPFC_MODIFY`` and ``OFPFC_MODIFY_STRICT`` messages.  A mask of 0 indicates
+either an explicit mask of zero or an implicit one by not specifying the
+``NXM_NX_COOKIE(_W)`` field.
+
+==============  ======  ======  =============  =============
+                Match   Update   Add on miss    Add on miss
+                cookie  cookie     mask!=0        mask==0
+==============  ======  ======  =============  =============
+OpenFlow 1.0      no     yes    (add on miss)  (add on miss)
+OpenFlow 1.1     yes      no         no             yes
+OpenFlow 1.2     yes      no         no             no
+NXM              yes     yes\*       no             yes
+==============  ======  ======  =============  =============
+
+\* Updates the flow's cookie unless the ``cookie`` field is ``UINT64_MAX``.
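+
+The table above can be restated as two small decision functions.  This is a
+sketch of the table only, not code from Open vSwitch; the ``enum protocol``
+type and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the four rows of the table. */
enum protocol { PROTO_OF10, PROTO_OF11, PROTO_OF12, PROTO_NXM };

/* Whether a "modify" or "modify-strict" that matched no flow adds a new
 * one.  'cookie_mask' is ignored for OpenFlow 1.0, which cannot match on
 * the cookie. */
bool
modify_adds_on_miss(enum protocol proto, uint64_t cookie_mask)
{
    switch (proto) {
    case PROTO_OF10:
        return true;
    case PROTO_OF11:
    case PROTO_NXM:
        return cookie_mask == 0;
    case PROTO_OF12:
        return false;
    }
    return false;
}

/* The cookie that an existing flow carries after a "modify" or
 * "modify-strict" matched it. */
uint64_t
cookie_after_modify(enum protocol proto, uint64_t old_cookie,
                    uint64_t new_cookie)
{
    switch (proto) {
    case PROTO_OF10:
        return new_cookie;
    case PROTO_NXM:
        /* UINT64_MAX means "leave the cookie unchanged". */
        return new_cookie == UINT64_MAX ? old_cookie : new_cookie;
    case PROTO_OF11:
    case PROTO_OF12:
        return old_cookie;
    }
    return old_cookie;
}
```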
+
+Multiple Table Support
+----------------------
+
+OpenFlow 1.0 has only rudimentary support for multiple flow tables.  Notably,
+OpenFlow 1.0 does not allow the controller to specify the flow table to which a
+flow is to be added.  Open vSwitch adds an extension for this purpose, which is
+enabled on a per-OpenFlow connection basis using the ``NXT_FLOW_MOD_TABLE_ID``
+message.  When the extension is enabled, the upper 8 bits of the ``command``
+member in an ``OFPT_FLOW_MOD`` or ``NXT_FLOW_MOD`` message designate the table
+to which a flow is to be added.
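+
+As a minimal sketch of that encoding, the helpers below split and rebuild the
+16-bit ``command`` member.  The function names are hypothetical, and the value
+is assumed to have already been converted from network to host byte order.

```c
#include <stdint.h>

/* Hypothetical helper: table ID carried in the upper 8 bits of 'command'. */
uint8_t
flow_mod_table_id(uint16_t command)
{
    return (uint8_t) (command >> 8);
}

/* Hypothetical helper: the OFPFC_* command in the lower 8 bits. */
uint8_t
flow_mod_command(uint16_t command)
{
    return (uint8_t) (command & 0xff);
}

/* Hypothetical helper: combines a table ID and an OFPFC_* command into the
 * 16-bit value used when the extension is enabled. */
uint16_t
flow_mod_pack(uint8_t table_id, uint8_t command)
{
    return (uint16_t) ((table_id << 8) | command);
}
```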
+
+The Open vSwitch software switch implementation offers 255 flow tables.  On
+packet ingress, only the first flow table (table 0) is searched, and the
+contents of the remaining tables are not considered in any way.  Tables other
+than table 0 only come into play when an ``NXAST_RESUBMIT_TABLE`` action
+specifies another table to search.
+
+Tables 128 and above are reserved for use by the switch itself.  Controllers
+should use only tables 0 through 127.
+
+``OFPTC_*`` Table Configuration
+-------------------------------
+
+This section covers the history of the ``OFPTC_*`` table configuration bits
+across OpenFlow versions.
+
+OpenFlow 1.0 flow tables had fixed configurations.
+
+OpenFlow 1.1 enabled controllers to configure behavior upon flow table miss and
+added the ``OFPTC_MISS_*`` constants for that purpose.  ``OFPTC_*`` did not
+control anything else but it was nevertheless conceptualized as a set of
+bit-fields instead of an enum.  OF1.1 added the ``OFPT_TABLE_MOD`` message to
+set ``OFPTC_MISS_*`` for a flow table and added the ``config`` field to the
+``OFPST_TABLE`` reply to report the current setting.
+
+OpenFlow 1.2 did not change anything in this regard.
+
+OpenFlow 1.3 switched to another means of changing flow table miss behavior and
+deprecated ``OFPTC_MISS_*`` without adding any more ``OFPTC_*`` constants.
+This meant that ``OFPT_TABLE_MOD`` now had no purpose at all, but OF1.3 kept it
+around "for backward compatibility with older and newer versions of the
+specification."  At the same time, OF1.3 introduced a new message
+``OFPMP_TABLE_FEATURES`` that included a field ``config`` documented as
+reporting the ``OFPTC_*`` values set with ``OFPT_TABLE_MOD``; of course this
+served no real purpose because no ``OFPTC_*`` values are defined.  OF1.3 did
+remove the ``OFPTC_*`` field from ``OFPMP_TABLE`` (previously named
+``OFPST_TABLE``).
+
+OpenFlow 1.4 defined two new ``OFPTC_*`` constants, ``OFPTC_EVICTION`` and
+``OFPTC_VACANCY_EVENTS``, using bits that did not overlap with ``OFPTC_MISS_*``
+even though those bits had not been defined since OF1.2.  ``OFPT_TABLE_MOD``
+still controlled these settings.  The field for ``OFPTC_*`` values in
+``OFPMP_TABLE_FEATURES`` was renamed from ``config`` to ``capabilities`` and
+documented as reporting the flags that are supported in a ``OFPT_TABLE_MOD``
+message.  The ``OFPMP_TABLE_DESC`` message newly added in OF1.4 reported the
+``OFPTC_*`` setting.
+
+OpenFlow 1.5 did not change anything in this regard.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - OpenFlow
+     - ``OFPTC_*`` flags
+     - ``TABLE_MOD``
+     - Statistics
+     - ``TABLE_FEATURES``
+     - ``TABLE_DESC``
+   * - OF1.0
+     - none
+     - no (\*)(+)
+     - no (\*)
+     - nothing (\*)(+)
+     - no (\*)(+)
+   * - OF1.1/1.2
+     - ``MISS_*``
+     - yes
+     - yes
+     - nothing (+)
+     - no (+)
+   * - OF1.3
+     - none
+     - yes (\*)
+     - no (\*)
+     - config (\*)
+     - no (\*)(+)
+   * - OF1.4/1.5
+     - ``EVICTION``/``VACANCY_EVENTS``
+     - yes
+     - no
+     - capabilities
+     - yes
+
+where:
+
+OpenFlow:
+  The OpenFlow version(s).
+
+``OFPTC_*`` flags:
+  The ``OFPTC_*`` flags defined in those versions.
+
+``TABLE_MOD``:
+  Whether ``OFPT_TABLE_MOD`` can modify ``OFPTC_*`` flags.
+
+Statistics:
+  Whether ``OFPST_TABLE/OFPMP_TABLE`` reports the ``OFPTC_*`` flags.
+
+``TABLE_FEATURES``:
+  What ``OFPMP_TABLE_FEATURES`` reports (if it exists): either the current
+  configuration or the switch's capabilities.
+
+``TABLE_DESC``:
+  Whether ``OFPMP_TABLE_DESC`` reports the current configuration.
+
+(\*): Nothing to report/change anyway.
+
+(+): No such message.
+
+IPv6
+----
+
+Open vSwitch supports stateless handling of IPv6 packets.  Flows can be written
+to support matching TCP, UDP, and ICMPv6 headers within an IPv6 packet.  Deeper
+matching of some Neighbor Discovery messages is also supported.
+
+IPv6 was not designed to interact well with middle-boxes.  This, combined with
+Open vSwitch's stateless nature, has affected the processing of IPv6 traffic,
+which is detailed below.
+
+Extension Headers
+~~~~~~~~~~~~~~~~~
+
+The base IPv6 header is incredibly simple with the intention of only containing
+information relevant for routing packets between two endpoints.  IPv6 relies
+heavily on the use of extension headers to provide any other functionality.
+Unfortunately, the extension headers were designed in such a way that it is
+impossible to move to the next header (including the layer-4 payload) unless
+the current header is understood.
+
+Open vSwitch will process the following extension headers and continue to the
+next header:
+
+- Fragment (see the next section)
+- AH (Authentication Header)
+- Hop-by-Hop Options
+- Routing
+- Destination Options
+
+When a header is encountered that is not in that list, it is considered
+"terminal".  A terminal header's IPv6 protocol value is stored in ``nw_proto``
+for matching purposes.  If a terminal header is TCP, UDP, or ICMPv6, the packet
+will be further processed in an attempt to extract layer-4 information.
+
+Fragments
+~~~~~~~~~
+
+IPv6 requires that every link in the internet have an MTU of 1280 octets or
+greater (RFC 2460).  As such, a terminal header (as described above in
+"Extension Headers") in the first fragment should generally be reachable.  In
+this case, the terminal header's IPv6 protocol type is stored in the
+``nw_proto`` field for matching purposes.  If a terminal header cannot be found
+in the first fragment (one with a fragment offset of zero), the ``nw_proto``
+field is set to 0.  Subsequent fragments (those with a non-zero fragment
+offset) have the ``nw_proto`` field set to the IPv6 protocol type for fragments
+(44).
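+
+The extension header and fragment rules above can be sketched as a single
+walk over the header chain.  This is an illustration under stated assumptions,
+not the Open vSwitch parser: the function name ``ipv6_nw_proto`` and the
+``EXT_*`` constant names are hypothetical, header lengths follow the standard
+IPv6 encodings, and returning 0 for any truncated chain (not only for first
+fragments) is a simplifying assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* IPv6 Next Header values for the extension headers listed above. */
#define EXT_HOPOPTS   0
#define EXT_ROUTING  43
#define EXT_FRAGMENT 44
#define EXT_AH       51
#define EXT_DSTOPTS  60

/* 'data' points just past the fixed 40-byte IPv6 header, 'len' is the
 * number of bytes available there, and 'next' is the Next Header value from
 * the fixed header.  Returns the value to store in nw_proto. */
uint8_t
ipv6_nw_proto(const uint8_t *data, size_t len, uint8_t next)
{
    for (;;) {
        size_t hdr_len;

        switch (next) {
        case EXT_HOPOPTS:
        case EXT_ROUTING:
        case EXT_DSTOPTS:
            if (len < 2) {
                return 0;
            }
            /* Length is in 8-octet units, not counting the first 8. */
            hdr_len = ((size_t) data[1] + 1) * 8;
            break;
        case EXT_AH:
            if (len < 2) {
                return 0;
            }
            /* Length is in 4-octet units, minus 2. */
            hdr_len = ((size_t) data[1] + 2) * 4;
            break;
        case EXT_FRAGMENT:
            if (len < 8) {
                return 0;
            }
            /* Upper 13 bits of bytes 2-3 hold the fragment offset. */
            if ((((data[2] << 8) | data[3]) & 0xfff8) != 0) {
                return EXT_FRAGMENT;    /* Later fragment. */
            }
            hdr_len = 8;
            break;
        default:
            return next;                /* Terminal header. */
        }

        if (len < hdr_len) {
            return 0;
        }
        next = data[0];
        data += hdr_len;
        len -= hdr_len;
    }
}
```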
+
+Jumbograms
+~~~~~~~~~~
+
+An IPv6 jumbogram (RFC 2675) is a packet containing a payload longer than
+65,535 octets.  Jumbograms are only relevant in subnets with a link MTU greater
+than 65,575 octets, and are not required to be supported on nodes that do not
+connect to links with such large MTUs.  Currently, Open vSwitch doesn't process
+jumbograms.
+
+In-Band Control
+---------------
+
+Motivation
+~~~~~~~~~~
+
+An OpenFlow switch must establish and maintain a TCP network connection to its
+controller.  There are two basic ways to categorize the network that this
+connection traverses: either it is completely separate from the one that the
+switch is otherwise controlling, or its path may overlap the network that the
+switch controls.  We call the former case "out-of-band control", the latter
+case "in-band control".
+
+Out-of-band control has the following benefits:
+
+- Simplicity: Out-of-band control slightly simplifies the switch
+  implementation.
+
+- Reliability: Excessive switch traffic volume cannot interfere with control
+  traffic.
+
+- Integrity: Machines not on the control network cannot impersonate a switch or
+  a controller.
+
+- Confidentiality: Machines not on the control network cannot snoop on control
+  traffic.
+
+In-band control, on the other hand, has the following advantages:
+
+- No dedicated port: There is no need to dedicate a physical switch port to
+  control, which is important on switches that have few ports (e.g. wireless
+  routers, low-end embedded platforms).
+
+- No dedicated network: There is no need to build and maintain a separate
+  control network.  This is important in many environments because it reduces
+  proliferation of switches and wiring.
+
+Open vSwitch supports both out-of-band and in-band control.  This section
+describes the principles behind in-band control.  See the description of the
+Controller table in ovs-vswitchd.conf.db(5) to configure OVS for in-band
+control.
+
+Principles
+~~~~~~~~~~
+
+The fundamental principle of in-band control is that an OpenFlow switch must
+recognize and switch control traffic without involving the OpenFlow controller.
+All the details of implementing in-band control are special cases of this
+principle.
+
+The rationale for this principle is simple.  If the switch does not handle
+in-band control traffic itself, then it will be caught in a contradiction: it
+must contact the controller, but it cannot, because only the controller can set
+up the flows that are needed to contact the controller.
+
+The following points describe important special cases of this principle.
+
+- In-band control must be implemented regardless of whether the switch is
+  connected.
+
+  It is tempting to implement the in-band control rules only when the switch is
+  not connected to the controller, using the reasoning that the controller
+  should have complete control once it has established a connection with the
+  switch.
+
+  This does not work in practice.  Consider the case where the switch is
+  connected to the controller.  Occasionally it can happen that the controller
+  forgets or otherwise needs to obtain the MAC address of the switch.  To do
+  so, the controller sends a broadcast ARP request.  A switch that implements
+  the in-band control rules only when it is disconnected will then send an
+  ``OFPT_PACKET_IN`` message up to the controller.  The controller will be
+  unable to respond, because it does not know the MAC address of the switch.
+  This is a deadlock situation that can only be resolved by the switch noticing
+  that its connection to the controller has hung and reconnecting.
+
+- In-band control must override flows set up by the controller.
+
+  It is reasonable to assume that flows set up by the OpenFlow controller
+  should take precedence over in-band control, on the basis that the controller
+  should be in charge of the switch.
+
+  Again, this does not work in practice.  Reasonable controller implementations
+  may set up a "last resort" fallback rule that wildcards every field and,
+  e.g., sends it up to the controller or discards it.  If a controller does
+  that, then it will isolate itself from the switch.
+
+- The switch must recognize all control traffic.
+
+  The fundamental principle of in-band control states, in part, that a switch
+  must recognize control traffic without involving the OpenFlow controller.
+  More specifically, the switch must recognize *all* control traffic.  "False
+  negatives", that is, packets that constitute control traffic but that the
+  switch does not recognize as control traffic, lead to control traffic storms.
+
+  Consider an OpenFlow switch that only recognizes control packets sent to or
+  from that switch.  Now suppose that two switches of this type, named A and B,
+  are connected to ports on an Ethernet hub (not a switch) and that an OpenFlow
+  controller is connected to a third hub port.  In this setup, control traffic
+  sent by switch A will be seen by switch B, which will send it to the
+  controller as part of an OFPT_PACKET_IN message.  Switch A will then see the
+  OFPT_PACKET_IN message's packet, re-encapsulate it in another OFPT_PACKET_IN,
+  and send it to the controller.  Switch B will then see that OFPT_PACKET_IN,
+  and so on in an infinite loop.
+
+  Incidentally, the consequences of "false positives", where packets that are
+  not control traffic are nevertheless recognized as control traffic, are much
+  less severe.  The controller will not be able to control their behavior, but
+  the network will remain in working order.  False positives do constitute a
+  security problem.
+
+- The switch should use echo-requests to detect disconnection.
+
+  TCP will notice that a connection has hung, but this can take a considerable
+  amount of time.  For example, with default settings the Linux kernel TCP
+  implementation will retransmit for between 13 and 30 minutes, depending on
+  the connection's retransmission timeout, according to kernel documentation.
+  This is far too long for a switch to be disconnected, so an OpenFlow switch
+  should implement its own connection timeout.  OpenFlow ``OFPT_ECHO_REQUEST``
+  messages are the best way to do this, since they test the OpenFlow connection
+  itself.
+
+Implementation
+~~~~~~~~~~~~~~
+
+This section describes how Open vSwitch implements in-band control.  Correctly
+implementing in-band control has proven difficult due to its many subtleties,
+and the implementation has thus gone through many iterations.  Please read
+through and understand the reasoning behind the chosen rules before making
+modifications.
+
+Open vSwitch implements in-band control as "hidden" flows, that is, flows that
+are not visible through OpenFlow, and at a higher priority than wildcarded
+flows that can be set up through OpenFlow.  This is done so that the OpenFlow
+controller cannot interfere with them and possibly break connectivity with its
+switches.  It is possible to see all flows, including in-band ones, with the
+ovs-appctl "bridge/dump-flows" command.
+
+The Open vSwitch implementation of in-band control can hide traffic to
+arbitrary "remotes", where each remote is one TCP port on one IP address.
+Currently the remotes are automatically configured as the in-band OpenFlow
+controllers plus the OVSDB managers, if any.  (The latter is a requirement
+because OVSDB managers are responsible for configuring OpenFlow controllers, so
+if the manager cannot be reached then OpenFlow cannot be reconfigured.)
+
+The following rules (with the OFPP_NORMAL action) are set up on any bridge that
+has any remotes:
+
+(a)
+  DHCP requests sent from the local port.
+(b)
+  ARP replies to the local port's MAC address.
+(c)
+  ARP requests from the local port's MAC address.
+
+In-band also sets up the following rules for each unique next-hop MAC address
+for the remotes' IPs (the "next hop" is either the remote itself, if it is on a
+local subnet, or the gateway to reach the remote):
+
+(d)
+  ARP replies to the next hop's MAC address.
+(e)
+  ARP requests from the next hop's MAC address.
+
+In-band also sets up the following rules for each unique remote IP address:
+
+(f)
+  ARP replies containing the remote's IP address as a target.
+(g)
+  ARP requests containing the remote's IP address as a source.
+
+In-band also sets up the following rules for each unique remote (IP,port) pair:
+
+(h)
+  TCP traffic to the remote's IP and port.
+(i)
+  TCP traffic from the remote's IP and port.
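+
+As a rough illustration of how rules (a) through (i) scale with the set of
+remotes, the sketch below builds one descriptor per rule.  The dictionary keys
+and the function name are hypothetical and do not correspond to the actual
+Open vSwitch data structures.

```python
def in_band_rules(local_mac, next_hop_macs, remotes):
    """Build rule descriptors (a)-(i).

    local_mac:     MAC address of the local port.
    next_hop_macs: next-hop MAC addresses for the remotes' IPs.
    remotes:       iterable of (ip, tcp_port) pairs.
    """
    remotes = list(remotes)
    rules = [
        ("a", {"type": "dhcp_request", "from": "local_port"}),
        ("b", {"type": "arp_reply", "dl_dst": local_mac}),
        ("c", {"type": "arp_request", "dl_src": local_mac}),
    ]
    # One pair of rules per unique next-hop MAC address.
    for mac in sorted(set(next_hop_macs)):
        rules.append(("d", {"type": "arp_reply", "dl_dst": mac}))
        rules.append(("e", {"type": "arp_request", "dl_src": mac}))
    # One pair of rules per unique remote IP address.
    for ip in sorted({ip for ip, _ in remotes}):
        rules.append(("f", {"type": "arp_reply", "arp_tpa": ip}))
        rules.append(("g", {"type": "arp_request", "arp_spa": ip}))
    # One pair of rules per unique remote (IP, port) pair.
    for ip, port in sorted(set(remotes)):
        rules.append(("h", {"type": "tcp", "nw_dst": ip, "tp_dst": port}))
        rules.append(("i", {"type": "tcp", "nw_src": ip, "tp_src": port}))
    return rules
```
+
+In this sketch, two remotes that share an IP address but use different TCP
+ports yield a single (f)/(g) pair and two (h)/(i) pairs.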
+
+The goal of these rules is to be as narrow as possible to allow a switch to
+join a network and be able to communicate with the remotes.  As mentioned
+earlier, these rules have higher priority than the controller's rules, so if
+they are too broad, they may prevent the controller from implementing its
+policy.  As such, in-band actively monitors some aspects of flow and packet
+processing so that the rules can be made more precise.
+
+In-band control monitors attempts to add flows into the datapath that could
+interfere with its duties.  The datapath only allows exact match entries, so
+in-band control is able to be very precise about the flows it prevents.  Flows
+that miss in the datapath are sent to userspace to be processed, so preventing
+these flows from being cached in the "fast path" does not affect correctness.
+The only type of flow that is currently prevented is one that would prevent
+DHCP replies from being seen by the local port.  For example, a rule that
+forwarded all DHCP traffic to the controller would not be allowed, but one that
+forwarded to all ports (including the local port) would.
+
+As mentioned earlier, packets that miss in the datapath are sent to the
+userspace for processing.  The userspace has its own flow table, the
+"classifier", so in-band checks whether any special processing is needed before
+the classifier is consulted.  If a packet is a DHCP response to a request from
+the local port, the packet is forwarded to the local port, regardless of the
+flow table.  Note that this requires L7 processing of DHCP replies to determine
+whether the 'chaddr' field matches the MAC address of the local port.
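+
+The 'chaddr' check can be sketched as follows, assuming the caller has already
+isolated the UDP payload of a packet sent to the DHCP client port.  The
+function name is hypothetical; the field offsets follow the fixed BOOTP header
+layout used by DHCP.

```python
BOOTREPLY = 2        # BOOTP "op" value for replies
CHADDR_OFFSET = 28   # offset of the 16-byte client hardware address field


def is_dhcp_reply_for(payload, local_mac):
    """Return True if a BOOTP/DHCP payload is a reply addressed to local_mac.

    payload:   bytes of the UDP payload (the BOOTP/DHCP message).
    local_mac: 6-byte MAC address of the local port.
    """
    if len(payload) < CHADDR_OFFSET + 16:
        return False  # too short to contain the whole chaddr field
    op, hlen = payload[0], payload[2]
    if op != BOOTREPLY or hlen != 6:
        return False  # not a reply, or not an Ethernet-sized address
    return payload[CHADDR_OFFSET:CHADDR_OFFSET + 6] == local_mac
```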
+
+It is interesting to note that for an L3-based in-band control mechanism, the
+majority of rules are devoted to ARP traffic.  At first glance, some of these
+rules appear redundant.  However, each serves an important role.  First, in
+order to determine the MAC address of the remote side (controller or gateway)
+for other ARP rules, we must allow ARP traffic for our local port with rules
+(b) and (c).  If we are between a switch and its connection to the remote, we
+have to allow the other switch's ARP traffic through.  This is done with
+rules (d) and (e), since we do not know the addresses of the other switches a
+priori, but do know the remote's or gateway's.  Finally, if the remote is
+running in a local guest VM that is not reached through the local port, the
+switch that is connected to the VM must allow ARP traffic based on the remote's
+IP address, since it will not know the MAC address of the local port that is
+sending the traffic or the MAC address of the remote in the guest VM.
+
+With a few notable exceptions below, in-band should work in most network
+setups.  The following are considered "supported" in the current
+implementation:
+
+- Locally Connected.  The switch and remote are on the same subnet.  This uses
+  rules (a), (b), (c), (h), and (i).
+
+- Reached through Gateway.  The switch and remote are on different subnets and
+  must go through a gateway.  This uses rules (a), (b), (c), (h), and (i).
+
+- Between Switch and Remote.  This switch is between another switch and the
+  remote, and we want to allow the other switch's traffic through.  This uses
+  rules (d), (e), (h), and (i).  It uses (b) and (c) indirectly in order to
+  know the MAC address for rules (d) and (e).  Note that DHCP for the other
+  switch will not work unless an OpenFlow controller explicitly lets this
+  switch pass the traffic.
+
+- Between Switch and Gateway.  This switch is between another switch and the
+  gateway, and we want to allow the other switch's traffic through.  This uses
+  the same rules and logic as the "Between Switch and Remote" configuration
+  described earlier.
+
+- Remote on Local VM.  The remote is a guest VM on the system running in-band
+  control.  This uses rules (a), (b), (c), (h), and (i).
+
+- Remote on Local VM with Different Networks.  The remote is a guest VM on the
+  system running in-band control, but the local port is not used to connect to
+  the remote.  For example, an IP address is configured on eth0 of the switch.
+  The remote's VM is connected through eth1 of the switch, but an IP address
+  has not been configured for that port on the switch.  As such, the switch
+  will use eth0 to connect to the remote, and eth1's rules about the local port
+  will not work.  In the example, the switch attached to eth0 would use rules
+  (a), (b), (c), (h), and (i) on eth0.  The switch attached to eth1 would use
+  rules (f), (g), (h), and (i).
+
+The following are explicitly *not* supported by in-band control:
+
+- Specify Remote by Name.  Currently, the remote must be identified by IP
+  address.  A naive approach would be to permit all DNS traffic.
+  Unfortunately, this would prevent the controller from defining any policy
+  over DNS.  Since switches that are located behind us need to connect to the
+  remote, in-band cannot simply add a rule that allows DNS traffic from the
+  local port.  The "correct" way to support this is to parse DNS requests to
+  allow all traffic related to a request for the remote's name through.  Due to
+  the potential security problems and amount of processing, we decided to hold
+  off for the time-being.
+
+- Differing Remotes for Switches.  All switches must know the L3 addresses for
+  all the remotes that other switches may use, since rules need to be set up to
+  allow traffic related to those remotes through.  See rules (f), (g), (h), and
+  (i).
+
+- Differing Routes for Switches.  In order for the switch to allow other
+  switches to connect to a remote through a gateway, it allows the gateway's
+  traffic through with rules (d) and (e).  If the routes to the remote differ
+  for the two switches, we will not know the MAC address of the alternate
+  gateway.
+
+Action Reproduction
+-------------------
+
+It seems likely that many controllers, at least at startup, use the OpenFlow
+"flow statistics" request to obtain existing flows, then compare the flows'
+actions against the actions that they expect to find.  Before version 1.8.0,
+Open vSwitch always returned exact, byte-for-byte copies of the actions that
+had been added to the flow table.  The current version of Open vSwitch does not
+do this in some exceptional cases.  This section lists the exceptions
+that controller authors must keep in mind if they compare actual actions
+against desired actions in a bytewise fashion:
+
+- Open vSwitch zeros padding bytes in action structures, regardless of their
+  values when the flows were added.
+
+- Open vSwitch "normalizes" the instructions in OpenFlow 1.1 (and later) in the
+  following way:
+
+  * OVS sorts the instructions into the following order: Apply-Actions,
+    Clear-Actions, Write-Actions, Write-Metadata, Goto-Table.
+
+  * OVS drops Apply-Actions instructions that have empty action lists.
+
+  * OVS drops Write-Actions instructions that have empty action sets.
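+
+A controller that compares instructions can apply the same normalization to
+its own expected instructions before comparing.  The sketch below assumes a
+hypothetical representation in which each instruction is a (name, payload)
+pair; it is not taken from the Open vSwitch sources.

```python
# Order into which instructions are sorted, as listed above.
INSTRUCTION_ORDER = [
    "Apply-Actions",
    "Clear-Actions",
    "Write-Actions",
    "Write-Metadata",
    "Goto-Table",
]


def normalize_instructions(instructions):
    """Normalize a list of (name, payload) instruction pairs.

    For Apply-Actions and Write-Actions the payload is a list of actions;
    an empty list means the instruction is dropped.
    """
    kept = [
        (name, payload)
        for name, payload in instructions
        if not (name in ("Apply-Actions", "Write-Actions") and not payload)
    ]
    return sorted(kept, key=lambda ins: INSTRUCTION_ORDER.index(ins[0]))
```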
+
+Please report other discrepancies, if you notice any, so that we can fix or
+document them.
+
+Suggestions
+-----------
+
+Suggestions to improve Open vSwitch are welcome at discuss@openvswitch.org.
diff --git a/Documentation/topics/dpdk.rst b/Documentation/topics/dpdk.rst

new file mode 100644 (file)

index 0000000..74e0266
--- /dev/null
+++ b/Documentation/topics/dpdk.rst
@@ -0,0 +1,28 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+================
+DPDK Integration
+================
+
+**TODO**
diff --git a/Documentation/topics/high-availability.rst b/Documentation/topics/high-availability.rst

new file mode 100644 (file)

index 0000000..5b21b64
--- /dev/null
+++ b/Documentation/topics/high-availability.rst
@@ -0,0 +1,426 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+==================================
+OVN Gateway High Availability Plan
+==================================
+
+::
+
+    OVN Gateway
+
+         +---------------------------+
+         |                           |
+         |     External Network      |
+         |                           |
+         +-------------^-------------+
+                       |
+                       |
+                 +-----------+
+                 |           |
+                 |  Gateway  |
+                 |           |
+                 +-----------+
+                       ^
+                       |
+                       |
+         +-------------v-------------+
+         |                           |
+         |    OVN Virtual Network    |
+         |                           |
+         +---------------------------+
+
+The OVN gateway is responsible for shuffling traffic between the tunneled
+overlay network (governed by ovn-northd), and the legacy physical network.  In
+a naive implementation, the gateway is a single x86 server, or hardware VTEP.
+For most deployments, a single system has enough forwarding capacity to service
+the entire virtualized network.  However, it introduces a single point of
+failure.  If this system dies, the entire OVN deployment becomes unavailable.
+To mitigate this risk, an HA solution is critical -- by spreading
+responsibility across multiple systems, no single server failure can take down
+the network.
+
+An HA solution is both critical to the manageability of the system, and
+extremely difficult to get right.  The purpose of this document is to propose
+a plan for OVN Gateway High Availability which takes into account our past
+experience building similar systems.  It should be considered a fluid, changing
+proposal, not a set-in-stone decree.
+
+Basic Architecture
+------------------
+
+In an OVN deployment, the set of hypervisors and network elements operating
+under the guidance of ovn-northd are in what's called "logical space".  These
+servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
+the underlying physical network.  When these systems need to communicate with
+legacy networks, traffic must be routed through a Gateway which translates from
+OVN controlled tunnel traffic, to raw physical network traffic.
+
+Since the gateway is typically the only system with a connection to the
+physical network, all traffic between logical space and the WAN must travel
+through it.  This makes it a critical single point of failure -- if the gateway
+dies, communication with the WAN ceases for all systems in logical space.
+
+To mitigate this risk, multiple gateways should be run in a "High Availability
+Cluster" or "HA Cluster".  The HA cluster will be responsible for performing
+the duties of a gateway, while being able to recover gracefully from
+individual member failures.
+
+::
+
+    OVN Gateway HA Cluster
+
+             +---------------------------+
+             |                           |
+             |     External Network      |
+             |                           |
+             +-------------^-------------+
+                           |
+                           |
+    +----------------------v----------------------+
+    |                                             |
+    |          High Availability Cluster          |
+    |                                             |
+    | +-----------+  +-----------+  +-----------+ |
+    | |           |  |           |  |           | |
+    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
+    | |           |  |           |  |           | |
+    | +-----------+  +-----------+  +-----------+ |
+    +----------------------^----------------------+
+                           |
+                           |
+             +-------------v-------------+
+             |                           |
+             |    OVN Virtual Network    |
+             |                           |
+             +---------------------------+
+
+L2 vs L3 High Availability
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to achieve this goal, there are two broad approaches one can take.
+The HA cluster can appear to the network like a giant Layer 2 Ethernet Switch,
+or like a giant IP Router.  These approaches are called L2HA and L3HA,
+respectively.  L2HA allows Ethernet broadcast domains to extend into logical
+space, a significant advantage, but this comes at a cost.  The need to avoid
+transient L2 loops during failover significantly complicates its design.  On
+the other hand, L3HA works for most use cases, is simpler, and fails more
+gracefully.  For these reasons, it is suggested that OVN supports an L3HA
+model, leaving L2HA for future work (or third party VTEP providers).  Both
+models are discussed further below.
+
+L3HA
+----
+
+In this section, we'll work through a basic L3HA implementation, on top of
+which we'll gradually build more sophisticated features, explaining their
+motivations and implementations as we go.
+
+Naive active-backup
+~~~~~~~~~~~~~~~~~~~
+
+Let's assume that there is a collection of logical routers which a tenant has
+asked for.  Our task is to schedule these logical routers on one of N gateways,
+and gracefully redistribute the routers from gateways which have failed.  The
+absolute simplest way to achieve this is what we'll call "naive active-backup".
+
+::
+
+    Naive Active Backup HA Implementation
+
+    +----------------+   +----------------+
+    | Leader         |   | Backup         |
+    |                |   |                |
+    |      A B C     |   |                |
+    |                |   |                |
+    +----+-+-+-+----++   +-+--------------+
+         ^ ^ ^ ^    |      |
+         | | | |    |      |
+         | | | |  +-+------+---+
+         + + + +  | ovn-northd |
+         Traffic  +------------+
+
+In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
+leader.  All logical routers (A, B, C in the figure), are scheduled on this
+leader gateway and all traffic flows through it.  ovn-northd monitors this
+gateway via OpenFlow echo requests (or some equivalent), and if the gateway
+dies, it recreates the routers on one of the backups.
+
+This approach basically works in most cases and should likely be the starting
+point for OVN -- it's strictly better than no HA solution and is a good
+foundation for more sophisticated solutions.  That said, it's not without its
+limitations.  Specifically, this approach doesn't coordinate with the physical
+network to minimize disruption during failures, it tightly couples failover to
+ovn-northd (we'll discuss why this is bad in a bit), and it wastes resources by
+leaving backup gateways completely unutilized.
+
+Router Failover
++++++++++++++++
+
+When ovn-northd notices the leader has died and decides to migrate routers to a
+backup gateway, the physical network has to be notified to direct traffic to
+the new gateway.  Otherwise, traffic could be blackholed for longer than
+necessary, making failovers worse than they need to be.
+
+For now, let's assume that OVN requires all gateways to be on the same IP
+subnet on the physical network.  If this isn't the case, gateways would need to
+participate in routing protocols to orchestrate failovers, something which is
+difficult and out of scope of this document.
+
+Since all gateways are on the same IP subnet, we simply need to worry about
+updating the MAC learning tables of the Ethernet switches on that subnet.
+Presumably, they all have entries for each logical router pointing to the old
+leader.  If these entries aren't updated, all traffic will be sent to the (now
+defunct) old leader, instead of the new one.
+
+In order to mitigate this issue, it's recommended that the new gateway sends a
+Reverse ARP (RARP) onto the physical network for each logical router it now
+controls.  RARP is a benign protocol used by many hypervisors when
+virtual machines migrate to update L2 forwarding tables.  In this case, the
+Ethernet source address of the RARP is that of the logical router it
+corresponds to, and its destination is the broadcast address.  This causes the
+RARP to travel to every L2 switch in the broadcast domain, updating forwarding
+tables accordingly.  This strategy is recommended in all failover mechanisms
+discussed in this document -- when a router newly boots on a new leader, it
+should RARP its MAC address.
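+
+For illustration, a RARP announcement of this kind could be built as shown
+below.  This is a minimal sketch: the function name is hypothetical, and the
+choice to put the router's MAC in both hardware address fields with zeroed
+protocol addresses is an assumption, not something this document specifies.
+Only the Ethernet source address and the broadcast destination matter for
+updating L2 forwarding tables.

```python
import struct

ETH_TYPE_RARP = 0x8035   # EtherType for Reverse ARP
RARP_OP_REQUEST = 3      # "request reverse" opcode


def build_rarp(router_mac):
    """Return a broadcast RARP frame sourced from router_mac (6 bytes)."""
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType.
    eth = broadcast + router_mac + struct.pack("!H", ETH_TYPE_RARP)
    # Fixed part: htype=Ethernet(1), ptype=IPv4, hlen=6, plen=4, opcode.
    body = struct.pack("!HHBBH", 1, 0x0800, 6, 4, RARP_OP_REQUEST)
    body += router_mac + bytes(4)   # sender hardware / protocol address
    body += router_mac + bytes(4)   # target hardware / protocol address
    return eth + body
```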
+
+Controller Independent Active-backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    Controller Independent Active-Backup Implementation
+
+    +----------------+   +----------------+
+    | Leader         |   | Backup         |
+    |                |   |                |
+    |      A B C     |   |                |
+    |                |   |                |
+    +----------------+   +----------------+
+         ^ ^ ^ ^
+         | | | |
+         | | | |
+         + + + +
+         Traffic
+
+The fundamental problem with naive active-backup is that it tightly couples
+the failover solution to ovn-northd.  This can significantly increase downtime
+in the event of a failover as the (often already busy) ovn-northd controller
+has to recompute state for the new leader.  Worse, if ovn-northd goes down, we
+can't perform gateway failover at all.  This violates the principle that
+control plane outages should have no impact on dataplane functionality.
+
+In a controller independent active-backup configuration, ovn-northd is
+responsible for initial configuration while the HA cluster is responsible for
+monitoring the leader, and failing over to a backup if necessary.  ovn-northd
+sets HA policy, but doesn't actively participate when failovers occur.
+
+Of course, in this model, ovn-northd is not without some responsibility.  Its
+role is to pre-plan what should happen in the event of a failure, leaving it to
+the individual switches to execute this plan.  It does this by assigning each
+gateway a unique leadership priority.  Once assigned, it communicates this
+priority to each node it controls.  Nodes use the leadership priority to
+determine which gateway in the cluster is the active leader by using a simple
+metric: the leader is the gateway that is healthy, with the highest priority.
+If that gateway goes down, leadership falls to the next highest priority, and
+conversely, if a new gateway comes up with a higher priority, it takes over
+leadership.
+
+Thus, in this model, leadership of the HA cluster is determined simply by the
+status of its members.  Therefore if we can communicate the status of each
+gateway to each transport node, they can individually figure out which is the
+leader, and direct traffic accordingly.
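+
+The leadership metric described above amounts to a one-line computation on
+each node.  The following sketch uses hypothetical names and assumes that each
+node already knows the priorities and which gateways are currently healthy.

```python
def elect_leader(priorities, alive):
    """Return the healthy gateway with the highest priority, or None.

    priorities: dict mapping gateway name to its unique leadership priority.
    alive:      set of gateway names currently considered healthy.
    """
    candidates = [gw for gw in priorities if gw in alive]
    if not candidates:
        return None
    return max(candidates, key=priorities.get)
```
+
+Because the priorities are unique and every node applies the same rule, nodes
+that agree on which gateways are healthy also agree on the leader.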
+
+Tunnel Monitoring
++++++++++++++++++
+
+Since in this model leadership is determined exclusively by the health status
+of member gateways, a key problem is how to communicate this information to
+the relevant transport nodes.  Luckily, we can do this fairly cheaply using
+tunnel monitoring protocols like BFD.
+
+The basic idea is pretty straightforward.  Each transport node maintains a
+tunnel to every gateway in the HA cluster (not just the leader).  These tunnels
+are monitored using the BFD protocol to see which are alive.  Given this
+information, hypervisors can trivially compute the highest priority live
+gateway, and thus the leader.
+
+In practice, this leadership computation can be performed trivially using the
+bundle or group action.  Rather than using OpenFlow to simply output to the
+leader, all gateways could be listed in an active-backup bundle action ordered
+by their priority.  The bundle action will automatically take into account the
+tunnel monitoring status to output the packet to the highest priority live
+gateway.
+
+Inter-Gateway Monitoring
+++++++++++++++++++++++++
+
+One somewhat subtle aspect of this model is that failovers are not globally
+atomic.  When a failover occurs, it will take some time for all hypervisors to
+notice and adjust accordingly.  Similarly, if a new high priority Gateway comes
+up, it may take some time for all hypervisors to switch over to the new leader.
+In order to avoid confusing the physical network, under these circumstances
+it's important for the backup gateways to drop traffic they've received
+erroneously.  In order to do this, each Gateway must know whether or not it is,
+in fact, active.  This can be achieved by creating a mesh of tunnels between
+gateways.  Each gateway monitors the other gateways in its cluster to determine
+which are alive, and therefore whether or not that gateway happens to be the
+leader.  If leading, the gateway forwards traffic normally; otherwise, it drops
+all traffic.
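+
+From the point of view of a single gateway, the forward-or-drop decision could
+look like the sketch below.  The names are hypothetical, and the sketch assumes
+the gateway considers itself healthy and learns peer liveness from the tunnel
+mesh.

```python
def should_forward(self_name, priorities, peers_alive):
    """Return True if this gateway believes it is the leader.

    self_name:   name of the gateway making the decision.
    priorities:  dict mapping gateway name to leadership priority.
    peers_alive: set of peer gateways whose tunnels are reported up.
    """
    alive = set(peers_alive) | {self_name}
    leader = max(alive, key=lambda gw: priorities[gw])
    return leader == self_name
```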
+
+Gateway Leadership Resignation
+++++++++++++++++++++++++++++++
+
+Sometimes a gateway may be healthy, but still may not be suitable to lead the
+HA cluster.  This could happen for several reasons including:
+
+* The physical network is unreachable
+
+* BFD (or ping) has detected the next hop router is unreachable
+
+* The Gateway recently booted and isn't fully configured
+
+In this case, the Gateway should resign leadership by holding its tunnels down
+using the ``other_config:cpath_down`` flag.  This indicates to participating
+hypervisors and Gateways that this gateway should be treated as if it's down,
+even though its tunnels are still healthy.
+
+Router Specific Active-Backup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    Router Specific Active-Backup
+
+    +----------------+ +----------------+
+    |                | |                |
+    |      A C       | |     B D E      |
+    |                | |                |
+    +----------------+ +----------------+
+                  ^ ^   ^ ^
+                  | |   | |
+                  | |   | |
+                  + +   + +
+                   Traffic
+
+Controller independent active-backup is a great advance over naive
+active-backup, but it still has one glaring problem -- it under-utilizes the
+backup gateways.  In an ideal scenario, all traffic would split evenly among
+the live set of gateways.  Getting all the way there is somewhat tricky, but as
+a step in that direction, one could use the "Router Specific Active-Backup"
+algorithm.  This algorithm looks a lot like active-backup on a per logical
+router basis, with one twist.  It chooses a different active Gateway for each
+logical router.  Thus, in situations where there are several logical routers,
+all with somewhat balanced load, this algorithm performs better.
+
+Implementation of this strategy is quite straightforward if built on top of
+basic controller independent active-backup.  On a per logical router basis, the
+algorithm is the same: leadership is determined by the liveness of the
+gateways.  The key difference here is that the gateways must have a different
+leadership priority for each logical router.  These leadership priorities can
+be computed by ovn-northd just as they had been in the controller independent
+active-backup model.
+
+Once we have these per logical router priorities, they simply need to be
+communicated to the members of the gateway cluster and the hypervisors.  The
+hypervisors in particular, need simply have an active-backup bundle action (or
+group action) per logical router listing the gateways in priority order for
+*that router*, rather than having a single bundle action shared for all the
+routers.
+
+Additionally, the gateways need to be updated to take into account individual
+router priorities.  Specifically, each gateway should drop traffic of backup
+routers it's running, and forward traffic of active routers, instead of simply
+dropping or forwarding everything.  This should likely be done by having
+ovn-controller recompute OpenFlow for the gateway, though other options exist.
+
+The final complication is that ovn-northd's logic must be updated to choose
+these per logical router leadership priorities in a more sophisticated manner.
+It doesn't matter much exactly what algorithm it chooses to do this, beyond
+that it should provide good balancing in the common case, i.e. each logical
+router's priorities should differ enough that routers balance to different
+gateways even when failures occur.
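+
+As an illustration only, one way ovn-northd could derive such priorities is
+rendezvous (highest-random-weight) hashing.  This is a sketch under stated
+assumptions, not the algorithm OVN implements: the function names, the use of
+SHA-256, and the string identifiers for routers and gateways are all
+hypothetical.
+

```python
import hashlib


def gateway_priorities(router, gateways):
    """Order gateways from highest to lowest leadership priority for a router.

    Hypothetical scheme: every (router, gateway) pair gets a pseudo-random
    score, so different routers tend to prefer different gateways.
    """
    def score(gateway):
        digest = hashlib.sha256(f"{router}/{gateway}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(gateways, key=score, reverse=True)


def active_gateway(router, gateways, live):
    """Return the highest-priority live gateway for a router, or None."""
    for gateway in gateway_priorities(router, gateways):
        if gateway in live:
            return gateway
    return None
```

+
+Because each score depends only on the (router, gateway) pair, losing one
+gateway does not reorder the remaining gateways for any router, so only the
+routers led by the failed gateway fail over.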
+
+Preemption
+++++++++++
+
+In an active-backup setup, one issue that users will run into is that of
+gateway leader preemption.  If a new Gateway is added to a cluster, or for some
+reason an existing gateway is rebooted, we could end up in a situation where
+the newly activated gateway has higher priority than any other in the HA
+cluster.  In this case, as soon as that gateway appears, it will preempt
+leadership from the currently active leader causing an unnecessary failover.
+Since failover can be quite expensive, this preemption may be undesirable.
+
+The controller can optionally avoid preemption by cleverly tweaking the
+leadership priorities.  For each router, new gateways should be assigned
+priorities that put them second in line or later when they eventually come up.
+Furthermore, if a gateway goes down for a significant period of time, its old
+leadership priorities should be revoked and new ones should be assigned as if
+it's a brand new gateway.  Note that this should only happen if a gateway has
+been down for a while (several minutes); otherwise a flapping gateway could
+have wide-ranging, unpredictable consequences.
+
+Note that preemption avoidance should be optional depending on the deployment.
+One necessarily sacrifices optimal load balancing to satisfy these requirements
+as new gateways will get no traffic on boot.  Thus, this feature represents a
+trade-off which must be made on a per installation basis.
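+
+The two rules above can be sketched as follows.  This is a minimal
+illustration, assuming that the controller keeps, per router, a list of
+gateways ordered from highest to lowest priority; the function names, the
+``live`` set, and the 300 second grace period are hypothetical.
+

```python
def add_gateway(order, new_gateway, live):
    """Insert a new gateway without preempting the current leader.

    `order` lists gateways for one router, highest priority first.  The new
    gateway goes directly behind the first live gateway (the current leader),
    which puts it second in line or later.
    """
    if new_gateway in order:
        return list(order)
    for index, gateway in enumerate(order):
        if gateway in live:
            return order[:index + 1] + [new_gateway] + order[index + 1:]
    # No live gateway exists, so there is no leader to preempt.
    return list(order) + [new_gateway]


def revoke_stale(order, down_since, now, grace_seconds=300.0):
    """Drop gateways that have been down for at least `grace_seconds`.

    `down_since` maps a down gateway to the time it went down.  A revoked
    gateway is later re-added through add_gateway() as if it were new.
    """
    return [
        gateway for gateway in order
        if gateway not in down_since
        or now - down_since[gateway] < grace_seconds
    ]
```

+
+Keeping the grace period in the range of minutes means a briefly flapping
+gateway keeps its old priorities, as the text above recommends.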
+
+Fully Active-Active HA
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+    Fully Active-Active HA
+
+    +----------------+ +----------------+
+    |                | |                |
+    |   A B C D E    | |    A B C D E   |
+    |                | |                |
+    +----------------+ +----------------+
+                  ^ ^   ^ ^
+                  | |   | |
+                  | |   | |
+                  + +   + +
+                   Traffic
+
+The final step in L3HA is to have true active-active HA.  In this scenario each
+router has an instance on each Gateway, and a mechanism similar to ECMP is used
+to distribute traffic evenly among all instances.  This mechanism would require
+Gateways to participate in routing protocols with the physical network to
+attract traffic and alert of failures.  It is out of scope of this document,
+but may eventually be necessary.
+
+L2HA
+----
+
+L2HA is very difficult to get right.  Unlike L3HA, where the consequences of
+problems are minor, in L2HA if two gateways are both transiently active, an L2
+loop triggers and a broadcast storm results.  In practice to get around this,
+gateways end up implementing an overly conservative "when in doubt drop all
+traffic" policy, or they implement something like MLAG.
+
+MLAG has multiple gateways work together to pretend to be a single L2 switch
+with a large LACP bond.  In principle, it's the right solution to the problem
+as it solves the broadcast storm problem, and has been deployed successfully in
+other contexts.  That said, it's difficult to get right and not recommended.
diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst

index 33f157718d8fd11e17c3fd84f876a6c106155950..30f74fe59b4b2d26c804c052f456fda5fa864653 100644 (file)
--- a/Documentation/topics/index.rst
+++ b/Documentation/topics/index.rst
@@ -32,3 +32,18 @@ that way.
  
  .. toctree::
     :maxdepth: 2
+
+   design
+   datapath
+   integration
+   porting
+   openflow
+   bonding
+   ovsdb-replication
+   dpdk
+   windows
+
+.. toctree::
+   :maxdepth: 2
+
+   high-availability
diff --git a/Documentation/topics/integration.rst b/Documentation/topics/integration.rst

new file mode 100644 (file)

index 0000000..e3c2092
--- /dev/null
+++ b/Documentation/topics/integration.rst
@@ -0,0 +1,257 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+=========================================
+Integration Guide for Centralized Control
+=========================================
+
+This document describes how to integrate Open vSwitch onto a new platform to
+expose the state of the switch and attached devices for centralized control.
+(If you are looking to port the switching components of Open vSwitch to a new
+platform, refer to :doc:`porting`)  The focus of this guide is on hypervisors,
+but many of the interfaces are useful for hardware switches, as well.  The
+XenServer integration is the most mature implementation, so most of the
+examples are drawn from it.
+
+The externally visible interface to this integration is platform-agnostic.  We
+encourage anyone who integrates Open vSwitch to use the same interface, because
+keeping a uniform interface means that controllers require less customization
+for individual platforms (and perhaps no customization at all).
+
+Integration centers around the Open vSwitch database and mostly involves the
+``external_ids`` columns in several of the tables.  These columns are not
+interpreted by Open vSwitch itself.  Instead, they provide information to a
+controller that permits it to associate a database record with a more
+meaningful entity.  In contrast, the ``other_config`` column is used to
+configure behavior of the switch.  The main job of the integrator, then, is to
+ensure that these values are correctly populated and maintained.
+
+An integrator sets the columns in the database by talking to the ovsdb-server
+daemon.  A few of the columns can be set during startup by calling the ovs-ctl
+tool from inside the startup scripts.  The ``xenserver/etc_init.d_openvswitch``
+script provides examples of its use, and the ovs-ctl(8) manpage contains
+complete documentation.  At runtime, ovs-vsctl can be used to set columns in
+the database.  The script ``xenserver/etc_xensource_scripts_vif`` contains
+examples of its use, and the ovs-vsctl(8) manpage contains complete
+documentation.
+
+Python and C bindings to the database are provided if deeper integration with a
+program is needed.  The XenServer ovs-xapi-sync daemon
+(``xenserver/usr_share_openvswitch_scripts_ovs-xapi-sync``) provides an example
+of using the Python bindings.  More information on the Python bindings is
+available at ``python/ovs/db/idl.py``.  Information on the C bindings is
+available at ``lib/ovsdb-idl.h``.
+
+The following diagram shows how integration scripts fit into the Open vSwitch
+architecture:
+
+::
+
+    Diagram
+
+             +----------------------------------------+
+             |           Controller Cluster           |
+             +----------------------------------------+
+                                 |
+                                 |
+    +----------------------------------------------------------+
+    |                            |                             |
+    |             +--------------+---------------+             |
+    |             |                              |             |
+    |   +-------------------+           +------------------+   |
+    |   |   ovsdb-server    |-----------|   ovs-vswitchd   |   |
+    |   +-------------------+           +------------------+   |
+    |             |                              |             |
+    |  +---------------------+                   |             |
+    |  | Integration scripts |                   |             |
+    |  | (ex: ovs-xapi-sync) |                   |             |
+    |  +---------------------+                   |             |
+    |                                            |   Userspace |
+    |----------------------------------------------------------|
+    |                                            |      Kernel |
+    |                                            |             |
+    |                                 +---------------------+  |
+    |                                 |  OVS Kernel Module  |  |
+    |                                 +---------------------+  |
+    +----------------------------------------------------------+
+
+A description of the most relevant fields for integration follows.  By setting
+these values, controllers are able to understand the network and manage it more
+dynamically and precisely.  For more details about the database and each
+individual column, please refer to the ovs-vswitchd.conf.db(5) manpage.
+
+``Open_vSwitch`` table
+----------------------
+
+The ``Open_vSwitch`` table describes the switch as a whole.  The
+``system_type`` and ``system_version`` columns identify the platform to the
+controller.  The ``external_ids:system-id`` key uniquely identifies the
+physical host.  In XenServer, the system-id will likely be the same as the UUID
+returned by ``xe host-list``. This key allows controllers to distinguish
+between multiple hypervisors.
+
+Most of this configuration can be done with the ovs-ctl command at startup.
+For example:
+
+::
+
+    $ ovs-ctl --system-type="XenServer" --system-version="6.0.0-50762p" \
+        --system-id="${UUID}" "${other_options}" start
+
+Alternatively, the ovs-vsctl command may be used to set a particular value at
+runtime.  For example:
+
+::
+
+    $ ovs-vsctl set open_vswitch . external-ids:system-id='"${UUID}"'
+
+The ``other_config:enable-statistics`` key may be set to ``true`` to have OVS
+populate the database with statistics (e.g., number of CPUs, memory, system
+load) for the controller's use.
+
+Bridge table
+------------
+
+The Bridge table describes individual bridges within an Open vSwitch instance.
+The ``external_ids:bridge-id`` key uniquely identifies a particular bridge.  In
+XenServer, this will likely be the same as the UUID returned by ``xe
+network-list`` for that particular bridge.
+
+For example, to set the identifier for bridge "br0", the following command can
+be used:
+
+::
+
+    $ ovs-vsctl set Bridge br0 external-ids:bridge-id='"${UUID}"'
+
+The MAC address of the bridge may be manually configured by setting it with the
+``other_config:hwaddr`` key.  For example:
+
+::
+
+    $ ovs-vsctl set Bridge br0 other_config:hwaddr="12:34:56:78:90:ab"
+
+Interface table
+---------------
+
+The Interface table describes an interface under the control of Open vSwitch.
+The ``external_ids`` column contains keys that are used to provide additional
+information about the interface:
+
+attached-mac
+
+  This field contains the MAC address of the device attached to the interface.
+  On a hypervisor, this is the MAC address of the interface as seen inside a
+  VM.  It does not necessarily correlate to the host-side MAC address.  For
+  example, on XenServer, the MAC address on a VIF in the hypervisor is always
+  FE:FF:FF:FF:FF:FF, but inside the VM a normal MAC address is seen.
+
+iface-id
+
+  This field uniquely identifies the interface.  In hypervisors, this allows
+  the controller to follow VM network interfaces as VMs migrate.  A well-chosen
+  identifier should also allow an administrator or a controller to associate
+  the interface with the corresponding object in the VM management system.  For
+  example, the Open vSwitch integration with XenServer by default uses the
+  XenServer assigned UUID for a VIF record as the iface-id.
+
+iface-status
+
+  In a hypervisor, there are situations where there are multiple interface
+  choices for a single virtual ethernet interface inside a VM.  Valid values
+  are "active" and "inactive".  A complete description is available in the
+  ovs-vswitchd.conf.db(5) manpage.
+
+vm-id
+
+  This field uniquely identifies the VM to which this interface belongs.  A
+  single VM may have multiple interfaces attached to it.
+
+As in the previous tables, the ovs-vsctl command may be used to configure the
+values.  For example, to set the ``iface-id`` on eth0, the following command
+can be used:
+
+::
+
+    $ ovs-vsctl set Interface eth0 external-ids:iface-id='"${UUID}"'
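+
+An integration script usually sets several of these keys together.  The
+following sketch only builds the ovs-vsctl argument list; the function name
+and its parameters are hypothetical, and it assumes the values contain no
+double quotes.
+

```python
def interface_ids_command(iface, attached_mac, iface_id, vm_id,
                          iface_status="active"):
    """Build an `ovs-vsctl set Interface` command for the external_ids keys.

    Values are wrapped in double quotes so that ovs-vsctl stores them as
    strings, matching the quoting used in the command above.
    """
    if iface_status not in ("active", "inactive"):
        raise ValueError("iface-status must be 'active' or 'inactive'")
    ids = {
        "attached-mac": attached_mac,
        "iface-id": iface_id,
        "iface-status": iface_status,
        "vm-id": vm_id,
    }
    argv = ["ovs-vsctl", "set", "Interface", iface]
    argv += [f'external_ids:{key}="{value}"' for key, value in ids.items()]
    return argv
```

+
+The resulting list can be passed to ``subprocess.run(argv, check=True)``,
+which bypasses the shell and therefore needs no extra layer of quoting.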
+
+
+HA for OVN DB servers using pacemaker
+-------------------------------------
+
+The ovsdb servers can work in either active or backup mode. In backup mode, the
+db server connects to an active server and replicates the active server's
+contents. At all times, data can be transacted only through the active server.
+When the active server dies for some reason, all OVN operations stall.
+
+`Pacemaker <http://clusterlabs.org/pacemaker.html>`__ is a cluster resource
+manager which can manage a defined set of resources across a set of clustered
+nodes. Pacemaker manages the resources with the help of resource agents.
+One type of resource agent is `OCF
+<http://www.linux-ha.org/wiki/OCF_Resource_Agents>`__.
+
+An OCF resource agent is a shell script which accepts a set of actions and
+returns an appropriate status code.
+
+With the help of the OCF resource agent ovn/utilities/ovndb-servers.ocf, one
+can define a resource for pacemaker such that pacemaker will always maintain
+one running active server at any time.
+
+After creating a pacemaker cluster, use the following commands to create one
+active and multiple backup servers for OVN databases::
+
+    $ pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
+         master_ip=x.x.x.x \
+         ovn_ctl=<path of the ovn-ctl script> \
+         op monitor interval="10s" \
+         op monitor role=Master interval="15s"
+    $ pcs resource master ovndb_servers-master ovndb_servers \
+        meta notify="true"
+
+``master_ip`` and ``ovn_ctl`` are the parameters used by the OCF script.
+``ovn_ctl`` is optional; if not given, it defaults to
+/usr/share/openvswitch/scripts/ovn-ctl. ``master_ip`` is the IP address on
+which the active database server is expected to be listening.
+
+Whenever the active server dies, pacemaker is responsible for promoting one of
+the backup servers to be active. Both ovn-controller and ovn-northd need the
+IP address at which the active server is listening. Because pacemaker can
+change the node on which the active server runs, it is not efficient to
+instruct all the ovn-controllers and ovn-northd to connect to the latest
+active server's IP address.
+
+This problem can be solved by using a native OCF resource agent,
+``ocf:heartbeat:IPaddr2``. The IPaddr2 resource agent is just a resource with
+an IP address. When we colocate this resource with the active server, pacemaker
+will enable the active server to be reached at a single IP address all the
+time. This is the IP address that needs to be given as the ``master_ip``
+parameter while creating the ``ovndb_servers`` resource.
+
+Use the following command to create the IPaddr2 resource and colocate it
+with the active server::
+
+    $ pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=x.x.x.x \
+        op monitor interval=30s
+    $ pcs constraint order VirtualIP then ovndb_servers-master
+    $ pcs constraint colocation add master ovndb_servers-master with VirtualIP \
+        score=INFINITY
diff --git a/Documentation/topics/openflow.rst b/Documentation/topics/openflow.rst

new file mode 100644 (file)

index 0000000..a2c22a8
--- /dev/null
+++ b/Documentation/topics/openflow.rst
@@ -0,0 +1,419 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+================================
+OpenFlow Support in Open vSwitch
+================================
+
+Open vSwitch support for OpenFlow 1.1 and beyond is a work in progress.  This
+file describes the work still to be done.
+
+The Plan
+--------
+
+OpenFlow version support is not a build-time option.  A single build of Open
+vSwitch must be able to handle all supported versions of OpenFlow.  Ideally,
+even at runtime it should be able to support all protocol versions at the same
+time on different OpenFlow bridges (and perhaps even on the same bridge).
+
+At the same time, it would be a shame to litter the core of the OVS code with
+lots of ugly code concerned with the details of various OpenFlow protocol
+versions.
+
+The primary approach to compatibility is to abstract most of the details of the
+differences from the core code, by adding a protocol layer that translates
+between OF1.x and a slightly higher-level abstract representation.  The core of
+this approach is the many ``struct ofputil_*`` structures in
+``include/openvswitch/ofp-util.h``.
+
+As a consequence of this approach, OVS cannot use OpenFlow protocol definitions
+that closely resemble those in the OpenFlow specification, because
+``openflow.h`` in different versions of the OpenFlow specification defines the
+same identifier with different values.  Instead, ``openflow-common.h`` contains
+definitions that are common to all the specifications and separate protocol
+version-specific headers contain protocol-specific definitions renamed so as
+not to conflict, e.g. ``OFPAT10_ENQUEUE`` and ``OFPAT11_ENQUEUE`` for the
+OpenFlow 1.0 and 1.1 values for ``OFPAT_ENQUEUE``.  Generally, in cases of
+conflict, the protocol layer will define a more abstract ``OFPUTIL_*`` or
+struct ``ofputil_*``.
+
+Here are the current approaches in a few tricky areas:
+
+* Port numbering.
+
+  OpenFlow 1.0 has 16-bit port numbers and later OpenFlow versions have 32-bit
+  port numbers.  For now, OVS support for later protocol versions requires all
+  port numbers to fall into the 16-bit range, translating the reserved
+  ``OFPP_*`` port numbers.
+
+* Actions.
+
+  OpenFlow 1.0 and later versions have very different ideas of actions.  OVS
+  reconciles by translating all the versions' actions (and instructions) to and
+  from a common internal representation.
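+
+The port-number translation from the list above can be sketched as follows.
+The constants are the ``OFPP_MAX`` values from the OpenFlow 1.0 and 1.1
+specifications; the function names are hypothetical and this is an
+illustration of the idea, not the code in OVS.
+

```python
OFPP_MAX = 0xFF00          # first reserved port number in OpenFlow 1.0
OFPP11_MAX = 0xFFFFFF00    # first reserved port number in OpenFlow 1.1+
OFPP11_OFFSET = OFPP11_MAX - OFPP_MAX


def port_from_ofp11(port32):
    """Map a 32-bit OpenFlow 1.1+ port number into the 16-bit range."""
    if port32 < OFPP_MAX:
        return port32                      # ordinary port, unchanged
    if port32 >= OFPP11_MAX:
        return port32 - OFPP11_OFFSET      # reserved OFPP_* port
    raise ValueError(f"port {port32:#x} does not fit in 16 bits")


def port_to_ofp11(port16):
    """Map a 16-bit port number to its OpenFlow 1.1+ equivalent."""
    if port16 < OFPP_MAX:
        return port16
    return port16 + OFPP11_OFFSET
```

+
+For example, ``OFPP_CONTROLLER`` is 0xfffffffd in OpenFlow 1.1 and later and
+0xfffd in OpenFlow 1.0, so the two functions convert between those values.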
+
+OpenFlow 1.1
+------------
+
+The list of remaining work items for OpenFlow 1.1 is below.  It is probably
+incomplete.
+
+* Match and set double-tagged VLANs (QinQ).
+
+  This requires kernel work for reasonable performance.
+
+  (optional for OF1.1+)
+
+* VLANs tagged with 88a8 Ethertype.
+
+  This requires kernel work for reasonable performance.
+
+  (required for OF1.1+)
+
+OpenFlow 1.2
+------------
+
+OpenFlow 1.2 support requires OpenFlow 1.1 as a prerequisite. All the
+additional work specific to OpenFlow 1.2 is complete.  (This is based on the
+change log at the end of the OF1.2 spec.  I didn't compare the specs carefully
+yet.)
+
+OpenFlow 1.3
+------------
+
+OpenFlow 1.3 support requires OpenFlow 1.2 as a prerequisite, plus the
+following additional work.  (This is based on the change log at the end of the
+OF1.3 spec, reusing most of the section titles directly.  I didn't compare the
+specs carefully yet.)
+
+* Add support for multipart requests.
+
+  Currently we always report ``OFPBRC_MULTIPART_BUFFER_OVERFLOW``.
+
+  (optional for OF1.3+)
+
+* IPv6 extension header handling support.
+
+  Fully implementing this requires kernel support.  This likely will take some
+  careful and probably time-consuming design work.  The actual coding, once
+  that is all done, is probably 2 or 3 days' work.
+
+  (optional for OF1.3+)
+
+* Per-flow meters.
+
+  OpenFlow protocol support is now implemented.  Support for the special
+  ``OFPM_SLOWPATH`` and ``OFPM_CONTROLLER`` meters is missing.  Support for
+  the software switch is under review.
+
+  (optional for OF1.3+)
+
+* Auxiliary connections.
+
+  An implementation in generic code might be a week's worth of work.  The value
+  of an implementation in generic code is questionable, though, since much of
+  the benefit of auxiliary connections is supposed to be to take advantage of
+  hardware support.  (We could make the kernel module somehow send packets
+  across the auxiliary connections directly, for some kind of "hardware"
+  support, if we judged it useful enough.)
+
+  (optional for OF1.3+)
+
+* Provider Backbone Bridge tagging.
+
+  I don't plan to implement this (but we'd accept an implementation).
+
+  (optional for OF1.3+)
+
+* On-demand flow counters.
+
+  I think this might be a real optimization in some cases for the software
+  switch.
+
+  (optional for OF1.3+)
+
+OpenFlow 1.4 & ONF Extensions for 1.3.X Pack1
+---------------------------------------------
+
+The following features are both defined as a set of ONF Extensions for 1.3 and
+integrated in 1.4.
+
+When defined as an ONF Extension for 1.3, the feature uses the Experimenter
+mechanism with the ONF Experimenter ID.
+
+When integrated in 1.4, the feature uses the standard OpenFlow structures (for
+example, those defined in openflow-1.4.h).
+
+The two definitions for each feature are independent and can exist in parallel
+in OVS.
+
+
+* Flow entry notifications
+
+  This seems to be modelled after OVS's NXST_FLOW_MONITOR.  (Simon Horman is
+  working on this.)
+
+  (EXT-187)
+
+  (optional for OF1.4+)
+
+* Role Status
+
+  Already implemented as a 1.4 feature.
+
+  (EXT-191)
+
+  (required for OF1.4+)
+
+* Flow entry eviction
+
+  OVS has flow eviction functionality.  ``table_mod OFPTC_EVICTION``,
+  ``flow_mod 'importance'``, and ``table_desc ofp_table_mod_prop_eviction``
+  need to be implemented.
+
+  (EXT-192-e)
+
+  (optional for OF1.4+)
+
+* Vacancy events
+
+  (EXT-192-v)
+
+  (optional for OF1.4+)
+
+* Bundle
+
+  Transactional modification.  OpenFlow 1.4 requires support for
+  ``flow_mods`` and ``port_mods`` in a bundle if bundles are supported.
+  (Not related to OVS's 'ofbundle' stuff.)
+
+  Implemented as an OpenFlow 1.4 feature.  Only flow_mods and port_mods are
+  supported in a bundle.  If the bundle includes port mods, it may not specify
+  the ``OFPBF_ATOMIC`` flag.  Nevertheless, port mods and flow mods in a bundle
+  are always applied in order and consecutive flow mods between port mods are
+  made available to lookups atomically.
+
+  (EXT-230)
+
+  (optional for OF1.4+)
+
+* Table synchronisation
+
+  Probably not so useful to the software switch.
+
+  (EXT-232)
+
+  (optional for OF1.4+)
+
+* Group and Meter change notifications
+
+  (EXT-235)
+
+  (optional for OF1.4+)
+
+* Bad flow entry priority error
+
+  Probably not so useful to the software switch.
+
+  (EXT-236)
+
+  (optional for OF1.4+)
+
+* Set async config error
+
+  (EXT-237)
+
+  (optional for OF1.4+)
+
+* PBB UCA header field
+
+  See comment on Provider Backbone Bridge in section about OpenFlow 1.3.
+
+  (EXT-256)
+
+  (optional for OF1.4+)
+
+* Multipart timeout error
+
+  (EXT-264)
+
+  (required for OF1.4+)
+
+OpenFlow 1.4 only
+-----------------
+
+These features are only available in OpenFlow 1.4.  Other OpenFlow 1.4
+features are listed in the previous section.
+
+* More extensible wire protocol
+
+  Many on-wire structures got TLVs.
+
+  All required features are now supported.
+  Remaining optional: table desc, table-status
+
+  (EXT-262)
+
+  (required for OF1.4+)
+
+* More descriptive reasons for packet-in
+
+  Distinguish ``OFPR_APPLY_ACTION``, ``OFPR_ACTION_SET``, ``OFPR_GROUP``,
+  ``OFPR_PACKET_OUT``.  ``NO_MATCH`` was renamed to ``OFPR_TABLE_MISS``.
+  (OFPR_ACTION_SET and OFPR_GROUP are now supported)
+
+  (EXT-136)
+
+  (required for OF1.4+)
+
+* Optical port properties
+
+  (EXT-154)
+
+  (optional for OF1.4+)
+
+OpenFlow 1.5 & ONF Extensions for 1.3.X Pack2
+---------------------------------------------
+
+The following features are both defined as a set of ONF Extensions for 1.3 and
+integrated in 1.5. Note that this list is not definitive as those are not yet
+published.
+
+When defined as an ONF Extension for 1.3, the feature uses the Experimenter
+mechanism with the ONF Experimenter ID.  When integrated in 1.5, the feature
+uses the standard OpenFlow structures (for example, those defined in
+openflow-1.5.h).
+
+The two definitions for each feature are independent and can exist in parallel
+in OVS.
+
+* Time scheduled bundles
+
+  (EXT-340)
+
+  (optional for OF1.5+)
+
+OpenFlow 1.5 only
+-----------------
+
+These features are only available in OpenFlow 1.5.  Other OpenFlow 1.5
+features are listed in the previous section.  Note that this list is not
+definitive as OpenFlow 1.5 is not yet published.
+
+* Egress Tables
+
+  (EXT-306)
+
+  (optional for OF1.5+)
+
+* Packet Type aware pipeline
+
+  Prototype for OVS was done during specification.
+
+  (EXT-112)
+
+  (optional for OF1.5+)
+
+* Extensible Flow Entry Statistics
+
+  (EXT-334)
+
+  (required for OF1.5+)
+
+* Flow Entry Statistics Trigger
+
+  (EXT-335)
+
+  (optional for OF1.5+)
+
+* Controller connection status
+
+  Prototype for OVS was done during specification.
+
+  (EXT-454)
+
+  (optional for OF1.5+)
+
+* Meter action
+
+  (EXT-379)
+
+  (required for OF1.5+ if metering is supported)
+
+* Enable setting all pipeline fields in packet-out
+
+  Prototype for OVS was done during specification.
+
+  (EXT-427)
+
+  (required for OF1.5+)
+
+* Port properties for pipeline fields
+
+  Prototype for OVS was done during specification.
+
+  (EXT-388)
+
+  (optional for OF1.5+)
+
+* Port property for recirculation
+
+  Prototype for OVS was done during specification.
+
+  (EXT-399)
+
+  (optional for OF1.5+)
+
+General
+-------
+
+* ovs-ofctl(8) often lists as Nicira extensions features that later OpenFlow
+  versions support in standard ways.
+
+How to contribute
+-----------------
+
+If you plan to contribute code for a feature, please let everyone know on
+ovs-dev before you start work.  This will help avoid duplicating work.
+
+Consider the following:
+
+* Testing.
+
+  Please test your code.
+
+* Unit tests.
+
+  Consider writing some.  The tests directory has many examples that you can
+  use as a starting point.
+
+* ovs-ofctl.
+
+  If you add a feature that is useful for some ovs-ofctl command then you
+  should add support for it there.
+
+* Documentation.
+
+  If you add a user-visible feature, then you should document it in the
+  appropriate manpage and mention it in NEWS as well.
+
+Refer to :doc:`/internals/contributing/index` for more information.
diff --git a/Documentation/topics/ovsdb-replication.rst b/Documentation/topics/ovsdb-replication.rst

new file mode 100644 (file)

index 0000000..fbf5a9b
--- /dev/null
+++ b/Documentation/topics/ovsdb-replication.rst
@@ -0,0 +1,172 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+================================
+OVSDB Replication Implementation
+================================
+
+Given two Open vSwitch databases with the same schema, OVSDB replication keeps
+these databases in the same state, i.e. each of the databases has the same
+contents at any given time even if they are not running in the same host.  This
+document elaborates on the implementation details to provide this
+functionality.
+
+Terminology
+-----------
+
+Source of truth database
+  database whose content will be replicated to another database.
+
+Active server
+  ovsdb-server providing RPC interface to the source of truth database.
+
+Standby server
+  ovsdb-server providing RPC interface to the database that is not the source
+  of truth.
+
+Design
+------
+
+The overall design of replication consists of one ovsdb-server (active server)
+communicating the state of its databases to another ovsdb-server (standby
+server) so that the latter keeps its own databases in that same state.  To
+achieve this, the standby server acts as a client of the active server, in the
+sense that it sends a monitor request to keep up to date with the changes in
+the active server databases. When a notification from the active server
+arrives, the standby server executes the necessary set of operations so its
+databases reach the same state as the active server databases. Below is the
+design represented as a diagram::
+
+    +--------------+    replication     +--------------+
+    |    Active    |<-------------------|   Standby    |
+    | OVSDB-server |                    | OVSDB-server |
+    +--------------+                    +--------------+
+            |                                  |
+            |                                  |
+        +-------+                          +-------+
+        |  SoT  |                          |       |
+        | OVSDB |                          | OVSDB |
+        +-------+                          +-------+
+
+Setting Up The Replication
+--------------------------
+
+To initiate the replication process, the standby server must be executed
+indicating the location of the active server via the command line option
+``--sync-from=server``, where server can take any form described in the
+ovsdb-client manpage and it must specify an active connection type (tcp, unix,
+ssl). This option will cause the standby server to attempt to send a monitor
+request to the active server in every main loop iteration, until the active
+server responds.
+
+When sending a monitor request the standby server is doing the following:
+
+1. Erase the content of the databases for which it is providing an RPC
+   interface.
+
+2. Open the jsonrpc channel to communicate with the active server.
+
+3. Fetch all the databases located in the active server.
+
+4. For each database with the same schema in both the active and standby
+   servers: construct and send a monitor request message specifying the tables
+   that will be monitored (i.e. all the tables on the database except the ones
+   blacklisted [*]).
+
+5. Set the standby database to the current state of the active database.
+
+Once the monitor request message is sent, the standby server will continuously
+receive notifications of changes occurring to the tables specified in the
+request. The process of handling these notifications is detailed in the next
+section.
+
+[*] A set of tables that will be excluded from replication can be configured as
+a blacklist of tables via the command line option
+``--sync-exclude-tables=db:table[,db:table]...``, where db corresponds to the
+database where the table resides.
+
+Replication Process
+-------------------
+
+The replication process consists of handling the update notifications received
+in the standby server caused by the monitor request that was previously sent to
+the active server. In every loop iteration, the standby server attempts to
+receive a message from the active server which can be an error, an echo message
+(used to keep the connection alive) or an update notification. In case the
+message is a fatal error, the standby server will disconnect from the active
+without dropping the replicated data. If it is an echo message, the standby
+server will reply with an echo message as well. If the message is an update
+notification, the following process occurs:
+
+1. Create a new transaction.
+
+2. Get the ``<table-updates>`` object from the ``params`` member of the
+   notification.
+
+3. For each ``<table-update>`` in the ``<table-updates>`` object do:
+
+    1. For each ``<row-update>`` in ``<table-update>`` check what kind of
+       operation should be executed according to the following criteria
+       about the presence of the object members:
+
+       - If ``old`` member is not present, execute an insert operation using
+         ``<row>`` from the ``new`` member.
+
+       - If ``old`` member is present and ``new`` member is not present,
+         execute a delete operation using ``<row>`` from the ``old`` member.
+
+       - If both ``old`` and ``new`` members are present, execute an update
+         operation using ``<row>`` from the ``new`` member.
+
+4. Commit the transaction.
+
+   If an error occurs during the replication process, all replication is
+   restarted by resending a new monitor request as described in the section
+   "Setting up the replication".
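+
+The criteria in step 3 above can be sketched as a small C function.  This is an
+illustration only: ``enum row_op`` and ``classify_row_update()`` are
+hypothetical names that do not come from the ovsdb-server sources, and treating
+a row update with neither member present as invalid is an assumption of the
+sketch.
+
+::
+
+    #include <stdbool.h>
+
+    /* Hypothetical operation codes used only by this sketch. */
+    enum row_op {
+        ROW_OP_INSERT,
+        ROW_OP_DELETE,
+        ROW_OP_UPDATE,
+        ROW_OP_INVALID
+    };
+
+    /* Chooses the operation for one <row-update> from the presence of its
+     * "old" and "new" members. */
+    enum row_op
+    classify_row_update(bool has_old, bool has_new)
+    {
+        if (!has_old) {
+            /* No "old" member: the row is new on the active server. */
+            return has_new ? ROW_OP_INSERT : ROW_OP_INVALID;
+        }
+        /* "old" only: the row was deleted.  Both: the row was modified. */
+        return has_new ? ROW_OP_UPDATE : ROW_OP_DELETE;
+    }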
+
+Runtime Management Commands
+---------------------------
+
+Runtime management commands can be sent to a running standby server via
+ovs-appctl in order to configure the replication functionality. The available
+commands are the following.
+
+``ovsdb-server/set-remote-ovsdb-server {server}``
+  sets the name of the active server
+
+``ovsdb-server/get-remote-ovsdb-server``
+  gets the name of the active server
+
+``ovsdb-server/connect-remote-ovsdb-server``
+  causes the server to attempt to send a monitor request every main loop
+  iteration
+
+``ovsdb-server/disconnect-remote-ovsdb-server``
+  closes the jsonrpc channel to the active server and frees the memory
+  used for the replication configuration.
+
+``ovsdb-server/set-sync-exclude-tables {db:table,...}``
+  sets the tables list that will be excluded from being replicated
+
+``ovsdb-server/get-sync-excluded-tables``
+  gets the tables list that is currently excluded from replication
diff --git a/Documentation/topics/porting.rst b/Documentation/topics/porting.rst

new file mode 100644 (file)

index 0000000..b327b2b
--- /dev/null
+++ b/Documentation/topics/porting.rst
@@ -0,0 +1,329 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+================================================
+Porting Open vSwitch to New Software or Hardware
+================================================
+
+Open vSwitch (OVS) is intended to be easily ported to new software and hardware
+platforms.  This document describes the types of changes that are most likely
+to be necessary in porting OVS to Unix-like platforms.  (Porting OVS to other
+kinds of platforms is likely to be more difficult.)
+
+Vocabulary
+----------
+
+For historical reasons, different words are used for essentially the same
+concept in different areas of the Open vSwitch source tree.  Here is a
+concordance, indexed by the area of the source tree:
+
+::
+
+    datapath/       vport           ---
+    vswitchd/       iface           port
+    ofproto/        port            bundle
+    ofproto/bond.c  slave           bond
+    lib/lacp.c      slave           lacp
+    lib/netdev.c    netdev          ---
+    database        Interface       Port
+
+Open vSwitch Architectural Overview
+-----------------------------------
+
+The following diagram shows the very high-level architecture of Open vSwitch
+from a porter's perspective.
+
+::
+
+    +-------------------+
+    |    ovs-vswitchd   |<-->ovsdb-server
+    +-------------------+
+    |      ofproto      |<-->OpenFlow controllers
+    +--------+-+--------+
+    | netdev | | ofproto|
+    +--------+ |provider|
+    | netdev | +--------+
+    |provider|
+    +--------+
+
+Some of the components are generic.  Modulo bugs or inadequacies, these
+components should not need to be modified as part of a port:
+
+ovs-vswitchd
+  The main Open vSwitch userspace program, in vswitchd/.  It reads the desired
+  Open vSwitch configuration from the ovsdb-server program over an IPC channel
+  and passes this configuration down to the "ofproto" library.  It also passes
+  certain status and statistical information from ofproto back into the
+  database.
+
+ofproto
+  The Open vSwitch library, in ofproto/, that implements an OpenFlow switch.
+  It talks to OpenFlow controllers over the network and to switch hardware or
+  software through an "ofproto provider", explained further below.
+
+netdev
+  The Open vSwitch library, in lib/netdev.c, that abstracts interacting with
+  network devices, that is, Ethernet interfaces.  The netdev library is a thin
+  layer over "netdev provider" code, explained further below.
+
+The other components may need attention during a port.  You will almost
+certainly have to implement a "netdev provider".  Depending on the type of port
+you are doing and the desired performance, you may also have to implement an
+"ofproto provider" or a lower-level component called a "dpif" provider.
+
+The following sections talk about these components in more detail.
+
+Writing a netdev Provider
+-------------------------
+
+A "netdev provider" implements an operating system and hardware specific
+interface to "network devices", e.g. eth0 on Linux.  Open vSwitch must be able
+to open each port on a switch as a netdev, so you will need to implement a
+"netdev provider" that works with your switch hardware and software.
+
+``struct netdev_class``, in ``lib/netdev-provider.h``, defines the interfaces
+required to implement a netdev.  That structure contains many function
+pointers, each of which has a comment that is meant to describe its behavior in
+detail.  If the requirements are unclear, report this as a bug.
+
+The netdev interface can be divided into a few rough categories:
+
+- Functions required to properly implement OpenFlow features.  For example,
+  OpenFlow requires the ability to report the Ethernet hardware address of a
+  port.  These functions must be implemented for minimally correct operation.
+
+- Functions required to implement optional Open vSwitch features.  For example,
+  the Open vSwitch support for in-band control requires netdev support for
+  inspecting the TCP/IP stack's ARP table.  These functions must be implemented
+  if the corresponding OVS features are to work, but may be omitted initially.
+
+- Functions needed in some implementations but not in others.  For example,
+  most kinds of ports (see below) do not need functionality to receive packets
+  from a network device.
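+
+The split between required and optional functions can be pictured as a table of
+function pointers in which optional entries may be left null.  The sketch below
+is not ``struct netdev_class``: ``struct toy_netdev_class``, its members and
+``toy_arp_lookup()`` are hypothetical, and returning ``EOPNOTSUPP`` when a
+function is missing is an assumption made for illustration.
+
+::
+
+    #include <errno.h>
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    /* Hypothetical, much-reduced provider interface. */
+    struct toy_netdev_class {
+        const char *type;
+        /* Required: report the Ethernet address of the device. */
+        int (*get_etheraddr)(const char *name, uint8_t mac[6]);
+        /* Optional: look up an ARP entry.  May be NULL. */
+        int (*arp_lookup)(const char *name, uint32_t ip, uint8_t mac[6]);
+    };
+
+    static int
+    dummy_get_etheraddr(const char *name, uint8_t mac[6])
+    {
+        static const uint8_t addr[6] = { 0x02, 0, 0, 0, 0, 0x01 };
+
+        (void) name;
+        memcpy(mac, addr, sizeof addr);
+        return 0;
+    }
+
+    /* A minimal provider that implements only the required function. */
+    const struct toy_netdev_class toy_dummy_class = {
+        .type = "toy-dummy",
+        .get_etheraddr = dummy_get_etheraddr,
+        .arp_lookup = NULL,
+    };
+
+    /* Generic wrapper: fails cleanly when the provider omits the function. */
+    int
+    toy_arp_lookup(const struct toy_netdev_class *nc, const char *name,
+                   uint32_t ip, uint8_t mac[6])
+    {
+        return nc->arp_lookup ? nc->arp_lookup(name, ip, mac) : EOPNOTSUPP;
+    }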
+
+The existing netdev implementations may serve as useful examples during a port:
+
+- lib/netdev-linux.c implements netdev functionality for Linux network devices,
+  using Linux kernel calls.  It may be a good place to start for full-featured
+  netdev implementations.
+
+- lib/netdev-vport.c provides support for "virtual ports" implemented by the
+  Open vSwitch datapath module for the Linux kernel.  This may serve as a model
+  for minimal netdev implementations.
+
+- lib/netdev-dummy.c is a fake netdev implementation useful only for testing.
+
+.. _porting strategies:
+
+Porting Strategies
+------------------
+
+After a netdev provider has been implemented for a system's network devices,
+you may choose among three basic porting strategies.
+
+The lowest-effort strategy is to use the "userspace switch" implementation
+built into Open vSwitch.  This ought to work, without writing any more code, as
+long as the netdev provider that you implemented supports receiving packets.
+It yields poor performance, however, because every packet passes through the
+ovs-vswitchd process. Refer to :doc:`/intro/install/userspace` for instructions
+on how to configure a userspace switch.
+
+If the userspace switch is not the right choice for your port, then you will
+have to write more code.  You may implement either an "ofproto provider" or a
+"dpif provider".  Which you should choose depends on a few different factors:
+
+* Only an ofproto provider can take full advantage of hardware with built-in
+  support for wildcards (e.g. an ACL table or a TCAM).
+
+* A dpif provider can take advantage of the Open vSwitch built-in
+  implementations of bonding, LACP, 802.1ag, 802.1Q VLANs, and other features.
+  An ofproto provider has to provide its own implementations, if the hardware
+  can support them at all.
+
+* A dpif provider is usually easier to implement, but most appropriate for
+  software switching.  It "explodes" wildcard rules into exact-match entries
+  (with an optional wildcard mask).  This allows fast hash lookups in software,
+  but makes inefficient use of TCAMs in hardware that support wildcarding.
+
+The following sections describe how to implement each kind of port.
+
+ofproto Providers
+-----------------
+
+An "ofproto provider" is what ofproto uses to directly monitor and control an
+OpenFlow-capable switch.  ``struct ofproto_class``, in
+``ofproto/ofproto-provider.h``, defines the interfaces to implement an ofproto
+provider for new hardware or software.  That structure contains many function
+pointers, each of which has a comment that is meant to describe its behavior in
+detail.  If the requirements are unclear, report this as a bug.
+
+The ofproto provider interface is preliminary.  Let us know if it seems
+unsuitable for your purpose.  We will try to improve it.
+
+Writing a dpif Provider
+-----------------------
+
+Open vSwitch has a built-in ofproto provider named "ofproto-dpif", which is
+built on top of a library for manipulating datapaths, called "dpif".  A
+"datapath" is a simple flow table, one that is only required to support
+exact-match flows, that is, flows without wildcards.  When a packet arrives on
+a network device, the datapath looks for it in this table.  If there is a
+match, then it performs the associated actions.  If there is no match, the
+datapath passes the packet up to ofproto-dpif, which maintains the full
+OpenFlow flow table.  If the packet matches in this flow table, then
+ofproto-dpif executes its actions and inserts a new entry into the dpif flow
+table.  (Otherwise, ofproto-dpif passes the packet up to ofproto to send the
+packet to the OpenFlow controller, if one is configured.)
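+
+The lookup, miss and install sequence described above can be sketched as
+follows.  Every name here (``struct toy_key``, ``struct toy_datapath``,
+``toy_slow_path()``, ``toy_process()``) is hypothetical, the slow path is a
+stand-in that forwards every packet to port 2, and a linear scan replaces the
+hash lookup a real datapath would use.
+
+::
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    #define TOY_MAX_FLOWS 16
+
+    struct toy_key { uint32_t in_port; uint32_t dst; };  /* Exact match. */
+    struct toy_flow { struct toy_key key; uint32_t out_port; };
+
+    struct toy_datapath {
+        struct toy_flow flows[TOY_MAX_FLOWS];
+        size_t n_flows;
+        unsigned int n_misses;
+    };
+
+    /* Stand-in for the full OpenFlow table kept above the datapath. */
+    static bool
+    toy_slow_path(const struct toy_key *key, uint32_t *out_port)
+    {
+        (void) key;
+        *out_port = 2;
+        return true;
+    }
+
+    /* Returns true and sets '*out_port' if the packet can be forwarded. */
+    bool
+    toy_process(struct toy_datapath *dp, const struct toy_key *key,
+                uint32_t *out_port)
+    {
+        for (size_t i = 0; i < dp->n_flows; i++) {
+            const struct toy_flow *flow = &dp->flows[i];
+
+            if (flow->key.in_port == key->in_port
+                && flow->key.dst == key->dst) {
+                *out_port = flow->out_port;     /* Hit in the datapath. */
+                return true;
+            }
+        }
+
+        dp->n_misses++;                         /* Miss: consult slow path. */
+        if (!toy_slow_path(key, out_port)) {
+            return false;                       /* Left to the controller. */
+        }
+        if (dp->n_flows < TOY_MAX_FLOWS) {      /* Install for next time. */
+            dp->flows[dp->n_flows].key = *key;
+            dp->flows[dp->n_flows].out_port = *out_port;
+            dp->n_flows++;
+        }
+        return true;
+    }
+
+In this sketch, only the first packet of a flow reaches the slow path; later
+packets with the same key are handled by the installed entry.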
+
+When calculating the dpif flow, ofproto-dpif generates an exact-match flow that
+describes the missed packet.  It makes an effort to figure out what fields can
+be wildcarded based on the switch's configuration and OpenFlow flow table.  The
+dpif is free to ignore the suggested wildcards and only support the exact-match
+entry.  However, if the dpif supports wildcarding, then it can use the masks to
+match multiple flows with fewer entries and potentially significantly reduce
+the number of flow misses handled by ofproto-dpif.
+
+The "dpif" library in turn delegates much of its functionality to a "dpif
+provider".  The following diagram shows how dpif providers fit into the Open
+vSwitch architecture:
+
+::
+
+
+    Architecture
+
+               _
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+  _
+              |   | netdev | |ofproto-|   |
+    userspace |   +--------+ |  dpif  |   |
+              |   | netdev | +--------+   |
+              |   |provider| |  dpif  |   |
+              |   +---||---+ +--------+   |
+              |       ||     |  dpif  |   | implementation of
+              |       ||     |provider|   | ofproto provider
+              |_      ||     +---||---+   |
+                      ||         ||       |
+               _  +---||-----+---||---+   |
+              |   |          |datapath|   |
+       kernel |   |          +--------+  _|
+              |   |                   |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+``struct dpif_class``, in ``lib/dpif-provider.h``, defines the interfaces
+required to implement a dpif provider for new hardware or software.  That
+structure contains many function pointers, each of which has a comment that is
+meant to describe its behavior in detail.  If the requirements are unclear,
+report this as a bug.
+
+There are two existing dpif implementations that may serve as useful examples
+during a port:
+
+* lib/dpif-netlink.c is a Linux-specific dpif implementation that talks to an
+  Open vSwitch-specific kernel module (whose sources are in the "datapath"
+  directory).  The kernel module performs all of the switching work, passing
+  packets that do not match any flow table entry up to userspace.  This dpif
+  implementation is essentially a wrapper around calls into the kernel module.
+
+* lib/dpif-netdev.c is a generic dpif implementation that performs all
+  switching internally.  This is how the Open vSwitch userspace switch is
+  implemented.
+
+Miscellaneous Notes
+-------------------
+
+Open vSwitch source code uses ``uint16_t``, ``uint32_t``, and ``uint64_t`` as
+fixed-width types in host byte order, and ``ovs_be16``, ``ovs_be32``, and
+``ovs_be64`` as fixed-width types in network byte order.  Each of the latter is
+equivalent to one of the former, but the difference in name makes the
+intended use obvious.
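+
+As a brief illustration of this convention, the sketch below converts a port
+number between the two representations.  The ``ovs_be16`` typedef is a local
+stand-in so that the fragment compiles on its own (in the tree the type comes
+from the Open vSwitch headers), and ``port_to_wire()`` and ``port_from_wire()``
+are hypothetical helper names.
+
+::
+
+    #include <arpa/inet.h>
+    #include <stdint.h>
+
+    typedef uint16_t ovs_be16;      /* Stand-in: network byte order. */
+
+    /* Host byte order in, network byte order out. */
+    ovs_be16
+    port_to_wire(uint16_t host_port)
+    {
+        return htons(host_port);
+    }
+
+    /* Network byte order in, host byte order out. */
+    uint16_t
+    port_from_wire(ovs_be16 wire_port)
+    {
+        return ntohs(wire_port);
+    }
+
+Because both types are plain 16-bit integers to the compiler, the distinct
+names are what signal to a reader which conversion a value still needs.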
+
+The default "fail-mode" for Open vSwitch bridges is "standalone", meaning that,
+when the OpenFlow controllers cannot be contacted, Open vSwitch acts as a
+regular MAC-learning switch.  This works well in virtualization environments
+where there is normally just one uplink (either a single physical interface or
+a bond).  In a more general environment, it can create loops.  So, if you are
+porting to a general-purpose switch platform, you should consider changing the
+default "fail-mode" to "secure", which does not behave this way.  See
+documentation for the "fail-mode" column in the Bridge table in
+ovs-vswitchd.conf.db(5) for more information.
+
+``lib/entropy.c`` assumes that it can obtain high-quality random number seeds
+at startup by reading from /dev/urandom.  You will need to modify it if this is
+not true on your platform.
+
+``vswitchd/system-stats.c`` only knows how to obtain some statistics on Linux.
+Optionally you may implement them for your platform as well.
+
+Why OVS Does Not Support Hybrid Providers
+-----------------------------------------
+
+The `porting strategies`_ section above describes the "ofproto provider" and
+"dpif provider" porting strategies.  Only an ofproto provider can take
+advantage of hardware TCAM support, and only a dpif provider can take advantage
+of the OVS built-in implementations of various features.  It is therefore
+tempting to suggest a hybrid approach that shares the advantages of both
+strategies.
+
+However, Open vSwitch does not support a hybrid approach.  Doing so may be
+possible, with a significant amount of extra development work, but it does not
+yet seem worthwhile, for the reasons explained below.
+
+First, user surprise is likely when a switch supports a feature only with a
+high performance penalty.  For example, one user questioned why adding a
+particular OpenFlow action to a flow caused a 1,058x slowdown on a hardware
+OpenFlow implementation [1]_.  The action required the flow to be implemented in
+software.
+
+Given that implementing a flow in software on the slow management CPU of a
+hardware switch causes a major slowdown, software-implemented flows would only
+make sense for very low-volume traffic.  But many of the features built into
+the OVS software switch implementation would need to apply to every flow to be
+useful.  There is no value, for example, in applying bonding or 802.1Q VLAN
+support only to low-volume traffic.
+
+Besides supporting features of OpenFlow actions, a hybrid approach could also
+support forms of matching not supported by particular switching hardware, by
+sending all packets that might match a rule to software.  But again this can
+cause an unacceptable slowdown by forcing bulk traffic through software in the
+hardware switch's slow management CPU.  Consider, for example, a hardware
+switch that can match on the IPv6 Ethernet type but not on fields in IPv6
+headers.  An OpenFlow table that matched on the IPv6 Ethernet type would
+perform well, but adding a rule that matched only UDPv6 would force every IPv6
+packet to software, slowing down not just UDPv6 but all IPv6 processing.
+
+.. [1] Aaron Rosen, "Modify packet fields extremely slow",
+    openflow-discuss mailing list, June 26, 2011, archived at
+    https://mailman.stanford.edu/pipermail/openflow-discuss/2011-June/002386.html.
+
+Questions
+---------
+
+Direct porting questions to dev@openvswitch.org.  We will try to use questions
+to improve this porting guide.
diff --git a/Documentation/topics/windows.rst b/Documentation/topics/windows.rst

new file mode 100644 (file)

index 0000000..81c1da5
--- /dev/null
+++ b/Documentation/topics/windows.rst
@@ -0,0 +1,510 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+=====================
+OVS-on-Hyper-V Design
+=====================
+
+This document provides details of the effort to develop Open vSwitch on
+Microsoft Hyper-V. This document should give enough information to understand
+the overall design.
+
+.. note::
+  The userspace portion of the OVS has been ported to Hyper-V in a separate
+  effort, and committed to the openvswitch repo. This document focuses mostly
+  on the kernel driver, though we touch upon some of the aspects of userspace
+  as well.
+
+Background Info
+---------------
+
+Microsoft’s hypervisor solution, Hyper-V [1]_, implements a virtual switch
+that is extensible and provides opportunities for other vendors to implement
+functional extensions [2]_. The extensions need to be implemented as NDIS
+drivers that bind within the extensible switch driver stack provided. The
+extensions can broadly provide the functionality of monitoring, modifying and
+forwarding packets to destination ports on the Hyper-V extensible switch.
+Correspondingly, the extensions can be categorized into the following types and
+provide the functionality noted:
+
+* Capturing extensions: monitoring packets
+
+* Filtering extensions: monitoring, modifying packets
+
+* Forwarding extensions: monitoring, modifying, forwarding packets
+
+As can be expected, the kernel portion (datapath) of OVS on Hyper-V solution
+will be implemented as a forwarding extension.
+
+In Hyper-V, the virtual machine is called the Child Partition. Each VIF or
+physical NIC on the Hyper-V extensible switch is attached via a port. Each port
+is on both the ingress path and the egress path of the switch. The ingress path
+is used for packets being sent out of a port, and the egress path is used for
+packets being received on a port. By design, NDIS provides a layered interface.
+In this layered interface, higher level layers call into lower level layers in
+the ingress path. In the egress path, it is the other way round. In addition,
+there is an object identifier (OID) interface for control operations, e.g. the
+addition of a port. The workflow for the calls is similar in nature to the
+packets, where higher level layers call into the lower level layers. A good
+representational diagram of this architecture is in [4]_.
+
+Windows Filtering Platform (WFP) [5]_ is a platform implemented on Hyper-V
+that provides APIs and services for filtering packets. WFP has been utilized to
+filter on some of the packets that OVS is not equipped to handle directly. More
+details are provided in later sections.
+
+IP Helper [6]_ is a set of APIs available on Hyper-V to retrieve information
+related to the network configuration of the host machine. IP Helper has been
+used to retrieve some of the configuration information that OVS needs.
+
+Design
+------
+
+::
+
+    Various blocks of the OVS Windows implementation
+
+                                      +-------------------------------+
+                                      |                               |
+                                      |        CHILD PARTITION        |
+                                      |                               |
+      +------+ +--------------+       | +-----------+  +------------+ |
+      |      | |              |       | |           |  |            | |
+      | ovs- | |     OVS-     |       | | Virtual   |  | Virtual    | |
+      | *ctl | |  USERSPACE   |       | | Machine #1|  | Machine #2 | |
+      |      | |    DAEMON    |       | |           |  |            | |
+      +------+-++---+---------+       | +--+------+-+  +----+------++ | +--------+
+      |  dpif-  |   | netdev- |       |    |VIF #1|         |VIF #2|  | |Physical|
+      | netlink |   | windows |       |    +------+         +------+  | |  NIC   |
+      +---------+   +---------+       |      ||                   /\  | +--------+
+    User     /\         /\            |      || *#1*         *#4* ||  |     /\
+    =========||=========||============+------||-------------------||--+     ||
+    Kernel   ||         ||                   \/                   ||  ||=====/
+             \/         \/                +-----+                 +-----+ *#5*
+     +-------------------------------+    |     |                 |     |
+     |   +----------------------+    |    |     |                 |     |
+     |   |   OVS Pseudo Device  |    |    |     |                 |     |
+     |   +----------------------+    |    |     |                 |     |
+     |      | Netlink Impl. |        |    |     |                 |     |
+     |      -----------------        |    |  I  |                 |     |
+     | +------------+                |    |  N  |                 |  E  |
+     | |  Flowtable | +------------+ |    |  G  |                 |  G  |
+     | +------------+ |  Packet    | |*#2*|  R  |                 |  R  |
+     |   +--------+   | Processing | |<=> |  E  |                 |  E  |
+     |   |   WFP  |   |            | |    |  S  |                 |  S  |
+     |   | Driver |   +------------+ |    |  S  |                 |  S  |
+     |   +--------+                  |    |     |                 |     |
+     |                               |    |     |                 |     |
+     |   OVS FORWARDING EXTENSION    |    |     |                 |     |
+     +-------------------------------+    +-----+-----------------+-----+
+                                          |HYPER-V Extensible Switch *#3|
+                                          +-----------------------------+
+                                                   NDIS STACK
+
+This diagram shows the various blocks involved in the OVS Windows
+implementation, along with some of the components available in the NDIS stack,
+and also the virtual machines. The workflow of a packet being transmitted from
+a VIF out and into another VIF and to a physical NIC is also shown. Later on in
+this section, we will discuss the flow of a packet at a high level.
+
+The figure gives a general idea of where the OVS userspace and the kernel
+components fit in, and how they interface with each other.
+
+The kernel portion (datapath) of the OVS on Hyper-V solution has been
+implemented as a forwarding extension roughly implementing the following
+sub-modules/functionality. Details of each of these sub-components in the
+kernel are contained in later sections:
+
+* Interfacing with the NDIS stack
+
+* Netlink message parser
+
+* Netlink sockets
+
+* Switch/Datapath management
+
+* Interfacing with userspace portion of the OVS solution to implement the
+  necessary functionality that userspace needs
+
+* Port management
+
+* Flowtable/Actions/packet forwarding
+
+* Tunneling
+
+* Event notifications
+
+The datapath for the OVS on Linux is a kernel module, and cannot be directly
+ported since there are significant differences in architecture even though the
+end functionality provided would be similar. Some examples of the differences
+are:
+
+* Interfacing with the NDIS stack to hook into the NDIS callbacks for
+  functionality such as receiving and sending packets, packet completions, OIDs
+  used for events such as a new port appearing on the virtual switch.
+
+* Interface between the userspace and the kernel module.
+
+* Event notifications are significantly different.
+
+* The communication interface between DPIF and the kernel module need not be
+  implemented in the way OVS on Linux does. That said, it would be advantageous
+  to have a similar interface to the kernel module for reasons of readability
+  and maintainability.
+
+* Any licensing issues of using Linux kernel code directly.
+
+Due to these differences, it was a straightforward decision to develop the
+datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.
+A re-development focused on the following goals:
+
+* Adhere to the existing requirements of userspace portion of OVS (such as
+  ovs-vswitchd), to minimize changes in the userspace workflow.
+
+* Fit well into the typical workflow of a Hyper-V extensible switch forwarding
+  extension.
+
+The userspace portion of the OVS solution is mostly POSIX code, and not very
+Linux specific. The majority of the userspace code does not interface directly
+with the kernel datapath and was ported independently of the kernel datapath
+effort.
+
+As explained in the OVS porting design document [7]_, DPIF is the portion of
+userspace that interfaces with the kernel portion of the OVS. The interface
+that each DPIF provider has to implement is defined in ``dpif-provider.h``
+[3]_.  Though each platform is allowed to have its own implementation of the
+DPIF provider, it was found, via community feedback, that it is desired to
+share code whenever possible. Thus, the DPIF provider for OVS on Hyper-V shares
+code with the DPIF provider on Linux. This interface is implemented in
+``dpif-netlink.c``.
+
+We'll elaborate more on kernel-userspace interface in a dedicated section
+below. Here it suffices to say that the DPIF provider implementation for
+Windows is netlink-based and shares code with the Linux one.
+
+Kernel Module (Datapath)
+------------------------
+
+Interfacing with the NDIS Stack
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For each virtual switch on Hyper-V, the OVS extensible switch extension can be
+enabled/disabled. We support enabling the OVS extension on only one switch.
+This is consistent with using a single datapath in the kernel on Linux. All the
+physical adapters are connected as external adapters to the extensible switch.
+
+When the OVS switch extension registers itself as a filter driver, it also
+registers callbacks for the switch/port management and datapath functions. In
+other words, when a switch is created on the Hyper-V root partition (host), the
+extension gets an activate callback upon which it can initialize the data
+structures necessary for OVS to function. Similarly, there are callbacks for
+when a port gets added to the Hyper-V switch, and an External Network adapter
+or a VM Network adapter is connected/disconnected to the port. There are also
+callbacks for when a VIF (NIC of a child partition) sends out a packet, or a
+packet is received on an external NIC.
+
+As shown in the figures, an extensible switch extension gets to see a packet
+sent by the VM (VIF) twice - once on the ingress path and once on the egress
+path. Forwarding decisions are to be made on the ingress path. Correspondingly,
+we will be hooking onto the following interfaces:
+
+* Ingress send indication: intercept packets for performing flow based
+  forwarding. This includes straight forwarding to output ports. Any packet
+  modifications needed to be performed are done here either inline or by
+  creating a new packet. A forwarding action is performed as the flow actions
+  dictate.
+
+* Ingress completion indication: cleanup and free packets that we generated on
+  the ingress send path, pass-through for packets that we did not generate.
+
+* Egress receive indication: pass-through.
+
+* Egress completion indication: pass-through.
+
+Interfacing with OVS Userspace
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have implemented a pseudo device interface for letting OVS userspace talk to
+the OVS kernel module. This is equivalent to the typical character device
+interface on POSIX platforms where we can register custom functions for read,
+write and ioctl functionality. The pseudo device supports a whole bunch of
+ioctls that netdev and DPIF on OVS userspace make use of.
+
+Netlink Message Parser
+~~~~~~~~~~~~~~~~~~~~~~
+
+The communication between OVS userspace and OVS kernel datapath is in the form
+of Netlink messages [1]_. More details about this are provided below.  In the
+kernel, a full fledged netlink message parser has been implemented along the
+lines of the netlink message parser in OVS userspace. In fact, a lot of the
+code is ported code.
+
+On the lines of ``struct ofpbuf`` in OVS userspace, a managed buffer has been
+implemented in the kernel datapath to make it easier to parse and construct
+netlink messages.
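+
+To make the message layout concrete, here is a rough sketch of building and
+parsing a netlink message with type-length-value attributes. It follows the
+generic Linux netlink framing (a 16-byte ``nlmsghdr`` followed by attributes
+padded to 4 bytes) and deliberately omits the generic netlink and OVS headers
+that real datapath messages carry. The helper names (``put_attr``,
+``build_msg``, ``parse_msg``) are made up for this example and are not the
+functions in the kernel datapath or in ``lib/netlink.c``.
+

```python
import struct

# nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid (host byte order)
NLMSG_HDR = struct.Struct("=IHHII")
# nla_len, nla_type
NLA_HDR = struct.Struct("=HH")
ALIGNTO = 4


def align(n):
    """Round n up to the netlink alignment boundary."""
    return (n + ALIGNTO - 1) & ~(ALIGNTO - 1)


def put_attr(attr_type, value):
    """Encode one attribute; nla_len excludes the trailing padding."""
    length = NLA_HDR.size + len(value)
    return NLA_HDR.pack(length, attr_type) + value + b"\0" * (align(length) - length)


def build_msg(msg_type, flags, seq, pid, attrs):
    body = b"".join(put_attr(t, v) for t, v in attrs)
    return NLMSG_HDR.pack(NLMSG_HDR.size + len(body), msg_type, flags, seq, pid) + body


def parse_msg(buf):
    length, msg_type, flags, seq, pid = NLMSG_HDR.unpack_from(buf, 0)
    attrs, off = [], NLMSG_HDR.size
    while off + NLA_HDR.size <= length:
        nla_len, nla_type = NLA_HDR.unpack_from(buf, off)
        if nla_len < NLA_HDR.size or off + nla_len > length:
            raise ValueError("malformed attribute")
        attrs.append((nla_type, buf[off + NLA_HDR.size:off + nla_len]))
        off += align(nla_len)  # skip padding to reach the next attribute
    return msg_type, flags, seq, pid, attrs
```

+
+A managed buffer such as ``struct ofpbuf`` plays the role that plain ``bytes``
+concatenation plays here: it tracks the used length so attributes can be
+appended and walked without manual offset arithmetic.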
+
+Netlink Sockets
+~~~~~~~~~~~~~~~
+
+On Linux, OVS userspace utilizes netlink sockets to pass back and forth netlink
+messages. Since much of userspace code including DPIF provider in
+dpif-netlink.c (formerly dpif-linux.c) has been reused, pseudo-netlink sockets
+have been implemented in OVS userspace. Windows lacks native netlink socket
+support, and its socket family is not extensible, so it is not possible to
+provide a native implementation of netlink sockets.
+We emulate netlink sockets in lib/netlink-socket.c and support all of the nl_*
+APIs to higher levels. The implementation opens a handle to the pseudo device
+for each netlink socket. Some more details on this topic are provided in the
+userspace section on netlink sockets.
+
+Typical netlink semantics of read message, write message, dump, and transaction
+have been implemented so that higher level layers are not affected by the
+netlink implementation not being native.
+
+Switch/Datapath Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As explained above, we hook onto the management callback functions in the NDIS
+interface for when to initialize the OVS data structures, flow tables etc. Some
+of this code is also driven by OVS userspace code which sends down ioctls for
+operations like creating a tunnel port etc.
+
+Port Management
+~~~~~~~~~~~~~~~
+
+As explained above, we hook onto the management callback functions in the NDIS
+interface to know when a port is added/connected to the Hyper-V switch. We use
+these callbacks to initialize the port related data structures in OVS. Also,
+some of the ports are tunnel ports that don’t exist on the Hyper-V switch and
+get added from OVS userspace.
+
+In order to identify a Hyper-V port, we use the value of 'FriendlyName' field
+in each Hyper-V port. We call this the "OVS-port-name". The idea is that OVS
+userspace sets 'OVS-port-name' in each Hyper-V port to the same value as the
+'name' field of the 'Interface' table in OVSDB. When OVS userspace calls into
+the kernel datapath to add a port, we match the name of the port with the
+'OVS-port-name' of a Hyper-V port.
+
+We maintain separate hash tables, and separate counters for ports that have
+been added from the Hyper-V switch, and for ports that have been added from OVS
+userspace.
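+
+The name matching and the two separate tables described above could look
+roughly like this. It is a sketch under stated assumptions: the class and
+method names are hypothetical, and raising an error when no Hyper-V port
+carries the requested name is an assumption, since the text does not say how
+the datapath handles that case.
+

```python
class PortTables:
    """Hypothetical model of the two port tables and their counters."""

    def __init__(self):
        self.hyperv_ports = {}  # OVS-port-name (FriendlyName) -> Hyper-V port id
        self.ovs_ports = {}     # OVSDB Interface name -> backing Hyper-V port id
        self.hyperv_count = 0
        self.ovs_count = 0

    def hyperv_port_added(self, port_id, friendly_name):
        # Driven by the NDIS port callbacks.
        self.hyperv_ports[friendly_name] = port_id
        self.hyperv_count += 1

    def userspace_port_added(self, name, is_tunnel=False):
        # Driven by OVS userspace asking the datapath to add a port.
        if is_tunnel:
            backing = None  # tunnel ports do not exist on the Hyper-V switch
        elif name in self.hyperv_ports:
            backing = self.hyperv_ports[name]
        else:
            # Assumption: an unmatched name is treated as an error.
            raise LookupError(f"no Hyper-V port named {name!r}")
        self.ovs_ports[name] = backing
        self.ovs_count += 1
        return backing
```
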
+
+Flowtable/Actions/Packet Forwarding
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The flowtable and flow actions based packet forwarding is the core of the OVS
+datapath functionality. For each packet on the ingress path, we consult the
+flowtable and execute the corresponding actions. The actions can be limited to
+simple forwarding to one or more destination ports or, more commonly, can
+involve modifying the packet to insert a tunnel context or a VLAN ID and
+then forwarding it to the external port to send the packet to a destination
+host.
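+
+As an illustration of the lookup-then-execute pattern, the following sketch
+keys a flowtable on two packet fields and applies a list of actions in order.
+The key fields, the action names (``push_vlan``, ``output``), and the
+dictionary-based packet are simplifications assumed for this example; the
+real datapath matches on a much larger flow key.
+

```python
def execute_actions(packet, actions):
    """Apply actions in order; each output sees the modifications before it."""
    pkt = dict(packet)  # work on a copy so the caller's packet is untouched
    outputs = []
    for kind, arg in actions:
        if kind == "push_vlan":
            pkt["vlan"] = arg
        elif kind == "output":
            outputs.append((arg, dict(pkt)))
        else:
            raise ValueError(f"unsupported action: {kind}")
    return outputs


def process_ingress(packet, flowtable):
    """Return (port, packet) pairs, or None when no flow matches."""
    key = (packet["in_port"], packet["eth_dst"])
    actions = flowtable.get(key)
    if actions is None:
        return None
    return execute_actions(packet, actions)
```

+
+Because actions run in order, a packet can be delivered untagged to a local
+VIF and tagged to the external port from the same action list.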
+
+Tunneling
+~~~~~~~~~
+
+We make use of the Internal Port on a Hyper-V switch for implementing
+tunneling. The Internal Port is a virtual adapter that is exposed on the
+Hyper-V host, and connected to the Hyper-V switch. Basically, it is an
+interface between the host and the virtual switch. The Internal Port acts as
+the Tunnel end point for the host (aka VTEP), and holds the VTEP IP address.
+
+Tunneling ports are not actual ports on the Hyper-V switch. These are virtual
+ports that OVS maintains and while executing actions, if the outport is a
+tunnel port, we short circuit by performing the encapsulation action based on
+the tunnel context. The encapsulated packet gets forwarded to the external
+port, and appears to the outside world as though it was sent from the VTEP.
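+
+The short circuit on the transmit side can be sketched as below. The example
+assumes VXLAN as the tunnel type (the text here does not name the protocol)
+and builds only the 8-byte VXLAN header from RFC 7348; the outer Ethernet, IP,
+and UDP headers that carry the VTEP addresses are left out. The function and
+dictionary names are hypothetical.
+

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag in the first header byte


def vxlan_header(vni):
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits); word 2: VNI (24) + reserved (8).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def output(packet, out_port, tunnel_ports, external_port):
    """Redirect output to a tunnel port into encapsulation on the external port."""
    context = tunnel_ports.get(out_port)
    if context is None:
        return out_port, packet  # ordinary port: forward unchanged
    return external_port, vxlan_header(context["vni"]) + packet
```
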
+
+Similarly, when a tunneled packet enters OVS from the external port, we check
+whether it is bound for the internal port (VTEP). If it is, we short circuit
+the path and directly forward the inner packet to the destination port
+(mostly a VIF, but dictated by the flow). We leverage the Windows Filtering
+Platform (WFP) framework to be able to receive tunneled packets that cannot be
+decapsulated by OVS right away. Currently, fragmented IP packets fall into that
+category, and we leverage the code in the host IP stack to reassemble the
+packet, and then perform decapsulation on the reassembled packet.
+
+We'll also be using the IP helper library to provide us IP address and other
+information corresponding to the Internal port.
+
+Event Notifications
+~~~~~~~~~~~~~~~~~~~
+
+The pseudo device interface described above is also used for providing event
+notifications back to OVS userspace. A shared memory/overlapped IO model is
+used.
+
+Userspace Components
+~~~~~~~~~~~~~~~~~~~~
+
+The userspace portion of the OVS solution is mostly POSIX code, and not very
+Linux specific. The majority of the userspace code does not interface directly
+with the kernel datapath and was ported independently of the kernel datapath
+effort.
+
+In this section, we cover the userspace components that interface with the
+kernel datapath.
+
+As explained earlier, OVS on Hyper-V shares the DPIF provider implementation
+with Linux. The DPIF provider on Linux uses netlink sockets and netlink
+messages. Netlink sockets and messages are extensively used on Linux to
+exchange information between userspace and kernel. In order to satisfy these
+dependencies, netlink socket (pseudo and non-native) and netlink messages are
+implemented on Hyper-V.
+
+The following are the major advantages of sharing DPIF provider code:
+
+1. Maintenance is simpler:
+
+   Any change made to the interface defined in dpif-provider.h need not be
+   propagated to multiple implementations. Also, developers familiar with the
+   Linux implementation of the DPIF provider can easily ramp up on the Hyper-V
+   implementation as well.
+
+2. Netlink messages provide inherent advantages:
+
+   Netlink messages are known for their extensibility. Each message is
+   versioned, so the provided data structures offer a mechanism to perform
+   version checking and forward/backward compatibility with the kernel module.
+
+Netlink Sockets
+~~~~~~~~~~~~~~~
+
+As explained in other sections, an emulation of netlink sockets has been
+implemented in ``lib/netlink-socket.c`` for Windows. The implementation creates
+a handle to the OVS pseudo device, and emulates netlink socket semantics of
+receive message, send message, dump, and transact. Most of the ``nl_*``
+functions are supported.
+
+The fact that the implementation is non-native manifests in various ways.  One
+example is that PID for the netlink socket is not automatically assigned in
+userspace when a handle is created to the OVS pseudo device. There's an extra
+command (defined in ``OvsDpInterfaceExt.h``) that is used to grab the PID
+generated in the kernel.
+
+DPIF Provider
+~~~~~~~~~~~~~
+
+As has been mentioned in earlier sections, the netlink socket and netlink
+message based DPIF provider on Linux has been ported to Windows.
+
+Most of the code is common. Some divergence is in the code to receive packets.
+The Linux implementation uses epoll() which is not natively supported on
+Windows.
+
+netdev-windows
+~~~~~~~~~~~~~~
+
+We have a Windows implementation of the interface defined in
+``lib/netdev-provider.h``. The implementation provides functionality to get
+extended information about an interface. It is limited in functionality
+compared to the Linux implementation of the netdev provider and cannot be used
+to add any interfaces in the kernel such as a tap interface or to send/receive
+packets. The netdev-windows implementation uses the datapath interface
+extensions defined in ``datapath-windows/include/OvsDpInterfaceExt.h``.
+
+Powershell Extensions to Set ``OVS-port-name``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As explained in the section on "Port management", each Hyper-V port has a
+'FriendlyName' field, which we call the "OVS-port-name" field. We have
+implemented powershell command extensions to be able to set the "OVS-port-name"
+of a Hyper-V port.
+
+Kernel-Userspace Interface
+--------------------------
+
+openvswitch.h and OvsDpInterfaceExt.h
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the DPIF provider is shared with Linux, the kernel datapath provides the
+same interface as the Linux datapath. The interface is defined in
+``datapath/linux/compat/include/linux/openvswitch.h``. Derivatives of this
+interface file are created during OVS userspace compilation. The derivative for
+the kernel datapath on Hyper-V is provided in
+``datapath-windows/include/OvsDpInterface.h``.
+
+That said, there are Windows specific extensions that are defined in the
+interface file ``datapath-windows/include/OvsDpInterfaceExt.h``.
+
+Flow of a Packet
+----------------
+
+Figure 2 shows the numbered steps in which a packet gets sent out of a VIF and
+is forwarded to another VIF or a physical NIC. As mentioned earlier, each VIF
+is attached to the switch via a port, and each port is both on the ingress and
+egress path of the switch, and depending on whether a packet is being
+transmitted or received, one of the paths gets used. In the figure, each step n
+is annotated as ``#n``.
+
+The steps are as follows:
+
+1. When a packet is sent out of a VIF or a physical NIC or an internal port,
+   the packet is part of the ingress path.
+
+2. The OVS kernel driver gets to intercept this packet.
+
+   a. OVS looks up the flows in the flowtable for this packet, and executes the
+      corresponding action.
+
+   b. If there is no action, the packet is sent up to OVS userspace to examine
+      the packet and figure out the actions.
+
+   c. Userspace executes the packet by specifying the actions, and might also
+      insert a flow for such a packet in the future.
+
+   d. The destination ports are added to the packet, which is then sent down
+      to the Hyper-V switch.
+
+3. The Hyper-V switch forwards the packet to the destination ports specified
+   in the packet, and sends it out on the egress path.
+
+4. The packet gets forwarded to the destination VIF.
+
+5. It might also get forwarded to a physical NIC, if the physical NIC has been
+   added as a destination port by OVS.
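+
+Steps 2a through 2d can be modelled as a lookup with a slow path on a miss.
+This is only a sketch: the ``upcall`` callable stands in for OVS userspace,
+the flow key is a plain tuple, and the names ``DatapathSketch`` and
+``learning_userspace`` are invented for the example.
+

```python
class DatapathSketch:
    """Hypothetical model of the kernel-side lookup and miss handling."""

    def __init__(self, upcall):
        self.flows = {}       # flow key -> list of (kind, arg) actions
        self.upcall = upcall  # stands in for OVS userspace
        self.misses = 0

    def ingress(self, flow_key):
        actions = self.flows.get(flow_key)            # step 2a
        if actions is None:
            self.misses += 1
            actions, install = self.upcall(flow_key)  # steps 2b and 2c
            if install:
                self.flows[flow_key] = actions        # later packets hit in 2a
        # Step 2d: the destination ports handed down to the Hyper-V switch.
        return [arg for kind, arg in actions if kind == "output"]


def learning_userspace(flow_key):
    """Toy userspace: always forward to a VIF and the external NIC."""
    return [("output", "vif2"), ("output", "external")], True
```

+
+Only the first packet of a flow pays for the trip to userspace; once a flow
+is installed, subsequent packets are handled entirely by the table lookup.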
+
+Build/Deployment
+----------------
+
+The userspace components added as part of OVS Windows implementation have been
+integrated with autoconf, and can be built using the steps mentioned in the
+BUILD.Windows file. Additional targets need to be specified to make.
+
+The OVS kernel code is part of a Visual Studio 2013 solution, and is compiled
+from the IDE. There are plans in the future to move this to a compilation mode
+such that we can compile it without an IDE as well.
+
+Once compiled, we have an install script that can be used to load the kernel
+driver.
+
+References
+----------
+
+.. [1] Hyper-V Extensible Switch http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
+.. [2] Hyper-V Extensible Switch Extensions http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
+.. [3] DPIF Provider http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-provider_8h_source.html
+.. [4] Hyper-V Extensible Switch Components http://msdn.microsoft.com/en-us/library/windows/hardware/hh598163(v=vs.85).aspx
+.. [5] Windows Filtering Platform http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx
+.. [6] IP Helper http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
+.. [7] How to Port Open vSwitch to New Software or Hardware http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
+.. [8] Netlink http://en.wikipedia.org/wiki/Netlink
+.. [9] epoll http://en.wikipedia.org/wiki/Epoll
diff --git a/FAQ.rst b/FAQ.rst

index c5ae62f4dea6719b0eb6ab4ef9697494145d243e..c072d10531605d9eba14e59c0f67058616dd61c6 100644 (file)
--- a/FAQ.rst
+++ b/FAQ.rst
@@ -79,8 +79,9 @@ Q: Does Open vSwitch only work on Linux?
  
  Q: What's involved with porting Open vSwitch to a new platform or switching ASIC?
  
-    A: The `porting document <PORTING.rst>`__ describes how one would go about
-    porting Open vSwitch to a new operating system or hardware platform.
+    A: The `porting document <Documentation/development-guide/porting.rst>`__
+    describes how one would go about porting Open vSwitch to a new operating
+    system or hardware platform.
  
  Q: Why would I use Open vSwitch instead of the Linux bridge?
  
@@ -1588,9 +1589,10 @@ Q: What versions of OpenFlow does Open vSwitch support?
      (Open vSwitch 2.2 had an experimental implementation of OpenFlow 1.4 that
      could cause crashes.  We don't recommend enabling it.)
  
-    The `OpenFlow guide <OPENFLOW.rst>`__ tracks support for OpenFlow 1.1 and
-    later features.  When support for OpenFlow 1.4 and 1.5 is solidly
-    implemented, Open vSwitch will enable those version by default.
+    The `OpenFlow guide <Documentation/development-guide/openflow.rst>`__
+    tracks support for OpenFlow 1.1 and later features.  When support for
+    OpenFlow 1.4 and 1.5 is solidly implemented, Open vSwitch will enable those
+    versions by default.
  
  Q: Does Open vSwitch support MPLS?
  
@@ -1651,8 +1653,8 @@ going through.
      greater than 65535 (the maximum priority that can be set with
      OpenFlow).
  
-    The DESIGN file at the top level of the Open vSwitch source
-    distribution describes the in-band model in detail.
+    The ``Documentation/topics/design`` doc describes the in-band model in
+    detail.
  
      If your controllers are not actually in-band (e.g. they are on
      localhost via 127.0.0.1, or on a separate network), then you should
diff --git a/IntegrationGuide.rst b/IntegrationGuide.rst

deleted file mode 100644 (file)

index 11e77ba..0000000
--- a/IntegrationGuide.rst
+++ /dev/null
@@ -1,264 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-=========================================
-Integration Guide for Centralized Control
-=========================================
-
-This document describes how to integrate Open vSwitch onto a new platform to
-expose the state of the switch and attached devices for centralized control.
-(If you are looking to port the switching components of Open vSwitch to a new
-platform, please see the PORTING document.)  The focus of this guide is on
-hypervisors, but many of the interfaces are useful for hardware switches, as
-well.  The XenServer integration is the most mature implementation, so most of
-the examples are drawn from it.
-
-The externally visible interface to this integration is platform-agnostic.  We
-encourage anyone who integrates Open vSwitch to use the same interface, because
-keeping a uniform interface means that controllers require less customization
-for individual platforms (and perhaps no customization at all).
-
-Integration centers around the Open vSwitch database and mostly involves the
-``external_ids`` columns in several of the tables.  These columns are not
-interpreted by Open vSwitch itself.  Instead, they provide information to a
-controller that permits it to associate a database record with a more
-meaningful entity.  In contrast, the ``other_config`` column is used to
-configure behavior of the switch.  The main job of the integrator, then, is to
-ensure that these values are correctly populated and maintained.
-
-An integrator sets the columns in the database by talking to the ovsdb-server
-daemon.  A few of the columns can be set during startup by calling the ovs-ctl
-tool from inside the startup scripts.  The ``xenserver/etc_init.d_openvswitch``
-script provides examples of its use, and the ovs-ctl(8) manpage contains
-complete documentation.  At runtime, ovs-vsctl can be used to set columns in
-the database.  The script ``xenserver/etc_xensource_scripts_vif`` contains
-examples of its use, and ovs-vsctl(8) manpage contains complete documentation.
-
-Python and C bindings to the database are provided if deeper integration with a
-program is needed.  The XenServer ovs-xapi-sync daemon
-(``xenserver/usr_share_openvswitch_scripts_ovs-xapi-sync``) provides an example
-of using the Python bindings.  More information on the python bindings is
-available at ``python/ovs/db/idl.py``.  Information on the C bindings is
-available at ``lib/ovsdb-idl.h``.
-
-The following diagram shows how integration scripts fit into the Open vSwitch
-architecture:
-
-::
-
-    Diagram
-
-             +----------------------------------------+
-             |           Controller Cluster           |
-             +----------------------------------------+
-                                 |
-                                 |
-    +----------------------------------------------------------+
-    |                            |                             |
-    |             +--------------+---------------+             |
-    |             |                              |             |
-    |   +-------------------+           +------------------+   |
-    |   |   ovsdb-server    |-----------|   ovs-vswitchd   |   |
-    |   +-------------------+           +------------------+   |
-    |             |                              |             |
-    |  +---------------------+                   |             |
-    |  | Integration scripts |                   |             |
-    |  | (ex: ovs-xapi-sync) |                   |             |
-    |  +---------------------+                   |             |
-    |                                            |   Userspace |
-    |----------------------------------------------------------|
-    |                                            |      Kernel |
-    |                                            |             |
-    |                                 +---------------------+  |
-    |                                 |  OVS Kernel Module  |  |
-    |                                 +---------------------+  |
-    +----------------------------------------------------------+
-
-A description of the most relevant fields for integration follows.  By setting
-these values, controllers are able to understand the network and manage it more
-dynamically and precisely.  For more details about the database and each
-individual column, please refer to the ovs-vswitchd.conf.db(5) manpage.
-
-``Open_vSwitch`` table
-----------------------
-
-The ``Open_vSwitch`` table describes the switch as a whole.  The
-``system_type`` and ``system_version`` columns identify the platform to the
-controller.  The ``external_ids:system-id`` key uniquely identifies the
-physical host.  In XenServer, the system-id will likely be the same as the UUID
-returned by ``xe host-list``. This key allows controllers to distinguish
-between multiple hypervisors.
-
-Most of this configuration can be done with the ovs-ctl command at startup.
-For example:
-
-::
-
-    $ ovs-ctl --system-type="XenServer" --system-version="6.0.0-50762p" \
-        --system-id="${UUID}" "${other_options}" start
-
-Alternatively, the ovs-vsctl command may be used to set a particular value at
-runtime.  For example:
-
-::
-
-    $ ovs-vsctl set open_vswitch . external-ids:system-id='"${UUID}"'
-
-The ``other_config:enable-statistics`` key may be set to ``true`` to have OVS
-populate the database with statistics (e.g., number of CPUs, memory, system
-load) for the controller's use.
-
-Bridge table
-------------
-
-The Bridge table describes individual bridges within an Open vSwitch instance.
-The ``external-ids:bridge-id`` key uniquely identifies a particular bridge.  In
-XenServer, this will likely be the same as the UUID returned by ``xe
-network-list`` for that particular bridge.
-
-For example, to set the identifier for bridge "br0", the following command can
-be used:
-
-::
-
-    $ ovs-vsctl set Bridge br0 external-ids:bridge-id='"${UUID}"'
-
-The MAC address of the bridge may be manually configured by setting it with the
-``other_config:hwaddr`` key.  For example:
-
-::
-
-    $ ovs-vsctl set Bridge br0 other_config:hwaddr="12:34:56:78:90:ab"
-
-Interface table
----------------
-
-The Interface table describes an interface under the control of Open vSwitch.
-The ``external_ids`` column contains keys that are used to provide additional
-information about the interface:
-
-attached-mac
-
-  This field contains the MAC address of the device attached to the interface.
-  On a hypervisor, this is the MAC address of the interface as seen inside a
-  VM.  It does not necessarily correlate to the host-side MAC address.  For
-  example, on XenServer, the MAC address on a VIF in the hypervisor is always
-  FE:FF:FF:FF:FF:FF, but inside the VM a normal MAC address is seen.
-
-iface-id
-
-  This field uniquely identifies the interface.  In hypervisors, this allows
-  the controller to follow VM network interfaces as VMs migrate.  A well-chosen
-  identifier should also allow an administrator or a controller to associate
-  the interface with the corresponding object in the VM management system.  For
-  example, the Open vSwitch integration with XenServer by default uses the
-  XenServer assigned UUID for a VIF record as the iface-id.
-
-iface-status
-
-  In a hypervisor, there are situations where there are multiple interface
-  choices for a single virtual ethernet interface inside a VM.  Valid values
-  are "active" and "inactive".  A complete description is available in the
-  ovs-vswitchd.conf.db(5) manpage.
-
-vm-id
-
-  This field uniquely identifies the VM to which this interface belongs.  A
-  single VM may have multiple interfaces attached to it.
-
-As in the previous tables, the ovs-vsctl command may be used to configure the
-values.  For example, to set the ``iface-id`` on eth0, the following command
-can be used:
-
-::
-
-    $ ovs-vsctl set Interface eth0 external-ids:iface-id='"${UUID}"'
-
-
-HA for OVN DB servers using pacemaker
--------------------------------------
-
-The ovsdb servers can work in either active or backup mode. In backup mode, db
-server will be connected to an active server and replicate the active servers
-contents. At all times, the data can be transacted only from the active server.
-When the active server dies for some reason, entire OVN operations will be
-stalled.
-
-`Pacemaker <http://clusterlabs.org/pacemaker.html>`_ is a cluster resource
-manager which can manage a defined set of resources across a set of clustered
-nodes. Pacemaker manages the resource with the help of the resource agents.
-One such resource agent is
-`OCF <http://www.linux-ha.org/wiki/OCF_Resource_Agents>`_
-
-OCF is nothing but a shell script which accepts a set of actions and returns an
-appropriate status code.
-
-With the help of the OCF resource agent ovn/utilities/ovndb-servers.ocf, one
-can define a resource for the pacemaker such that pacemaker will always
-maintain one running active server at any time.
-
-After creating a pacemaker cluster, use the following commands to create
-one active and multiple backup servers for OVN databases.
-
-::
-
-    pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
-         master_ip=x.x.x.x \
-         ovn_ctl=<path of the ovn-ctl script> \
-         op monitor interval="10s" \
-         op monitor role=Master interval="15s"
-
-    pcs resource master ovndb_servers-master ovndb_servers \
-        meta notify="true"
-
-The `master_ip` and `ovn_ctl` are the parameters that will be used by the
-OCF script. `ovn_ctl` is optional, if not given, it assumes a default value of
-/usr/share/openvswitch/scripts/ovn-ctl. `master_ip` is the IP address on which
-the active database server is expected to be listening.
-
-Whenever the active server dies, pacemaker is responsible to promote one of
-the backup servers to be active. Both ovn-controller and ovn-northd need the
-ip-address at which the active server is listening. With pacemaker changing the
-node at which the active server is run, it is not efficient to instruct all the
-ovn-controllers and the ovn-northd to listen to the latest active server's
-ip-address.
-
-This problem can be solved by using a native ocf resource agent
-`ocf:heartbeat:IPaddr2`. The IPAddr2 resource agent is just a resource with an
-ip-address. When we colocate this resource with the active server, pacemaker
-will enable the active server to be connected with a single ip-address all the
-time. This is the ip-address that needs to be given as the parameter while
-creating the `ovndb_servers` resource.
-
-Use the following command to create the IPAddr2 resource and colocate it
-with the active server.
-
-::
-
-    pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=x.x.x.x \
-        op monitor interval=30s
-
-    pcs constraint order VirtualIP then ovndb_servers-master
-
-    pcs constraint colocation add master ovndb_servers-master with VirtualIP \
-        score=INFINITY
diff --git a/Makefile.am b/Makefile.am

index 3579ac18a0748b92437f39780ff42039c719d904..427ac07eb8f0ba31dea4cf33067200ef73ff4a44 100644 (file)
--- a/Makefile.am
+++ b/Makefile.am
@@ -68,12 +68,8 @@ PYCOV_CLEAN_FILES = build-aux/check-structs,cover
  docs = \
         AUTHORS.rst \
         CONTRIBUTING.rst \
-       DESIGN.rst \
         FAQ.rst \
-       IntegrationGuide.rst \
         MAINTAINERS.rst \
-       OPENFLOW.rst \
-       PORTING.rst \
         README.rst \
         WHY-OVS.rst
  EXTRA_DIST = \
diff --git a/OPENFLOW.rst b/OPENFLOW.rst

deleted file mode 100644 (file)

index b62c5be..0000000
--- a/OPENFLOW.rst
+++ /dev/null
@@ -1,415 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-================================
-OpenFlow Support in Open vSwitch
-================================
-
-Open vSwitch support for OpenFlow 1.1 and beyond is a work in progress.  This
-file describes the work still to be done.
-
-The Plan
---------
-
-OpenFlow version support is not a build-time option.  A single build of Open
-vSwitch must be able to handle all supported versions of OpenFlow.  Ideally,
-even at runtime it should be able to support all protocol versions at the same
-time on different OpenFlow bridges (and perhaps even on the same bridge).
-
-At the same time, it would be a shame to litter the core of the OVS code with
-lots of ugly code concerned with the details of various OpenFlow protocol
-versions.
-
-The primary approach to compatibility is to abstract most of the details of the
-differences from the core code, by adding a protocol layer that translates
-between OF1.x and a slightly higher-level abstract representation.  The core of
-this approach is the many ``struct ofputil_*`` structures in
-``include/openvswitch/ofp-util.h``.
-
-As a consequence of this approach, OVS cannot use OpenFlow protocol definitions
-that closely resemble those in the OpenFlow specification, because
-``openflow.h`` in different versions of the OpenFlow specification defines the
-same identifier with different values.  Instead, ``openflow-common.h`` contains
-definitions that are common to all the specifications and separate protocol
-version-specific headers contain protocol-specific definitions renamed so as
-not to conflict, e.g. ``OFPAT10_ENQUEUE`` and ``OFPAT11_ENQUEUE`` for the
-OpenFlow 1.0 and 1.1 values for ``OFPAT_ENQUEUE``.  Generally, in cases of
-conflict, the protocol layer will define a more abstract ``OFPUTIL_*`` or
-struct ``ofputil_*``.
-
-Here are the current approaches in a few tricky areas:
-
-* Port numbering.
-
-  OpenFlow 1.0 has 16-bit port numbers and later OpenFlow versions have 32-bit
-  port numbers.  For now, OVS support for later protocol versions requires all
-  port numbers to fall into the 16-bit range, translating the reserved
-  ``OFPP_*`` port numbers.
-
-* Actions.
-
-  OpenFlow 1.0 and later versions have very different ideas of actions.  OVS
-  reconciles by translating all the versions' actions (and instructions) to and
-  from a common internal representation.
-
-OpenFlow 1.1
-------------
-
-The list of remaining work items for OpenFlow 1.1 is below.  It is probably
-incomplete.
-
-* Match and set double-tagged VLANs (QinQ).
-
-  This requires kernel work for reasonable performance.
-
-  (optional for OF1.1+)
-
-* VLANs tagged with 88a8 Ethertype.
-
-  This requires kernel work for reasonable performance.
-
-  (required for OF1.1+)
-
-OpenFlow 1.2
-------------
-
-OpenFlow 1.2 support requires OpenFlow 1.1 as a prerequisite. All the
-additional work specific to OpenFlow 1.2 is complete.  (This is based on the
-change log at the end of the OF1.2 spec.  I didn't compare the specs carefully
-yet.)
-
-OpenFlow 1.3
-------------
-
-OpenFlow 1.3 support requires OpenFlow 1.2 as a prerequisite, plus the
-following additional work.  (This is based on the change log at the end of the
-OF1.3 spec, reusing most of the section titles directly.  I didn't compare the
-specs carefully yet.)
-
-* Add support for multipart requests.
-
-  Currently we always report ``OFPBRC_MULTIPART_BUFFER_OVERFLOW``.
-
-  (optional for OF1.3+)
-
-* IPv6 extension header handling support.
-
-  Fully implementing this requires kernel support.  This likely will take some
-  careful and probably time-consuming design work.  The actual coding, once
-  that is all done, is probably 2 or 3 days work.
-
-  (optional for OF1.3+)
-
-* Per-flow meters.
-
-  OpenFlow protocol support is now implemented.  Support for the special
-  ``OFPM_SLOWPATH`` and ``OFPM_CONTROLLER`` meters is missing.  Support for
-  the software switch is under review.
-
-  (optional for OF1.3+)
-
-* Auxiliary connections.
-
-  An implementation in generic code might be a week's worth of work.  The value
-  of an implementation in generic code is questionable, though, since much of
-  the benefit of auxiliary connections is supposed to be to take advantage of
-  hardware support.  (We could make the kernel module somehow send packets
-  across the auxiliary connections directly, for some kind of "hardware"
-  support, if we judged it useful enough.)
-
-  (optional for OF1.3+)
-
-* Provider Backbone Bridge tagging.
-
-  I don't plan to implement this (but we'd accept an implementation).
-
-  (optional for OF1.3+)
-
-* On-demand flow counters.
-
-  I think this might be a real optimization in some cases for the software
-  switch.
-
-  (optional for OF1.3+)
-
-OpenFlow 1.4 & ONF Extensions for 1.3.X Pack1
----------------------------------------------
-
-The following features are both defined as a set of ONF Extensions for 1.3 and
-integrated in 1.4.
-
-When defined as an ONF Extension for 1.3, the feature uses the Experimenter
-mechanism with the ONF Experimenter ID.
-
-When integrated in 1.4, the feature uses the standard OpenFlow structures (for
-example, those defined in openflow-1.4.h).
-
-The two definitions for each feature are independent and can exist in parallel
-in OVS.
-
-
-* Flow entry notifications
-
-  This seems to be modelled after OVS's NXST_FLOW_MONITOR.  (Simon Horman is
-  working on this.)
-
-  (EXT-187)
-
-  (optional for OF1.4+)
-
-* Role Status
-
-  Already implemented as a 1.4 feature.
-
-  (EXT-191)
-
-  (required for OF1.4+)
-
-* Flow entry eviction
-
-  OVS has flow eviction functionality.  ``table_mod OFPTC_EVICTION``,
-  ``flow_mod 'importance'``, and ``table_desc ofp_table_mod_prop_eviction``
-  need to be implemented.
-
-  (EXT-192-e)
-
-  (optional for OF1.4+)
-
-* Vacancy events
-
-  (EXT-192-v)
-
-  (optional for OF1.4+)
-
-* Bundle
-
-  Transactional modification.  OpenFlow 1.4 requires support for
-  ``flow_mods`` and ``port_mods`` in a bundle if bundles are supported.
-  (Not related to OVS's 'ofbundle' stuff.)
-
-  Implemented as an OpenFlow 1.4 feature.  Only flow_mods and port_mods are
-  supported in a bundle.  If the bundle includes port mods, it may not specify
-  the ``OFPBF_ATOMIC`` flag.  Nevertheless, port mods and flow mods in a bundle
-  are always applied in order and consecutive flow mods between port mods are
-  made available to lookups atomically.
-
-  (EXT-230)
-
-  (optional for OF1.4+)
-
-* Table synchronisation
-
-  Probably not so useful to the software switch.
-
-  (EXT-232)
-
-  (optional for OF1.4+)
-
-* Group and Meter change notifications
-
-  (EXT-235)
-
-  (optional for OF1.4+)
-
-* Bad flow entry priority error
-
-  Probably not so useful to the software switch.
-
-  (EXT-236)
-
-  (optional for OF1.4+)
-
-* Set async config error
-
-  (EXT-237)
-
-  (optional for OF1.4+)
-
-* PBB UCA header field
-
-  See comment on Provider Backbone Bridge in section about OpenFlow 1.3.
-
-  (EXT-256)
-
-  (optional for OF1.4+)
-
-* Multipart timeout error
-
-  (EXT-264)
-
-  (required for OF1.4+)
-
-OpenFlow 1.4 only
------------------
-
-The following features are available only in OpenFlow 1.4.  Other OpenFlow 1.4
-features are listed in the previous section.
-
-* More extensible wire protocol
-
-  Many on-wire structures got TLVs.
-
-  All required features are now supported.
-  Remaining optional: table desc, table-status
-
-  (EXT-262)
-
-  (required for OF1.4+)
-
-* More descriptive reasons for packet-in
-
-  Distinguish ``OFPR_APPLY_ACTION``, ``OFPR_ACTION_SET``, ``OFPR_GROUP``,
-  ``OFPR_PACKET_OUT``.  ``NO_MATCH`` was renamed to ``OFPR_TABLE_MISS``.
-  (OFPR_ACTION_SET and OFPR_GROUP are now supported)
-
-  (EXT-136)
-
-  (required for OF1.4+)
-
-* Optical port properties
-
-  (EXT-154)
-
-  (optional for OF1.4+)
-
-OpenFlow 1.5 & ONF Extensions for 1.3.X Pack2
----------------------------------------------
-
-The following features are both defined as a set of ONF Extensions for 1.3 and
-integrated in 1.5. Note that this list is not definitive as those are not yet
-published.
-
-When defined as an ONF Extension for 1.3, the feature uses the Experimenter
-mechanism with the ONF Experimenter ID.  When integrated in 1.5, the feature
-uses the standard OpenFlow structures (for example, those defined in
-openflow-1.5.h).
-
-The two definitions for each feature are independent and can exist in parallel
-in OVS.
-
-* Time scheduled bundles
-
-  (EXT-340)
-
-  (optional for OF1.5+)
-
-OpenFlow 1.5 only
------------------
-
-The following features are available only in OpenFlow 1.5.  Other OpenFlow 1.5
-features are listed in the previous section.  Note that this list is not
-definitive as OpenFlow 1.5 is not yet published.
-
-* Egress Tables
-
-  (EXT-306)
-
-  (optional for OF1.5+)
-
-* Packet Type aware pipeline
-
-  Prototype for OVS was done during specification.
-
-  (EXT-112)
-
-  (optional for OF1.5+)
-
-* Extensible Flow Entry Statistics
-
-  (EXT-334)
-
-  (required for OF1.5+)
-
-* Flow Entry Statistics Trigger
-
-  (EXT-335)
-
-  (optional for OF1.5+)
-
-* Controller connection status
-
-  Prototype for OVS was done during specification.
-
-  (EXT-454)
-
-  (optional for OF1.5+)
-
-* Meter action
-
-  (EXT-379)
-
-  (required for OF1.5+ if metering is supported)
-
-* Enable setting all pipeline fields in packet-out
-
-  Prototype for OVS was done during specification.
-
-  (EXT-427)
-
-  (required for OF1.5+)
-
-* Port properties for pipeline fields
-
-  Prototype for OVS was done during specification.
-
-  (EXT-388)
-
-  (optional for OF1.5+)
-
-* Port property for recirculation
-
-  Prototype for OVS was done during specification.
-
-  (EXT-399)
-
-  (optional for OF1.5+)
-
-General
--------
-
-* ovs-ofctl(8) often lists as Nicira extensions features that later OpenFlow
-  versions support in standard ways.
-
-How to contribute
------------------
-
-If you plan to contribute code for a feature, please let everyone know on
-ovs-dev before you start work.  This will help avoid duplicating work.
-
-Please consider the following:
-
-* Testing.  Please test your code.
-
-* Unit tests.  Please consider writing some.  The tests directory has many
-  examples that you can use as a starting point.
-
-* ovs-ofctl.  If you add a feature that is useful for some ovs-ofctl command
-  then you should add support for it there.
-
-* Documentation.  If you add a user-visible feature, then you should document
-  it in the appropriate manpage and mention it in NEWS as well.
-
-* Coding style (see the `coding style guide <CodingStyle.rst>`__ file at the top
-  of the source tree).
-
-* The `patch submission guidelines <CONTRIBUTING.rst>`__.  I recommend using
-  "git send-email", which automatically follows a lot of those guidelines.
diff --git a/PORTING.rst b/PORTING.rst

deleted file mode 100644 (file)

index bae8cd9..0000000
--- a/PORTING.rst
+++ /dev/null
@@ -1,332 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-================================================
-Porting Open vSwitch to New Software or Hardware
-================================================
-
-Open vSwitch (OVS) is intended to be easily ported to new software and hardware
-platforms.  This document describes the types of changes that are most likely
-to be necessary in porting OVS to Unix-like platforms.  (Porting OVS to other
-kinds of platforms is likely to be more difficult.)
-
-Vocabulary
-----------
-
-For historical reasons, different words are used for essentially the same
-concept in different areas of the Open vSwitch source tree.  Here is a
-concordance, indexed by the area of the source tree:
-
-::
-
-    datapath/       vport           ---
-    vswitchd/       iface           port
-    ofproto/        port            bundle
-    ofproto/bond.c  slave           bond
-    lib/lacp.c      slave           lacp
-    lib/netdev.c    netdev          ---
-    database        Interface       Port
-
-Open vSwitch Architectural Overview
------------------------------------
-
-The following diagram shows the very high-level architecture of Open vSwitch
-from a porter's perspective.
-
-::
-
-    +-------------------+
-    |    ovs-vswitchd   |<-->ovsdb-server
-    +-------------------+
-    |      ofproto      |<-->OpenFlow controllers
-    +--------+-+--------+
-    | netdev | | ofproto|
-    +--------+ |provider|
-    | netdev | +--------+
-    |provider|
-    +--------+
-
-Some of the components are generic.  Modulo bugs or inadequacies, these
-components should not need to be modified as part of a port:
-
-ovs-vswitchd
-  The main Open vSwitch userspace program, in vswitchd/.  It reads the desired
-  Open vSwitch configuration from the ovsdb-server program over an IPC channel
-  and passes this configuration down to the "ofproto" library.  It also passes
-  certain status and statistical information from ofproto back into the
-  database.
-
-ofproto
-  The Open vSwitch library, in ofproto/, that implements an OpenFlow switch.
-  It talks to OpenFlow controllers over the network and to switch hardware or
-  software through an "ofproto provider", explained further below.
-
-netdev
-  The Open vSwitch library, in lib/netdev.c, that abstracts interacting with
-  network devices, that is, Ethernet interfaces.  The netdev library is a thin
-  layer over "netdev provider" code, explained further below.
-
-The other components may need attention during a port.  You will almost
-certainly have to implement a "netdev provider".  Depending on the type of port
-you are doing and the desired performance, you may also have to implement an
-"ofproto provider" or a lower-level component called a "dpif" provider.
-
-The following sections talk about these components in more detail.
-
-Writing a netdev Provider
--------------------------
-
-A "netdev provider" implements an operating system and hardware specific
-interface to "network devices", e.g. eth0 on Linux.  Open vSwitch must be able
-to open each port on a switch as a netdev, so you will need to implement a
-"netdev provider" that works with your switch hardware and software.
-
-``struct netdev_class``, in ``lib/netdev-provider.h``, defines the interfaces
-required to implement a netdev.  That structure contains many function
-pointers, each of which has a comment that is meant to describe its behavior in
-detail.  If the requirements are unclear, report this as a bug.
-
-The netdev interface can be divided into a few rough categories:
-
-- Functions required to properly implement OpenFlow features.  For example,
-  OpenFlow requires the ability to report the Ethernet hardware address of a
-  port.  These functions must be implemented for minimally correct operation.
-
-- Functions required to implement optional Open vSwitch features.  For example,
-  the Open vSwitch support for in-band control requires netdev support for
-  inspecting the TCP/IP stack's ARP table.  These functions must be implemented
-  if the corresponding OVS features are to work, but may be omitted initially.
-
-- Functions needed in some implementations but not in others.  For example,
-  most kinds of ports (see below) do not need functionality to receive packets
-  from a network device.
-
-The existing netdev implementations may serve as useful examples during a port:
-
-- lib/netdev-linux.c implements netdev functionality for Linux network devices,
-  using Linux kernel calls.  It may be a good place to start for full-featured
-  netdev implementations.
-
-- lib/netdev-vport.c provides support for "virtual ports" implemented by the
-  Open vSwitch datapath module for the Linux kernel.  This may serve as a model
-  for minimal netdev implementations.
-
-- lib/netdev-dummy.c is a fake netdev implementation useful only for testing.
-
-.. _porting strategies:
-
-Porting Strategies
-------------------
-
-After a netdev provider has been implemented for a system's network devices,
-you may choose among three basic porting strategies.
-
-.. TODO(stephenfin): Update the link to the installation guide when this is
-   moved
-
-The lowest-effort strategy is to use the "userspace switch" implementation
-built into Open vSwitch.  This ought to work, without writing any more code, as
-long as the netdev provider that you implemented supports receiving packets.
-It yields poor performance, however, because every packet passes through the
-ovs-vswitchd process.  See the `userspace installation guide` for instructions
-on how to configure a userspace switch.
-
-If the userspace switch is not the right choice for your port, then you will
-have to write more code.  You may implement either an "ofproto provider" or a
-"dpif provider".  Which you should choose depends on a few different factors:
-
-* Only an ofproto provider can take full advantage of hardware with built-in
-  support for wildcards (e.g. an ACL table or a TCAM).
-
-* A dpif provider can take advantage of the Open vSwitch built-in
-  implementations of bonding, LACP, 802.1ag, 802.1Q VLANs, and other features.
-  An ofproto provider has to provide its own implementations, if the hardware
-  can support them at all.
-
-* A dpif provider is usually easier to implement, but most appropriate for
-  software switching.  It "explodes" wildcard rules into exact-match entries
-  (with an optional wildcard mask).  This allows fast hash lookups in software,
-  but makes inefficient use of TCAMs in hardware that support wildcarding.
-
-The following sections describe how to implement each kind of port.
-
-ofproto Providers
------------------
-
-An "ofproto provider" is what ofproto uses to directly monitor and control an
-OpenFlow-capable switch.  ``struct ofproto_class``, in
-``ofproto/ofproto-provider.h``, defines the interfaces to implement an ofproto
-provider for new hardware or software.  That structure contains many function
-pointers, each of which has a comment that is meant to describe its behavior in
-detail.  If the requirements are unclear, report this as a bug.
-
-The ofproto provider interface is preliminary.  Let us know if it seems
-unsuitable for your purpose.  We will try to improve it.
-
-Writing a dpif Provider
------------------------
-
-Open vSwitch has a built-in ofproto provider named "ofproto-dpif", which is
-built on top of a library for manipulating datapaths, called "dpif".  A
-"datapath" is a simple flow table, one that is only required to support
-exact-match flows, that is, flows without wildcards.  When a packet arrives on
-a network device, the datapath looks for it in this table.  If there is a
-match, then it performs the associated actions.  If there is no match, the
-datapath passes the packet up to ofproto-dpif, which maintains the full
-OpenFlow flow table.  If the packet matches in this flow table, then
-ofproto-dpif executes its actions and inserts a new entry into the dpif flow
-table.  (Otherwise, ofproto-dpif passes the packet up to ofproto to send the
-packet to the OpenFlow controller, if one is configured.)
-
-When calculating the dpif flow, ofproto-dpif generates an exact-match flow that
-describes the missed packet.  It makes an effort to figure out what fields can
-be wildcarded based on the switch's configuration and OpenFlow flow table.  The
-dpif is free to ignore the suggested wildcards and only support the exact-match
-entry.  However, if the dpif supports wildcarding, then it can use the masks to
-match multiple flows with fewer entries and potentially significantly reduce
-the number of flow misses handled by ofproto-dpif.
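The lookup-then-miss behaviour described above can be illustrated with a small sketch. It is not taken from the Open vSwitch tree: ``toy_key``, ``toy_flow``, ``toy_datapath`` and the ``toy_*`` functions are hypothetical names, the key has only three fields, and a linear scan stands in for the hash lookup a real datapath would use. A mask of all ones gives an exact-match entry; clearing bits in the mask lets one entry cover several packets, which is the effect the suggested wildcards are meant to have.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified flow key.  A real datapath key carries
 * many more fields; these three are only for illustration. */
struct toy_key {
    uint32_t in_port;
    uint32_t ip_dst;
    uint16_t tp_dst;
};

/* One datapath flow entry: a pre-masked key, the mask that was applied to
 * it (all-ones means exact match), and a single "output" action. */
struct toy_flow {
    struct toy_key key;
    struct toy_key mask;
    int out_port;
    bool in_use;
};

#define TOY_TABLE_SIZE 8

struct toy_datapath {
    struct toy_flow flows[TOY_TABLE_SIZE];
    unsigned int misses;        /* Packets that had to be sent "up". */
};

/* Returns true if 'pkt', after masking, equals the flow's stored key. */
static bool
toy_match(const struct toy_flow *flow, const struct toy_key *pkt)
{
    return (pkt->in_port & flow->mask.in_port) == flow->key.in_port
        && (pkt->ip_dst & flow->mask.ip_dst) == flow->key.ip_dst
        && (pkt->tp_dst & flow->mask.tp_dst) == flow->key.tp_dst;
}

/* Looks up 'pkt'.  Returns the output port on a hit, or -1 on a miss, in
 * which case the caller would pass the packet to the upper layer. */
static int
toy_lookup(struct toy_datapath *dp, const struct toy_key *pkt)
{
    for (size_t i = 0; i < TOY_TABLE_SIZE; i++) {
        if (dp->flows[i].in_use && toy_match(&dp->flows[i], pkt)) {
            return dp->flows[i].out_port;
        }
    }
    dp->misses++;
    return -1;
}

/* Installs an entry for 'key' under 'mask'.  Returns false if the table is
 * full.  The key is stored pre-masked so that toy_match() stays simple. */
static bool
toy_install(struct toy_datapath *dp, const struct toy_key *key,
            const struct toy_key *mask, int out_port)
{
    for (size_t i = 0; i < TOY_TABLE_SIZE; i++) {
        struct toy_flow *flow = &dp->flows[i];
        if (!flow->in_use) {
            flow->key.in_port = key->in_port & mask->in_port;
            flow->key.ip_dst = key->ip_dst & mask->ip_dst;
            flow->key.tp_dst = key->tp_dst & mask->tp_dst;
            flow->mask = *mask;
            flow->out_port = out_port;
            flow->in_use = true;
            return true;
        }
    }
    return false;
}
```

In this sketch, the first packet of a connection misses, the upper layer installs an entry whose mask ignores ``tp_dst``, and later packets that differ only in that field hit the same entry without a further miss.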
-
-The "dpif" library in turn delegates much of its functionality to a "dpif
-provider".  The following diagram shows how dpif providers fit into the Open
-vSwitch architecture:
-
-::
-
-
-    Architecture
-
-               _
-              |   +-------------------+
-              |   |    ovs-vswitchd   |<-->ovsdb-server
-              |   +-------------------+
-              |   |      ofproto      |<-->OpenFlow controllers
-              |   +--------+-+--------+  _
-              |   | netdev | |ofproto-|   |
-    userspace |   +--------+ |  dpif  |   |
-              |   | netdev | +--------+   |
-              |   |provider| |  dpif  |   |
-              |   +---||---+ +--------+   |
-              |       ||     |  dpif  |   | implementation of
-              |       ||     |provider|   | ofproto provider
-              |_      ||     +---||---+   |
-                      ||         ||       |
-               _  +---||-----+---||---+   |
-              |   |          |datapath|   |
-       kernel |   |          +--------+  _|
-              |   |                   |
-              |_  +--------||---------+
-                           ||
-                        physical
-                           NIC
-
-``struct dpif_class``, in ``lib/dpif-provider.h``, defines the interfaces
-required to implement a dpif provider for new hardware or software.  That
-structure contains many function pointers, each of which has a comment that is
-meant to describe its behavior in detail.  If the requirements are unclear,
-report this as a bug.
-
-There are two existing dpif implementations that may serve as useful examples
-during a port:
-
-* lib/dpif-netlink.c is a Linux-specific dpif implementation that talks to an
-  Open vSwitch-specific kernel module (whose sources are in the "datapath"
-  directory).  The kernel module performs all of the switching work, passing
-  packets that do not match any flow table entry up to userspace.  This dpif
-  implementation is essentially a wrapper around calls into the kernel module.
-
-* lib/dpif-netdev.c is a generic dpif implementation that performs all
-  switching internally.  This is how the Open vSwitch userspace switch is
-  implemented.
-
-Miscellaneous Notes
--------------------
-
-Open vSwitch source code uses ``uint16_t``, ``uint32_t``, and ``uint64_t`` as
-fixed-width types in host byte order, and ``ovs_be16``, ``ovs_be32``, and
-``ovs_be64`` as fixed-width types in network byte order.  Each of the latter is
-equivalent to one of the former, but the difference in name makes the
-intended use obvious.
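As a rough illustration of why the distinct names help, the sketch below keeps network-byte-order fields in their own typedefs and converts only at the boundary. The names ``toy_be16``, ``toy_be32`` and ``toy_header`` are hypothetical stand-ins, not the OVS definitions; the conversions use the standard POSIX ``htons()``/``htonl()`` family from ``<arpa/inet.h>``.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for network-byte-order types.  They are plain
 * aliases of the host types; only the name signals the intended use. */
typedef uint16_t toy_be16;
typedef uint32_t toy_be32;

/* A toy header whose fields are stored exactly as they appear on the wire. */
struct toy_header {
    toy_be16 port;
    toy_be32 addr;
};

/* Builds a header from host-order values, converting at the boundary. */
static struct toy_header
toy_header_make(uint16_t port, uint32_t addr)
{
    struct toy_header h;

    h.port = htons(port);       /* host -> network, 16 bits */
    h.addr = htonl(addr);       /* host -> network, 32 bits */
    return h;
}

/* Accessors convert back to host order before the value is used. */
static uint16_t
toy_header_port(const struct toy_header *h)
{
    return ntohs(h->port);
}

static uint32_t
toy_header_addr(const struct toy_header *h)
{
    return ntohl(h->addr);
}
```

With this convention, a reader who sees a ``toy_be16`` compared against a plain integer constant knows at a glance that a conversion is probably missing.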
-
-The default "fail-mode" for Open vSwitch bridges is "standalone", meaning that,
-when the OpenFlow controllers cannot be contacted, Open vSwitch acts as a
-regular MAC-learning switch.  This works well in virtualization environments
-where there is normally just one uplink (either a single physical interface or
-a bond).  In a more general environment, it can create loops.  So, if you are
-porting to a general-purpose switch platform, you should consider changing the
-default "fail-mode" to "secure", which does not behave this way.  See
-documentation for the "fail-mode" column in the Bridge table in
-ovs-vswitchd.conf.db(5) for more information.
-
-``lib/entropy.c`` assumes that it can obtain high-quality random number seeds
-at startup by reading from /dev/urandom.  You will need to modify it if this is
-not true on your platform.
-
-``vswitchd/system-stats.c`` only knows how to obtain some statistics on Linux.
-Optionally you may implement them for your platform as well.
-
-Why OVS Does Not Support Hybrid Providers
------------------------------------------
-
-The `porting strategies`_ section above describes the "ofproto provider" and
-"dpif provider" porting strategies.  Only an ofproto provider can take
-advantage of hardware TCAM support, and only a dpif provider can take advantage
-of the OVS built-in implementations of various features.  It is therefore
-tempting to suggest a hybrid approach that shares the advantages of both
-strategies.
-
-However, Open vSwitch does not support a hybrid approach.  Doing so may be
-possible, with a significant amount of extra development work, but it does not
-yet seem worthwhile, for the reasons explained below.
-
-First, user surprise is likely when a switch supports a feature only with a
-high performance penalty.  For example, one user questioned why adding a
-particular OpenFlow action to a flow caused a 1,058x slowdown on a hardware
-OpenFlow implementation [1]_.  The action required the flow to be implemented in
-software.
-
-Given that implementing a flow in software on the slow management CPU of a
-hardware switch causes a major slowdown, software-implemented flows would only
-make sense for very low-volume traffic.  But many of the features built into
-the OVS software switch implementation would need to apply to every flow to be
-useful.  There is no value, for example, in applying bonding or 802.1Q VLAN
-support only to low-volume traffic.
-
-Besides supporting features of OpenFlow actions, a hybrid approach could also
-support forms of matching not supported by particular switching hardware, by
-sending all packets that might match a rule to software.  But again this can
-cause an unacceptable slowdown by forcing bulk traffic through software in the
-hardware switch's slow management CPU.  Consider, for example, a hardware
-switch that can match on the IPv6 Ethernet type but not on fields in IPv6
-headers.  An OpenFlow table that matched on the IPv6 Ethernet type would
-perform well, but adding a rule that matched only UDPv6 would force every IPv6
-packet to software, slowing down not just UDPv6 but all IPv6 processing.
-
-.. [1] Aaron Rosen, "Modify packet fields extremely slow",
-    openflow-discuss mailing list, June 26, 2011, archived at
-    https://mailman.stanford.edu/pipermail/openflow-discuss/2011-June/002386.html.
-
-Questions
----------
-
-Direct porting questions to dev@openvswitch.org.  We will try to use questions
-to improve this porting guide.
diff --git a/WHY-OVS.rst b/WHY-OVS.rst

index 5cfd7213e50374e5da10cd907232c5fc963bf284..e73066a766580f134080ccadeed56523a05e4bbf 100644 (file)
--- a/WHY-OVS.rst
+++ b/WHY-OVS.rst
@@ -109,8 +109,8 @@ implementation or a hardware switch.
  
  There are many ongoing efforts to port Open vSwitch to hardware chipsets. These
  include multiple merchant silicon chipsets (Broadcom and Marvell), as well as a
-number of vendor-specific platforms. (The PORTING file discusses how one would
-go about making such a port.)
+number of vendor-specific platforms. The "Porting" section in the documentation
+discusses how one would go about making such a port.
  
  The advantage of hardware integration is not only performance within
  virtualized environments. If physical switches also expose the Open vSwitch
diff --git a/datapath-windows/DESIGN.rst b/datapath-windows/DESIGN.rst

deleted file mode 100644 (file)

index 81c1da5..0000000
--- a/datapath-windows/DESIGN.rst
+++ /dev/null
@@ -1,510 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-=====================
-OVS-on-Hyper-V Design
-=====================
-
-This document provides details of the effort to develop Open vSwitch on
-Microsoft Hyper-V. This document should give enough information to understand
-the overall design.
-
-.. note::
-  The userspace portion of the OVS has been ported to Hyper-V in a separate
-  effort, and committed to the openvswitch repo. This document focuses mostly
-  on the kernel driver, though we touch upon some aspects of userspace as
-  well.
-
-Background Info
----------------
-
-Microsoft’s hypervisor solution - Hyper-V [1]_ implements a virtual switch
-that is extensible and provides opportunities for other vendors to implement
-functional extensions [2]_. The extensions need to be implemented as NDIS
-drivers that bind within the extensible switch driver stack provided. The
-extensions can broadly provide the functionality of monitoring, modifying and
-forwarding packets to destination ports on the Hyper-V extensible switch.
-Correspondingly, the extensions can be categorized into the following types and
-provide the functionality noted:
-
-* Capturing extensions: monitoring packets
-
-* Filtering extensions: monitoring, modifying packets
-
-* Forwarding extensions: monitoring, modifying, forwarding packets
-
-As can be expected, the kernel portion (datapath) of the OVS on Hyper-V
-solution will be implemented as a forwarding extension.
-
-In Hyper-V, the virtual machine is called the Child Partition. Each VIF or
-physical NIC on the Hyper-V extensible switch is attached via a port. Each port
-is on both the ingress path and the egress path of the switch. The ingress path
-is used for packets being sent out of a port, and the egress path is used for
-packets being received on a port. By design, NDIS provides a layered interface.
-In this layered interface, higher level layers call into lower level layers in
-the ingress path. In the egress path, it is the other way round. In addition,
-there is an object identifier (OID) interface for control operations, e.g.
-addition of a port. The workflow for the calls is similar in nature to the
-packets, where higher level layers call into the lower level layers. A good
-representational diagram of this architecture is in [4]_.
-
-Windows Filtering Platform (WFP) [5]_ is a platform implemented on Hyper-V that
-provides APIs and services for filtering packets. WFP has been utilized to
-filter on some of the packets that OVS is not equipped to handle directly. More
-details in later sections.
-
-IP Helper [6]_ is a set of APIs available on Hyper-V for retrieving network
-configuration information from the host machine. IP Helper has been used to
-retrieve some of the configuration information that OVS needs.
-
-Design
-------
-
-::
-
-    Various blocks of the OVS Windows implementation
-
-                                      +-------------------------------+
-                                      |                               |
-                                      |        CHILD PARTITION        |
-                                      |                               |
-      +------+ +--------------+       | +-----------+  +------------+ |
-      |      | |              |       | |           |  |            | |
-      | ovs- | |     OVS-     |       | | Virtual   |  | Virtual    | |
-      | *ctl | |  USERSPACE   |       | | Machine #1|  | Machine #2 | |
-      |      | |    DAEMON    |       | |           |  |            | |
-      +------+-++---+---------+       | +--+------+-+  +----+------++ | +--------+
-      |  dpif-  |   | netdev- |       |    |VIF #1|         |VIF #2|  | |Physical|
-      | netlink |   | windows |       |    +------+         +------+  | |  NIC   |
-      +---------+   +---------+       |      ||                   /\  | +--------+
-    User     /\         /\            |      || *#1*         *#4* ||  |     /\
-    =========||=========||============+------||-------------------||--+     ||
-    Kernel   ||         ||                   \/                   ||  ||=====/
-             \/         \/                +-----+                 +-----+ *#5*
-     +-------------------------------+    |     |                 |     |
-     |   +----------------------+    |    |     |                 |     |
-     |   |   OVS Pseudo Device  |    |    |     |                 |     |
-     |   +----------------------+    |    |     |                 |     |
-     |      | Netlink Impl. |        |    |     |                 |     |
-     |      -----------------        |    |  I  |                 |     |
-     | +------------+                |    |  N  |                 |  E  |
-     | |  Flowtable | +------------+ |    |  G  |                 |  G  |
-     | +------------+ |  Packet    | |*#2*|  R  |                 |  R  |
-     |   +--------+   | Processing | |<=> |  E  |                 |  E  |
-     |   |   WFP  |   |            | |    |  S  |                 |  S  |
-     |   | Driver |   +------------+ |    |  S  |                 |  S  |
-     |   +--------+                  |    |     |                 |     |
-     |                               |    |     |                 |     |
-     |   OVS FORWARDING EXTENSION    |    |     |                 |     |
-     +-------------------------------+    +-----+-----------------+-----+
-                                          |HYPER-V Extensible Switch *#3|
-                                          +-----------------------------+
-                                                   NDIS STACK
-
-This diagram shows the various blocks involved in the OVS Windows
-implementation, along with some of the components available in the NDIS stack,
-and also the virtual machines. The workflow of a packet being transmitted from
-a VIF out and into another VIF and to a physical NIC is also shown. Later on in
-this section, we will discuss the flow of a packet at a high level.
-
-The figure gives a general idea of where the OVS userspace and the kernel
-components fit in, and how they interface with each other.
-
-The kernel portion (datapath) of the OVS on Hyper-V solution has been
-implemented as a forwarding extension roughly implementing the following
-sub-modules/functionality. Details of each of these sub-components in the
-kernel are contained in later sections:
-
-* Interfacing with the NDIS stack
-
-* Netlink message parser
-
-* Netlink sockets
-
-* Switch/Datapath management
-
-* Interfacing with userspace portion of the OVS solution to implement the
-  necessary functionality that userspace needs
-
-* Port management
-
-* Flowtable/Actions/packet forwarding
-
-* Tunneling
-
-* Event notifications
-
-The datapath for the OVS on Linux is a kernel module, and cannot be directly
-ported since there are significant differences in architecture even though the
-end functionality provided would be similar. Some examples of the differences
-are:
-
-* Interfacing with the NDIS stack to hook into the NDIS callbacks for
-  functionality such as receiving and sending packets, packet completions, OIDs
-  used for events such as a new port appearing on the virtual switch.
-
-* Interface between the userspace and the kernel module.
-
-* Event notifications are significantly different.
-
-* The communication interface between DPIF and the kernel module need not be
-  implemented in the way OVS on Linux does. That said, it would be advantageous
-  to have a similar interface to the kernel module for reasons of readability
-  and maintainability.
-
-* Any licensing issues of using Linux kernel code directly.
-
-Due to these differences, it was a straightforward decision to develop the
-datapath for OVS on Hyper-V from scratch rather than porting the one on Linux.
-A re-development focused on the following goals:
-
-* Adhere to the existing requirements of userspace portion of OVS (such as
-  ovs-vswitchd), to minimize changes in the userspace workflow.
-
-* Fit well into the typical workflow of a Hyper-V extensible switch forwarding
-  extension.
-
-The userspace portion of the OVS solution is mostly POSIX code, and not very
-Linux specific. Most of the userspace code does not interface directly with
-the kernel datapath and was ported independently of the kernel datapath effort.
-
-As explained in the OVS porting design document [7]_, DPIF is the portion of
-userspace that interfaces with the kernel portion of the OVS. The interface
-that each DPIF provider has to implement is defined in ``dpif-provider.h``
-[3]_.  Though each platform is allowed to have its own implementation of the
-DPIF provider, community feedback indicated a preference for sharing code
-whenever possible. Thus, the DPIF provider for OVS on Hyper-V shares code with
-the DPIF provider on Linux. This interface is implemented in
-``dpif-netlink.c``.
-
-We'll elaborate more on the kernel-userspace interface in a dedicated section
-below. Here it suffices to say that the DPIF provider implementation for
-Windows is netlink-based and shares code with the Linux one.
-
-Kernel Module (Datapath)
-------------------------
-
-Interfacing with the NDIS Stack
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-For each virtual switch on Hyper-V, the OVS extensible switch extension can be
-enabled/disabled. We support enabling the OVS extension on only one switch.
-This is consistent with using a single datapath in the kernel on Linux. All the
-physical adapters are connected as external adapters to the extensible switch.
-
-When the OVS switch extension registers itself as a filter driver, it also
-registers callbacks for the switch/port management and datapath functions. In
-other words, when a switch is created on the Hyper-V root partition (host), the
-extension gets an activate callback upon which it can initialize the data
-structures necessary for OVS to function. Similarly, there are callbacks for
-when a port gets added to the Hyper-V switch, and an External Network adapter
-or a VM Network adapter is connected/disconnected to the port. There are also
-callbacks for when a VIF (NIC of a child partition) sends out a packet, or a
-packet is received on an external NIC.
-
-As shown in the figures, an extensible switch extension gets to see a packet
-sent by the VM (VIF) twice - once on the ingress path and once on the egress
-path. Forwarding decisions are to be made on the ingress path. Correspondingly,
-we will be hooking onto the following interfaces:
-
-* Ingress send indication: intercept packets for performing flow based
-  forwarding. This includes straight forwarding to output ports. Any packet
-  modifications that need to be performed are done here, either inline or by
-  creating a new packet. A forwarding action is performed as the flow actions
-  dictate.
-
-* Ingress completion indication: clean up and free packets that we generated on
-  the ingress send path, pass-through for packets that we did not generate.
-
-* Egress receive indication: pass-through.
-
-* Egress completion indication: pass-through.
-
-Interfacing with OVS Userspace
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-We have implemented a pseudo device interface for letting OVS userspace talk to
-the OVS kernel module. This is equivalent to the typical character device
-interface on POSIX platforms where we can register custom functions for read,
-write and ioctl functionality. The pseudo device supports a number of ioctls
-that netdev and DPIF on OVS userspace make use of.
-
-Netlink Message Parser
-~~~~~~~~~~~~~~~~~~~~~~
-
-The communication between OVS userspace and OVS kernel datapath is in the form
-of Netlink messages [8]_. More details about this are provided below.  In the
-kernel, a full-fledged netlink message parser has been implemented along the
-lines of the netlink message parser in OVS userspace. In fact, a lot of the
-code is ported code.
-
-Modeled on ``struct ofpbuf`` in OVS userspace, a managed buffer has been
-implemented in the kernel datapath to make it easier to parse and construct
-netlink messages.
-
-Netlink Sockets
-~~~~~~~~~~~~~~~
-
-On Linux, OVS userspace utilizes netlink sockets to pass netlink messages back
-and forth. Since much of the userspace code, including the DPIF provider in
-dpif-netlink.c (formerly dpif-linux.c), has been reused, pseudo-netlink sockets
-have been implemented in OVS userspace. Windows lacks native netlink socket
-support, and its socket family is not extensible either. Hence it is not
-possible to provide a native implementation of netlink sockets. We emulate
-netlink sockets in lib/netlink-socket.c and support all of the nl_* APIs to
-higher levels. The implementation opens a handle to the pseudo device for each
-netlink socket. Some more details on this topic are provided in the userspace
-section on netlink sockets.
-
-Typical netlink semantics of read message, write message, dump, and transaction
-have been implemented so that higher level layers are not affected by the
-netlink implementation not being native.
-
-Switch/Datapath Management
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-As explained above, we hook onto the management callback functions in the NDIS
-interface for when to initialize the OVS data structures, flow tables etc. Some
-of this code is also driven by OVS userspace code which sends down ioctls for
-operations like creating a tunnel port etc.
-
-Port Management
-~~~~~~~~~~~~~~~
-
-As explained above, we hook onto the management callback functions in the NDIS
-interface to know when a port is added/connected to the Hyper-V switch. We use
-these callbacks to initialize the port related data structures in OVS. Also,
-some of the ports are tunnel ports that don’t exist on the Hyper-V switch and
-get added from OVS userspace.
-
-In order to identify a Hyper-V port, we use the value of the 'FriendlyName'
-field in each Hyper-V port. We call this the "OVS-port-name". The idea is that
-OVS userspace sets 'OVS-port-name' in each Hyper-V port to the same value as
-the 'name' field of the 'Interface' table in OVSDB. When OVS userspace calls
-into the kernel datapath to add a port, we match the name of the port with the
-'OVS-port-name' of a Hyper-V port.
-
-We maintain separate hash tables, and separate counters for ports that have
-been added from the Hyper-V switch, and for ports that have been added from OVS
-userspace.
-
-Flowtable/Actions/Packet Forwarding
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The flowtable and flow actions based packet forwarding is the core of the OVS
-datapath functionality. For each packet on the ingress path, we consult the
-flowtable and execute the corresponding actions. The actions can be limited to
-simple forwarding to one or more destination ports or, more commonly, can
-involve modifying the packet to insert a tunnel context or a VLAN ID, and
-thereafter forwarding to the external port to send the packet to a destination
-host.
-
-Tunneling
-~~~~~~~~~
-
-We make use of the Internal Port on a Hyper-V switch for implementing
-tunneling. The Internal Port is a virtual adapter that is exposed on the
-Hyper-V host, and connected to the Hyper-V switch. Basically, it is an
-interface between the host and the virtual switch. The Internal Port acts as
-the tunnel endpoint for the host (aka VTEP), and holds the VTEP IP address.
-
-Tunneling ports are not actual ports on the Hyper-V switch. These are virtual
-ports that OVS maintains. While executing actions, if the output port is a
-tunnel port, we short circuit by performing the encapsulation action based on
-the tunnel context. The encapsulated packet gets forwarded to the external
-port, and appears to the outside world as though it was sent from the VTEP.
-
-Similarly, when a tunneled packet enters OVS from the external port, we check
-whether it is bound to the internal port (VTEP). If so, we short circuit the
-path and directly forward the inner packet to the destination port (mostly a
-VIF, but dictated by the flow). We leverage the Windows Filtering Platform
-(WFP) framework to be able to receive tunneled packets that cannot be
-decapsulated by OVS right away. Currently, fragmented IP packets fall into that
-category, and we leverage the code in the host IP stack to reassemble the
-packet, and perform decapsulation on the reassembled packet.
-
-We'll also be using the IP helper library to provide us IP address and other
-information corresponding to the Internal port.
-
-Event Notifications
-~~~~~~~~~~~~~~~~~~~
-
-The pseudo device interface described above is also used for providing event
-notifications back to OVS userspace. A shared memory/overlapped IO model is
-used.
-
-Userspace Components
-~~~~~~~~~~~~~~~~~~~~
-
-The userspace portion of the OVS solution is mostly POSIX code, and not very
-Linux specific. Most of the userspace code does not interface directly with
-the kernel datapath and was ported independently of the kernel datapath effort.
-
-In this section, we cover the userspace components that interface with the
-kernel datapath.
-
-As explained earlier, OVS on Hyper-V shares the DPIF provider implementation
-with Linux. The DPIF provider on Linux uses netlink sockets and netlink
-messages. Netlink sockets and messages are extensively used on Linux to
-exchange information between userspace and kernel. In order to satisfy these
-dependencies, netlink sockets (pseudo and non-native) and netlink messages are
-implemented on Hyper-V.
-
-The following are the major advantages of sharing DPIF provider code:
-
-1. Maintenance is simpler:
-
-   Any change made to the interface defined in dpif-provider.h need not be
-   propagated to multiple implementations. Also, developers familiar with the
-   Linux implementation of the DPIF provider can easily ramp up on the Hyper-V
-   implementation as well.
-
-2. Netlink messages provide inherent advantages:
-
-   Netlink messages are known for their extensibility. Each message is
-   versioned, so the provided data structures offer a mechanism to perform
-   version checking and forward/backward compatibility with the kernel module.
-
-Netlink Sockets
-~~~~~~~~~~~~~~~
-
-As explained in other sections, an emulation of netlink sockets has been
-implemented in ``lib/netlink-socket.c`` for Windows. The implementation creates
-a handle to the OVS pseudo device, and emulates netlink socket semantics of
-receive message, send message, dump, and transact. Most of the ``nl_*``
-functions are supported.
-
-The fact that the implementation is non-native manifests in various ways.  One
-example is that PID for the netlink socket is not automatically assigned in
-userspace when a handle is created to the OVS pseudo device. There's an extra
-command (defined in ``OvsDpInterfaceExt.h``) that is used to grab the PID
-generated in the kernel.
-
-DPIF Provider
-~~~~~~~~~~~~~
-
-As has been mentioned in earlier sections, the netlink socket and netlink
-message based DPIF provider on Linux has been ported to Windows.
-
-Most of the code is common. Some divergence is in the code to receive packets.
-The Linux implementation uses epoll() which is not natively supported on
-Windows.
-
-netdev-windows
-~~~~~~~~~~~~~~
-
-We have a Windows implementation of the interface defined in
-``lib/netdev-provider.h``. The implementation provides functionality to get
-extended information about an interface. It is limited in functionality
-compared to the Linux implementation of the netdev provider and cannot be used
-to add any interfaces in the kernel such as a tap interface or to send/receive
-packets. The netdev-windows implementation uses the datapath interface
-extensions defined in ``datapath-windows/include/OvsDpInterfaceExt.h``.
-
-PowerShell Extensions to Set ``OVS-port-name``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-As explained in the section on "Port management", each Hyper-V port has a
-'FriendlyName' field, which we call the "OVS-port-name" field. We have
-implemented PowerShell command extensions to be able to set the "OVS-port-name"
-of a Hyper-V port.
-
-Kernel-Userspace Interface
---------------------------
-
-openvswitch.h and OvsDpInterfaceExt.h
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Since the DPIF provider is shared with Linux, the kernel datapath provides the
-same interface as the Linux datapath. The interface is defined in
-``datapath/linux/compat/include/linux/openvswitch.h``. Derivatives of this
-interface file are created during OVS userspace compilation. The derivative for
-the kernel datapath on Hyper-V is provided in
-``datapath-windows/include/OvsDpInterface.h``.
-
-That said, there are Windows specific extensions that are defined in the
-interface file ``datapath-windows/include/OvsDpInterfaceExt.h``.
-
-Flow of a Packet
-----------------
-
-Figure 2 shows the numbered steps in which a packet gets sent out of a VIF and
-is forwarded to another VIF or a physical NIC. As mentioned earlier, each VIF
-is attached to the switch via a port, and each port is on both the ingress and
-egress paths of the switch. Depending on whether a packet is being
-transmitted or received, one of the paths gets used. In the figure, each step n
-is annotated as ``#n``.
-
-The steps are as follows:
-
-1. When a packet is sent out of a VIF or a physical NIC or an internal port,
-   the packet is part of the ingress path.
-
-2. The OVS kernel driver gets to intercept this packet.
-
-   a. OVS looks up the flows in the flowtable for this packet, and executes the
-      corresponding action.
-
-   b. If there is no action, the packet is sent up to OVS userspace, which
-      examines the packet and figures out the actions.
-
-   c. Userspace executes the packet by specifying the actions, and might also
-      insert a flow to handle such packets in the future.
-
-   d. The destination ports are added to the packet, which is sent down to the
-      Hyper-V switch.
-
-3. The Hyper-V switch forwards the packet to the destination ports specified
-   in the packet, and sends it out on the egress path.
-
-4. The packet gets forwarded to the destination VIF.
-
-5. It might also get forwarded to a physical NIC, if the physical NIC
-   has been added as a destination port by OVS.
-
-Build/Deployment
-----------------
-
-The userspace components added as part of the OVS Windows implementation have
-been integrated with autoconf, and can be built using the steps mentioned in
-the BUILD.Windows file. Additional targets need to be specified to make.
-
-The OVS kernel code is part of a Visual Studio 2013 solution, and is compiled
-from the IDE. There are plans in the future to move this to a compilation mode
-such that we can compile it without an IDE as well.
-
-Once compiled, we have an install script that can be used to load the kernel
-driver.
-
-References
-----------
-
-.. [1] Hyper-V Extensible Switch http://msdn.microsoft.com/en-us/library/windows/hardware/hh598161(v=vs.85).aspx
-.. [2] Hyper-V Extensible Switch Extensions http://msdn.microsoft.com/en-us/library/windows/hardware/hh598169(v=vs.85).aspx
-.. [3] DPIF Provider http://openvswitch.sourcearchive.com/documentation/1.1.0-1/dpif-provider_8h_source.html
-.. [4] Hyper-V Extensible Switch Components http://msdn.microsoft.com/en-us/library/windows/hardware/hh598163(v=vs.85).aspx
-.. [5] Windows Filtering Platform http://msdn.microsoft.com/en-us/library/windows/desktop/aa366510(v=vs.85).aspx
-.. [6] IP Helper http://msdn.microsoft.com/en-us/library/windows/hardware/ff557015(v=vs.85).aspx
-.. [7] How to Port Open vSwitch to New Software or Hardware http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=PORTING
-.. [8] Netlink http://en.wikipedia.org/wiki/Netlink
-.. [9] epoll http://en.wikipedia.org/wiki/Epoll
diff --git a/datapath-windows/automake.mk b/datapath-windows/automake.mk

index 0f2bb22bfa4276040699ac584a4e1bcbbcdc5b30..88aa50ae60f5f59147f898a76908be95830d7780 100644 (file)
--- a/datapath-windows/automake.mk
+++ b/datapath-windows/automake.mk
@@ -1,5 +1,4 @@
  EXTRA_DIST += \
-       datapath-windows/DESIGN.rst \
         datapath-windows/Package/package.VcxProj \
         datapath-windows/Package/package.VcxProj.user \
         datapath-windows/include/OvsDpInterfaceExt.h \
diff --git a/datapath/Modules.mk b/datapath/Modules.mk

index 2ffab2b2ff1fcfc851ab14644f80b7904d297c55..21f04a0ea82dc872423312ed902a63c88c81d109 100644 (file)
--- a/datapath/Modules.mk
+++ b/datapath/Modules.mk
@@ -45,9 +45,6 @@ openvswitch_headers = \
         vport-internal_dev.h \
         vport-netdev.h
  
-openvswitch_extras = \
-       README.rst
-
  dist_sources = $(foreach module,$(dist_modules),$($(module)_sources))
  dist_headers = $(foreach module,$(dist_modules),$($(module)_headers))
  dist_extras = $(foreach module,$(dist_modules),$($(module)_extras))
diff --git a/datapath/README.rst b/datapath/README.rst

deleted file mode 100644 (file)

index 47e0e23..0000000
--- a/datapath/README.rst
+++ /dev/null
@@ -1,265 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-=======================================
-Open vSwitch Datapath Development Guide
-=======================================
-
-The Open vSwitch kernel module allows flexible userspace control over
-flow-level packet processing on selected network devices.  It can be used to
-implement a plain Ethernet switch, network device bonding, VLAN processing,
-network access control, flow-based network control, and so on.
-
-The kernel module implements multiple "datapaths" (analogous to bridges), each
-of which can have multiple "vports" (analogous to ports within a bridge).  Each
-datapath also has associated with it a "flow table" that userspace populates
-with "flows" that map from keys based on packet headers and metadata to sets of
-actions.  The most common action forwards the packet to another vport; other
-actions are also implemented.
-
-When a packet arrives on a vport, the kernel module processes it by extracting
-its flow key and looking it up in the flow table.  If there is a matching flow,
-it executes the associated actions.  If there is no match, it queues the packet
-to userspace for processing (as part of its processing, userspace will likely
-set up a flow to handle further packets of the same type entirely in-kernel).
-
-Flow Key Compatibility
-----------------------
-
-Network protocols evolve over time.  New protocols become important and
-existing protocols lose their prominence.  For the Open vSwitch kernel module
-to remain relevant, it must be possible for newer versions to parse additional
-protocols as part of the flow key.  It might even be desirable, someday, to
-drop support for parsing protocols that have become obsolete.  Therefore, the
-Netlink interface to Open vSwitch is designed to allow carefully written
-userspace applications to work with any version of the flow key, past or
-future.
-
-To support this forward and backward compatibility, whenever the kernel module
-passes a packet to userspace, it also passes along the flow key that it parsed
-from the packet.  Userspace then extracts its own notion of a flow key from the
-packet and compares it against the kernel-provided version:
-
-- If userspace's notion of the flow key for the packet matches the kernel's,
-  then nothing special is necessary.
-
-- If the kernel's flow key includes more fields than the userspace version of
-  the flow key, for example if the kernel decoded IPv6 headers but userspace
-  stopped at the Ethernet type (because it does not understand IPv6), then
-  again nothing special is necessary.  Userspace can still set up a flow in the
-  usual way, as long as it uses the kernel-provided flow key to do it.
-
-- If the userspace flow key includes more fields than the kernel's, for example
-  if userspace decoded an IPv6 header but the kernel stopped at the Ethernet
-  type, then userspace can forward the packet manually, without setting up a
-  flow in the kernel.  This case is bad for performance because every packet
-  that the kernel considers part of the flow must go to userspace, but the
-  forwarding behavior is correct.  (If userspace can determine that the values
-  of the extra fields would not affect forwarding behavior, then it could set
-  up a flow anyway.)
-
-How flow keys evolve over time is important to making this work, so
-the following sections go into detail.
-
-Flow Key Format
----------------
-
-A flow key is passed over a Netlink socket as a sequence of Netlink attributes.
-Some attributes represent packet metadata, defined as any information about a
-packet that cannot be extracted from the packet itself, e.g. the vport on which
-the packet was received.  Most attributes, however, are extracted from headers
-within the packet, e.g. source and destination addresses from Ethernet, IP, or
-TCP headers.
-
-The ``<linux/openvswitch.h>`` header file defines the exact format of the flow
-key attributes.  For informal explanatory purposes here, we write them as
-comma-separated strings, with parentheses indicating arguments and nesting.
-For example, the following could represent a flow key corresponding to a TCP
-packet that arrived on vport 1::
-
-    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
-    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
-    frag=no), tcp(src=49163, dst=80)
-
-Often we ellipsize arguments not important to the discussion, e.g.::
-
-    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
-
-Wildcarded Flow Key Format
---------------------------
-
-A wildcarded flow is described with two sequences of Netlink attributes passed
-over the Netlink socket: a flow key, exactly as described above, and an
-optional corresponding flow mask.
-
-A wildcarded flow can represent a group of exact match flows. Each ``1`` bit
-in the mask specifies an exact match with the corresponding bit in the flow key.
-A ``0`` bit specifies a don't care bit, which will match either a ``1`` or
-``0`` bit of an incoming packet. Using a wildcarded flow can improve the flow
-set up rate by reducing the number of new flows that need to be processed by
-the user space program.
-
-Support for the mask Netlink attribute is optional for both the kernel and user
-space program. The kernel can ignore the mask attribute, installing an exact
-match flow, or reduce the number of don't care bits in the kernel to less than
-what was specified by the user space program. In this case, variations in bits
-that the kernel does not implement will simply result in additional flow
-setups.  The kernel module will also work with user space programs that neither
-support nor supply flow mask attributes.
-
-Since the kernel may ignore or modify wildcard bits, it can be difficult for
-the userspace program to know exactly what matches are installed. There are two
-possible approaches: reactively install flows as they miss the kernel flow
-table (and therefore not attempt to determine wildcard changes at all) or use
-the kernel's response messages to determine the installed wildcards.
-
-When interacting with userspace, the kernel should maintain the match portion
-of the key exactly as originally installed. This provides a handle to
-identify the flow for all future operations. However, when reporting the mask
-of an installed flow, the mask should include any restrictions imposed by the
-kernel.
-
-The behavior when using overlapping wildcarded flows is undefined. It is the
-responsibility of the user space program to ensure that any incoming packet can
-match at most one flow, wildcarded or not. The current implementation performs
-best-effort detection of overlapping wildcarded flows and may reject some but
-not all of them. However, this behavior may change in future versions.
-
-Unique Flow Identifiers
------------------------
-
-An alternative to using the original match portion of a key as the handle for
-flow identification is a unique flow identifier, or "UFID". UFIDs are optional
-for both the kernel and user space program.
-
-User space programs that support UFID are expected to provide it during flow
-setup in addition to the flow, then refer to the flow using the UFID for all
-future operations. The kernel is not required to index flows by the original
-flow key if a UFID is specified.
-
-Basic Rule for Evolving Flow Keys
----------------------------------
-
-Some care is needed to really maintain forward and backward compatibility for
-applications that follow the rules listed under "Flow Key Compatibility" above.
-
-The basic rule is obvious:
-
-    New network protocol support must only supplement existing flow key
-    attributes.  It must not change the meaning of already defined flow key
-    attributes.
-
-This rule does have less-obvious consequences so it is worth working through a
-few examples.  Suppose, for example, that the kernel module did not already
-implement VLAN parsing.  Instead, it just interpreted the 802.1Q TPID
-(``0x8100``) as the Ethertype then stopped parsing the packet.  The flow key
-for any packet with an 802.1Q header would look essentially like this, ignoring
-metadata::
-
-    eth(...), eth_type(0x8100)
-
-Naively, to add VLAN support, it makes sense to add a new "vlan" flow key
-attribute to contain the VLAN tag, then continue to decode the encapsulated
-headers beyond the VLAN tag using the existing field definitions.  With this
-change, a TCP packet in VLAN 10 would have a flow key much like this::
-
-    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
-
-But this change would negatively affect a userspace application that has not
-been updated to understand the new "vlan" flow key attribute.  The application
-could, following the flow compatibility rules above, ignore the "vlan"
-attribute that it does not understand and therefore assume that the flow
-contained IP packets.  This is a bad assumption (the flow only contains IP
-packets if one parses and skips over the 802.1Q header) and it could cause the
-application's behavior to change across kernel versions even though it follows
-the compatibility rules.
-
-The solution is to use a set of nested attributes.  This is, for example, why
-802.1Q support uses nested attributes.  A TCP packet in VLAN 10 is actually
-expressed as::
-
-    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
-    ip(proto=6, ...), tcp(...))
-
-Notice how the ``eth_type``, ``ip``, and ``tcp`` flow key attributes are nested
-inside the ``encap`` attribute.  Thus, an application that does not understand
-the ``vlan`` key will not see any of those attributes and therefore will not
-misinterpret them.  (Also, the outer ``eth_type`` is still ``0x8100``, not
-changed to ``0x0800``.)
-
-Handling Malformed Packets
---------------------------
-
-Don't drop packets in the kernel for malformed protocol headers, bad checksums,
-etc.  This would prevent userspace from implementing a simple Ethernet switch
-that forwards every packet.
-
-Instead, in such a case, include an attribute with "empty" content.  It doesn't
-matter if the empty content could be valid protocol values, as long as those
-values are rarely seen in practice, because userspace can always have all
-packets with those values sent to userspace and handle them individually.
-
-For example, consider a packet that contains an IP header that indicates
-protocol 6 for TCP, but which is truncated just after the IP header, so that
-the TCP header is missing.  The flow key for this packet would include a tcp
-attribute with all-zero ``src`` and ``dst``, like this::
-
-    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
-
-As another example, consider a packet with an Ethernet type of 0x8100,
-indicating that a VLAN TCI should follow, but which is truncated just after the
-Ethernet type.  The flow key for this packet would include an all-zero-bits
-vlan and an empty encap attribute, like this::
-
-    eth(...), eth_type(0x8100), vlan(0), encap()
-
-Unlike a TCP packet with source and destination ports 0, an all-zero-bits VLAN
-TCI is not that rare, so the CFI bit (aka VLAN_TAG_PRESENT inside the kernel)
-is ordinarily set in a vlan attribute expressly to allow this situation to be
-distinguished.  Thus, the flow key in this second example unambiguously
-indicates a missing or malformed VLAN TCI.
-
-Other Rules
------------
-
-The other rules for flow keys are much less subtle:
-
-- Duplicate attributes are not allowed at a given nesting level.
-
-- Ordering of attributes is not significant.
-
-- When the kernel sends a given flow key to userspace, it always composes it
-  the same way.  This allows userspace to hash and compare entire flow keys
-  that it may not be able to fully interpret.
-
-Coding Rules
-------------
-
-Implement the headers and code for compatibility with older kernels in the
-``linux/compat/`` directory.  All public functions should be exported using
-the ``EXPORT_SYMBOL`` macro.  A public function replacing the same-named
-kernel function should be prefixed with ``rpl_``.  Otherwise, the function
-should be prefixed with ``ovs_``.  For special cases where it is not possible
-to follow this rule (e.g., the ``pskb_expand_head()`` function), the function
-name must be added to ``linux/compat/build-aux/export-check-whitelist``;
-otherwise, the compilation check ``check-export-symbol`` will fail.
diff --git a/include/openvswitch/ofp-actions.h b/include/openvswitch/ofp-actions.h

index 29992614ea93afb3b8a7e1f5d62c5a2e83bc48ca..946beafeb500929d48556b206ae7c0b6666712c1 100644 (file)
--- a/include/openvswitch/ofp-actions.h
+++ b/include/openvswitch/ofp-actions.h
@@ -158,8 +158,8 @@ enum {
   *       NXAST_SET_TUNNEL64.  In these cases, if the "struct ofpact" originated
   *       from OpenFlow, then we want to make sure that, if it gets translated
   *       back to OpenFlow later, it is translated back to the same action type.
- *       (Otherwise, we'd violate the promise made in DESIGN, in the "Action
- *       Reproduction" section.)
+ *       (Otherwise, we'd violate the promise made in the topics/design doc, in
+ *       the "Action Reproduction" section.)
   *
   *       For such actions, the 'raw' member should be the "enum ofp_raw_action"
   *       originally extracted from the OpenFlow action.  (If the action didn't
diff --git a/include/openvswitch/ofp-util.h b/include/openvswitch/ofp-util.h

index 8703d2a3a3382a9b6ee78d7fc8264771168e1d3a..91ff0c293de6fdf5bc4d613c5d914d1d30531264 100644 (file)
--- a/include/openvswitch/ofp-util.h
+++ b/include/openvswitch/ofp-util.h
@@ -282,7 +282,7 @@ enum ofputil_flow_mod_flags {
  /* Protocol-independent flow_mod.
   *
   * The handling of cookies across multiple versions of OpenFlow is a bit
- * confusing.  See DESIGN for the details. */
+ * confusing.  See the topics/design doc for the details. */
  struct ofputil_flow_mod {
      struct ovs_list list_node; /* For queuing flow_mods. */
  
@@ -818,7 +818,7 @@ struct ofputil_table_features {
       * supported, otherwise 0.  For other versions, they are decoded as -1 and
       * ignored for encoding.
       *
-     * See the section "OFPTC_* Table Configuration" in DESIGN.rst for more
+     * Search for "OFPTC_* Table Configuration" in the documentation for more
       * details of how OpenFlow has changed in this area.
       */
      enum ofputil_table_miss miss_config; /* OF1.1 and 1.2 only. */
diff --git a/lib/dpif.h b/lib/dpif.h
index e69087deef1c36de8592db8c7c4fa47c78aa3f18..40ffe29e7f8a563ae12a166866c4813f92f59cde 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -113,9 +113,8 @@
   *
   *      In Open vSwitch userspace, "struct flow" is the typical way to describe
   *      a flow, but the datapath interface uses a different data format to
- *      allow ABI forward- and backward-compatibility.  datapath/README.rst
- *      describes the rationale and design.  Refer to OVS_KEY_ATTR_* and
- *      "struct ovs_key_*" in include/odp-netlink.h for details.
+ *      allow ABI forward- and backward-compatibility.  Refer to OVS_KEY_ATTR_*
+ *      and "struct ovs_key_*" in include/odp-netlink.h for details.
   *      lib/odp-util.h defines several functions for working with these flows.
   *
   *    - A "mask" that, for each bit in the flow, specifies whether the datapath
diff --git a/lib/mac-learning.c b/lib/mac-learning.c
index 57b81f4186da7e290653cf4dd80305c0d3702f76..44c49622b44d192f5d0cc47f8934c404369eee86 100644
--- a/lib/mac-learning.c
+++ b/lib/mac-learning.c
@@ -410,9 +410,9 @@ update_learning_table__(struct mac_learning *ml, struct eth_addr src,
           * reflected packets, so we lock each entry for which a gratuitous ARP
           * packet was received over a non-bond interface and refrain from
           * learning from gratuitous ARP packets that arrive over bond
-         * interfaces for this entry while the lock is in effect.  See
-         * vswitchd/INTERNALS.rst for more in-depth discussion on this
-         * topic. */
+         * interfaces for this entry while the lock is in effect. Refer to the
+         * 'ovs-vswitch Internals' document for more in-depth discussion on
+         * this topic. */
          if (!is_bond) {
              mac_entry_set_grat_arp_lock(mac);
          } else if (mac_entry_is_grat_arp_locked(mac)) {
diff --git a/lib/mac-learning.h b/lib/mac-learning.h
index e42781500f546396ed6ad39213b164185ace1674..ee14185d9669034b2a045868e8fc816bc371d6ad 100644
--- a/lib/mac-learning.h
+++ b/lib/mac-learning.h
@@ -46,8 +46,8 @@
   *
   * Second, the implementation has the ability to "lock" a MAC table entry
   * updated by a gratuitous ARP.  This is a simple feature but the rationale for
- * it is complicated.  Please refer to the description of SLB bonding in
- * vswitchd/INTERNALS.rst for an explanation.
+ * it is complicated.  Refer to the description of SLB bonding in the
+ * 'ovs-vswitchd Internals' guide for an explanation.
   *
   * Third, the implementation expires entries that are idle for longer than a
   * configurable amount of time.  This is implemented by keeping all of the
diff --git a/lib/netdev.h b/lib/netdev.h
index bad28c4c14731182435776154b3ec779333cf2f2..a667fe35fa17659e4ef481e765395d4b87cf9ba9 100644
--- a/lib/netdev.h
+++ b/lib/netdev.h
@@ -30,7 +30,7 @@ extern "C" {
   *
   * Every port on a switch must have a corresponding netdev that must minimally
   * support a few operations, such as the ability to read the netdev's MTU.
- * The PORTING file at the top of the source tree has more information in the
+ * The Porting section of the documentation has more information in the
   * "Writing a netdev Provider" section.
   *
   * Thread-safety
diff --git a/lib/ofp-util.c b/lib/ofp-util.c
index 899cfe3886788fe74d7e0dafa8e6aa8dc9b5e4d7..b9efd32ee815c1e2f5ce4074c64995151bcbf71d 100644
--- a/lib/ofp-util.c
+++ b/lib/ofp-util.c
@@ -5679,7 +5679,7 @@ ofputil_encode_table_config(enum ofputil_table_miss miss,
                              enum ofp_version version)
  {
      uint32_t config = 0;
-    /* See the section "OFPTC_* Table Configuration" in DESIGN.rst for more
+    /* Search for "OFPTC_* Table Configuration" in the documentation for more
       * information on the crazy evolution of this field. */
      switch (version) {
      case OFP10_VERSION:
diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c
index 4b927d64b125dd1b85af1fc079b5f269efa5321f..1f135a4230f24c2774481f01d6cb79484390dbfb 100644
--- a/ofproto/connmgr.c
+++ b/ofproto/connmgr.c
@@ -1520,7 +1520,7 @@ ofconn_receives_async_msg(const struct ofconn *ofconn,
      ovs_assert((unsigned int) type < OAM_N_TYPES);
  
      /* Keep the following code in sync with the documentation in the
-     * "Asynchronous Messages" section in DESIGN. */
+     * "Asynchronous Messages" section in 'topics/design' */
  
      if (ofconn->type == OFCONN_SERVICE && !ofconn->miss_send_len) {
          /* Service connections don't get asynchronous messages unless they have
diff --git a/ovn/OVN-GW-HA.rst b/ovn/OVN-GW-HA.rst
deleted file mode 100644
index 5b21b64..0000000
--- a/ovn/OVN-GW-HA.rst
+++ /dev/null
@@ -1,426 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-==================================
-OVN Gateway High Availability Plan
-==================================
-
-::
-
-    OVN Gateway
-
-         +---------------------------+
-         |                           |
-         |     External Network      |
-         |                           |
-         +-------------^-------------+
-                       |
-                       |
-                 +-----------+
-                 |           |
-                 |  Gateway  |
-                 |           |
-                 +-----------+
-                       ^
-                       |
-                       |
-         +-------------v-------------+
-         |                           |
-         |    OVN Virtual Network    |
-         |                           |
-         +---------------------------+
-
-The OVN gateway is responsible for shuffling traffic between the tunneled
-overlay network (governed by ovn-northd), and the legacy physical network.  In
-a naive implementation, the gateway is a single x86 server, or hardware VTEP.
-For most deployments, a single system has enough forwarding capacity to service
-the entire virtualized network, however, it introduces a single point of
-failure.  If this system dies, the entire OVN deployment becomes unavailable.
-To mitigate this risk, an HA solution is critical -- by spreading
-responsibility across multiple systems, no single server failure can take down
-the network.
-
-An HA solution is both critical to the manageability of the system, and
-extremely difficult to get right.  The purpose of this document, is to propose
-a plan for OVN Gateway High Availability which takes into account our past
-experience building similar systems.  It should be considered a fluid changing
-proposal, not a set-in-stone decree.
-
-Basic Architecture
-------------------
-
-In an OVN deployment, the set of hypervisors and network elements operating
-under the guidance of ovn-northd are in what's called "logical space".  These
-servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
-the underlying physical network.  When these systems need to communicate with
-legacy networks, traffic must be routed through a Gateway which translates from
-OVN controlled tunnel traffic, to raw physical network traffic.
-
-Since the gateway is typically the only system with a connection to the
-physical network all traffic between logical space and the WAN must travel
-through it.  This makes it a critical single point of failure -- if the gateway
-dies, communication with the WAN ceases for all systems in logical space.
-
-To mitigate this risk, multiple gateways should be run in a "High Availability
-Cluster" or "HA Cluster".  The HA cluster will be responsible for performing
-the duties of a gateways,  while being able to recover gracefully from
-individual member failures.
-
-::
-
-    OVN Gateway HA Cluster
-
-             +---------------------------+
-             |                           |
-             |     External Network      |
-             |                           |
-             +-------------^-------------+
-                           |
-                           |
-    +----------------------v----------------------+
-    |                                             |
-    |          High Availability Cluster          |
-    |                                             |
-    | +-----------+  +-----------+  +-----------+ |
-    | |           |  |           |  |           | |
-    | |  Gateway  |  |  Gateway  |  |  Gateway  | |
-    | |           |  |           |  |           | |
-    | +-----------+  +-----------+  +-----------+ |
-    +----------------------^----------------------+
-                           |
-                           |
-             +-------------v-------------+
-             |                           |
-             |    OVN Virtual Network    |
-             |                           |
-             +---------------------------+
-
-L2 vs L3 High Availability
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-In order to achieve this goal, there are two broad approaches one can take.
-The HA cluster can appear to the network like a giant Layer 2 Ethernet Switch,
-or like a giant IP Router. These approaches are called L2HA, and L3HA
-respectively.  L2HA allows ethernet broadcast domains to extend into logical
-space, a significant advantage, but this comes at a cost.  The need to avoid
-transient L2 loops during failover significantly complicates their design.  On
-the other hand, L3HA works for most use cases, is simpler, and fails more
-gracefully.  For these reasons, it is suggested that OVN supports an L3HA
-model, leaving L2HA for future work (or third party VTEP providers).  Both
-models are discussed further below.
-
-L3HA
-----
-
-In this section, we'll work through a basic simple L3HA implementation, on top
-of which we'll gradually build more sophisticated features explaining their
-motivations and implementations as we go.
-
-Naive active-backup
-~~~~~~~~~~~~~~~~~~~
-
-Let's assume that there are a collection of logical routers which a tenant has
-asked for, our task is to schedule these logical routers on one of N gateways,
-and gracefully redistribute the routers on gateways which have failed.  The
-absolute simplest way to achieve this is what we'll call "naive-active-backup".
-
-::
-
-    Naive Active Backup HA Implementation
-
-    +----------------+   +----------------+
-    | Leader         |   | Backup         |
-    |                |   |                |
-    |      A B C     |   |                |
-    |                |   |                |
-    +----+-+-+-+----++   +-+--------------+
-         ^ ^ ^ ^    |      |
-         | | | |    |      |
-         | | | |  +-+------+---+
-         + + + +  | ovn-northd |
-         Traffic  +------------+
-
-In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
-leader.  All logical routers (A, B, C in the figure), are scheduled on this
-leader gateway and all traffic flows through it.  ovn-northd monitors this
-gateway via OpenFlow echo requests (or some equivalent), and if the gateway
-dies, it recreates the routers on one of the backups.
-
-This approach basically works in most cases and should likely be the starting
-point for OVN -- it's strictly better than no HA solution and is a good
-foundation for more sophisticated solutions.  That said, it's not without it's
-limitations. Specifically, this approach doesn't coordinate with the physical
-network to minimize disruption during failures, and it tightly couples failover
-to ovn-northd (we'll discuss why this is bad in a bit), and wastes resources by
-leaving backup gateways completely unutilized.
-
-Router Failover
-+++++++++++++++
-
-When ovn-northd notices the leader has died and decides to migrate routers to a
-backup gateway, the physical network has to be notified to direct traffic to
-the new gateway.  Otherwise, traffic could be blackholed for longer than
-necessary making failovers worse than they need to be.
-
-For now, let's assume that OVN requires all gateways to be on the same IP
-subnet on the physical network.  If this isn't the case, gateways would need to
-participate in routing protocols to orchestrate failovers, something which is
-difficult and out of scope of this document.
-
-Since all gateways are on the same IP subnet, we simply need to worry about
-updating the MAC learning tables of the Ethernet switches on that subnet.
-Presumably, they all have entries for each logical router pointing to the old
-leader.  If these entries aren't updated, all traffic will be sent to the (now
-defunct) old leader, instead of the new one.
-
-In order to mitigate this issue, it's recommended that the new gateway sends a
-Reverse ARP (RARP) onto the physical network for each logical router it now
-controls.  A Reverse ARP is a benign protocol used by many hypervisors when
-virtual machines migrate to update L2 forwarding tables.  In this case, the
-ethernet source address of the RARP is that of the logical router it
-corresponds to, and its destination is the broadcast address.  This causes the
-RARP to travel to every L2 switch in the broadcast domain, updating forwarding
-tables accordingly.  This strategy is recommended in all failover mechanisms
-discussed in this document -- when a router newly boots on a new leader, it
-should RARP its MAC address.
-
-Controller Independent Active-backup
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-::
-
-    Controller Independent Active-Backup Implementation
-
-    +----------------+   +----------------+
-    | Leader         |   | Backup         |
-    |                |   |                |
-    |      A B C     |   |                |
-    |                |   |                |
-    +----------------+   +----------------+
-         ^ ^ ^ ^
-         | | | |
-         | | | |
-         + + + +
-         Traffic
-
-The fundamental problem with naive active-backup, is it tightly couples the
-failover solution to ovn-northd.  This can significantly increase downtime in
-the event of a failover as the (often already busy) ovn-northd controller has
-to recompute state for the new leader. Worse, if ovn-northd goes down, we can't
-perform gateway failover at all.  This violates the principle that control
-plane outages should have no impact on dataplane functionality.
-
-In a controller independent active-backup configuration, ovn-northd is
-responsible for initial configuration while the HA cluster is responsible for
-monitoring the leader, and failing over to a backup if necessary.  ovn-northd
-sets HA policy, but doesn't actively participate when failovers occur.
-
-Of course, in this model, ovn-northd is not without some responsibility.  Its
-role is to pre-plan what should happen in the event of a failure, leaving it to
-the individual switches to execute this plan.  It does this by assigning each
-gateway a unique leadership priority.  Once assigned, it communicates this
-priority to each node it controls.  Nodes use the leadership priority to
-determine which gateway in the cluster is the active leader by using a simple
-metric: the leader is the gateway that is healthy, with the highest priority.
-If that gateway goes down, leadership falls to the next highest priority, and
-conversely, if a new gateway comes up with a higher priority, it takes over
-leadership.
-
-Thus, in this model, leadership of the HA cluster is determined simply by the
-status of its members.  Therefore if we can communicate the status of each
-gateway to each transport node, they can individually figure out which is the
-leader, and direct traffic accordingly.
-
-Tunnel Monitoring
-+++++++++++++++++
-
-Since in this model leadership is determined exclusively by the health status
-of member gateways, a key problem is how do we communicate this information to
-the relevant transport nodes.  Luckily, we can do this fairly cheaply using
-tunnel monitoring protocols like BFD.
-
-The basic idea is pretty straightforward.  Each transport node maintains a
-tunnel to every gateway in the HA cluster (not just the leader).  These tunnels
-are monitored using the BFD protocol to see which are alive.  Given this
-information, hypervisors can trivially compute the highest priority live
-gateway, and thus the leader.
-
-In practice, this leadership computation can be performed trivially using the
-bundle or group action.  Rather than using OpenFlow to simply output to the
-leader, all gateways could be listed in an active-backup bundle action ordered
-by their priority.  The bundle action will automatically take into account the
-tunnel monitoring status to output the packet to the highest priority live
-gateway.
-
-Inter-Gateway Monitoring
-++++++++++++++++++++++++
-
-One somewhat subtle aspect of this model, is that failovers are not globally
-atomic.  When a failover occurs, it will take some time for all hypervisors to
-notice and adjust accordingly.  Similarly, if a new high priority Gateway comes
-up, it may take some time for all hypervisors to switch over to the new leader.
-In order to avoid confusing the physical network, under these circumstances
-it's important for the backup gateways to drop traffic they've received
-erroneously.  In order to do this, each Gateway must know whether or not it is,
-in fact active.  This can be achieved by creating a mesh of tunnels between
-gateways.  Each gateway monitors the other gateways its cluster to determine
-which are alive, and therefore whether or not that gateway happens to be the
-leader.  If leading, the gateway forwards traffic normally, otherwise it drops
-all traffic.
-
-Gateway Leadership Resignation
-++++++++++++++++++++++++++++++
-
-Sometimes a gateway may be healthy, but still may not be suitable to lead the
-HA cluster.  This could happen for several reasons including:
-
-* The physical network is unreachable
-
-* BFD (or ping) has detected the next hop router is unreachable
-
-* The Gateway recently booted and isn't fully configured
-
-In this case, the Gateway should resign leadership by holding its tunnels down
-using the ``other_config:cpath_down`` flag.  This indicates to participating
-hypervisors and Gateways that this gateway should be treated as if it's down,
-even though its tunnels are still healthy.
-
-Router Specific Active-Backup
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-::
-
-    Router Specific Active-Backup
-
-    +----------------+ +----------------+
-    |                | |                |
-    |      A C       | |     B D E      |
-    |                | |                |
-    +----------------+ +----------------+
-                  ^ ^   ^ ^
-                  | |   | |
-                  | |   | |
-                  + +   + +
-                   Traffic
-
-Controller independent active-backup is a great advance over naive
-active-backup, but it still has one glaring problem -- it under-utilizes the
-backup gateways.  In ideal scenario, all traffic would split evenly among the
-live set of gateways.  Getting all the way there is somewhat tricky, but as a
-step in the direction, one could use the "Router Specific Active-Backup"
-algorithm.  This algorithm looks a lot like active-backup on a per logical
-router basis, with one twist.  It chooses a different active Gateway for each
-logical router.  Thus, in situations where there are several logical routers,
-all with somewhat balanced load, this algorithm performs better.
-
-Implementation of this strategy is quite straightforward if built on top of
-basic controller independent active-backup.  On a per logical router basis, the
-algorithm is the same, leadership is determined by the liveness of the
-gateways.  The key difference here is that the gateways must have a different
-leadership priority for each logical router.  These leadership priorities can
-be computed by ovn-northd just as they had been in the controller independent
-active-backup model.
-
-Once we have these per logical router priorities, they simply need be
-communicated to the members of the gateway cluster and the hypervisors.  The
-hypervisors in particular, need simply have an active-backup bundle action (or
-group action) per logical router listing the gateways in priority order for
-*that router*, rather than having a single bundle action shared for all the
-routers.
-
-Additionally, the gateways need to be updated to take into account individual
-router priorities.  Specifically, each gateway should drop traffic of backup
-routers it's running, and forward traffic of active gateways, instead of simply
-dropping or forwarding everything.  This should likely be done by having
-ovn-controller recompute OpenFlow for the gateway, though other options exist.
-
-The final complication is that ovn-northd's logic must be updated to choose
-these per logical router leadership priorities in a more sophisticated manner.
-It doesn't matter much exactly what algorithm it chooses to do this, beyond
-that it should provide good balancing in the common case.  I.E. each logical
-routers priorities should be different enough that routers balance to different
-gateways even when failures occur.
-
-Preemption
-++++++++++
-
-In an active-backup setup, one issue that users will run into is that of
-gateway leader preemption.  If a new Gateway is added to a cluster, or for some
-reason an existing gateway is rebooted, we could end up in a situation where
-the newly activated gateway has higher priority than any other in the HA
-cluster.  In this case, as soon as that gateway appears, it will preempt
-leadership from the currently active leader causing an unnecessary failover.
-Since failover can be quite expensive, this preemption may be undesirable.
-
-The controller can optionally avoid preemption by cleverly tweaking the
-leadership priorities.  For each router, new gateways should be assigned
-priorities that put them second in line or later when they eventually come up.
-Furthermore, if a gateway goes down for a significant period of time, its old
-leadership priorities should be revoked and new ones should be assigned as if
-it's a brand new gateway.  Note that this should only happen if a gateway has
-been down for a while (several minutes), otherwise a flapping gateway could
-have wide ranging, unpredictable, consequences.
-
-Note that preemption avoidance should be optional depending on the deployment.
-One necessarily sacrifices optimal load balancing to satisfy these requirements
-as new gateways will get no traffic on boot.  Thus, this feature represents a
-trade-off which must be made on a per installation basis.
-
-Fully Active-Active HA
-~~~~~~~~~~~~~~~~~~~~~~
-
-::
-
-    Fully Active-Active HA
-
-    +----------------+ +----------------+
-    |                | |                |
-    |   A B C D E    | |    A B C D E   |
-    |                | |                |
-    +----------------+ +----------------+
-                  ^ ^   ^ ^
-                  | |   | |
-                  | |   | |
-                  + +   + +
-                   Traffic
-
-The final step in L3HA is to have true active-active HA.  In this scenario each
-router has an instance on each Gateway, and a mechanism similar to ECMP is used
-to distribute traffic evenly among all instances.  This mechanism would require
-Gateways to participate in routing protocols with the physical network to
-attract traffic and alert of failures.  It is out of scope of this document,
-but may eventually be necessary.
-
-L2HA
-----
-
-L2HA is very difficult to get right.  Unlike L3HA, where the consequences of
-problems are minor, in L2HA if two gateways are both transiently active, an L2
-loop triggers and a broadcast storm results.  In practice to get around this,
-gateways end up implementing an overly conservative "when in doubt drop all
-traffic" policy, or they implement something like MLAG.
-
-MLAG has multiple gateways work together to pretend to be a single L2 switch
-with a large LACP bond.  In principle, it's the right solution to the problem
-as it solves the broadcast storm problem, and has been deployed successfully in
-other contexts.  That said, it's difficult to get right and not recommended.
diff --git a/ovn/automake.mk b/ovn/automake.mk
index 7465f8ed2530ff019ee1daaeb5d01892c8a92556..1257ef49d86819eb544b010fe6bcdd6013cb4401 100644
--- a/ovn/automake.mk
+++ b/ovn/automake.mk
@@ -71,8 +71,7 @@ EXTRA_DIST += ovn/ovn-architecture.7.xml
  DISTCLEANFILES += ovn/ovn-architecture.7
  
  EXTRA_DIST += \
-       ovn/TODO.rst \
-       ovn/OVN-GW-HA.rst
+       ovn/TODO.rst
  
  # Version checking for ovn-nb.ovsschema.
  ALL_LOCAL += ovn/ovn-nb.ovsschema.stamp
diff --git a/ovn/controller/pinctrl.c b/ovn/controller/pinctrl.c
index db9e4416154ef3477b28fd9734f845cd52ea7b2c..673d65cb378943dd45cb77ff109e7f92b76fb11e 100644
--- a/ovn/controller/pinctrl.c
+++ b/ovn/controller/pinctrl.c
@@ -731,8 +731,7 @@ pinctrl_recv(const struct ofp_header *oh, enum ofptype type)
      if (type == OFPTYPE_ECHO_REQUEST) {
          queue_msg(make_echo_reply(oh));
      } else if (type == OFPTYPE_GET_CONFIG_REPLY) {
-        /* Enable asynchronous messages (see "Asynchronous Messages" in
-         * DESIGN.rst for more information). */
+        /* Enable asynchronous messages */
          struct ofputil_switch_config config;
  
          ofputil_decode_get_config_reply(oh, &config);
diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml
index 95cba984d3a85a43f0210aa628563bc402b1292c..d96e4b14142daae7e95632d317a66ee531432d55 100644
--- a/ovn/ovn-architecture.7.xml
+++ b/ovn/ovn-architecture.7.xml
@@ -341,8 +341,8 @@
        controller (over a Unix domain socket) instead of a remote controller.
        It's possible, however, for some other bridge in the same system to have
        an in-band remote controller, and in that case this suppresses the flows
-      that in-band control would ordinarily set up.  See <code>In-Band
-      Control</code> in <code>DESIGN.rst</code> for more information.
+      that in-band control would ordinarily set up.  Refer to the documentation
+      for more information.
      </dd>
    </dl>
  
diff --git a/rhel/openvswitch-fedora.spec.in b/rhel/openvswitch-fedora.spec.in
index d70934a48760a4b0d0edb2a4b9d1d6508526a185..d9befe0669682fbe4b26a77022131471ec348795 100644
--- a/rhel/openvswitch-fedora.spec.in
+++ b/rhel/openvswitch-fedora.spec.in
@@ -481,7 +481,7 @@ fi
  %{_mandir}/man8/ovs-vswitchd.8*
  %{_mandir}/man8/ovs-parse-backtrace.8*
  %{_mandir}/man8/ovs-testcontroller.8*
-%doc COPYING DESIGN.rst NOTICE README.rst WHY-OVS.rst
+%doc COPYING NOTICE README.rst WHY-OVS.rst
  %doc FAQ.rst NEWS rhel/README.RHEL.rst
  /var/lib/openvswitch
  /var/log/openvswitch
diff --git a/rhel/openvswitch.spec.in b/rhel/openvswitch.spec.in
index 95375b3b7be8f874562aa8eefa5344c1362b4b52..2f99dcc0aba7b64f642b81c897c196c7a2828124 100644
--- a/rhel/openvswitch.spec.in
+++ b/rhel/openvswitch.spec.in
@@ -248,7 +248,7 @@ exit 0
  /usr/share/openvswitch/scripts/sysconfig.template
  /usr/share/openvswitch/vswitch.ovsschema
  /usr/share/openvswitch/vtep.ovsschema
-%doc COPYING DESIGN.rst NOTICE README.rst WHY-OVS.rst FAQ.rst NEWS
+%doc COPYING NOTICE README.rst WHY-OVS.rst FAQ.rst NEWS
  %doc rhel/README.RHEL.rst
  /var/lib/openvswitch
  /var/log/openvswitch
diff --git a/tests/ovs-ofctl.at b/tests/ovs-ofctl.at
index f2ee970093442a8474a7fa2f114ed579673dec35..8bc6239db80420fe9615edb95f2694ca7bd195b9 100644
--- a/tests/ovs-ofctl.at
+++ b/tests/ovs-ofctl.at
@@ -2476,7 +2476,7 @@ AT_CHECK([echo "$tcp_flags" | ovs-ofctl parse-oxm OpenFlow15], [0],
  AT_CLEANUP
  
  dnl Check all of the patterns mentioned in the "VLAN Matching" section
-dnl in the DESIGN file at top level.
+dnl in the topics/design doc
  AT_SETUP([ovs-ofctl check-vlan])
  AT_KEYWORDS([VLAN])
  
diff --git a/utilities/ovs-ofctl.8.in b/utilities/ovs-ofctl.8.in
index 96135eac82eb5f263b39bd0607ee60b1ccfc7eaa..af1eb2b7baf25125d41ec4c5e39713b021c556bc 100644
--- a/utilities/ovs-ofctl.8.in
+++ b/utilities/ovs-ofctl.8.in
@@ -787,7 +787,7 @@ When \fBipv6\fR or \fBdl_type=0x86dd\fR is specified, matches IPv6
  header type \fIproto\fR, which is specified as a decimal number between
  0 and 255, inclusive (e.g. 58 to match ICMPv6 packets or 6 to match
  TCP).  The header type is the terminal header as described in the
-\fBDESIGN\fR document.
+\fBtopics/design\fR document.
  .IP
  When \fBarp\fR or \fBdl_type=0x0806\fR is specified, matches the lower
  8 bits of the ARP opcode.  ARP opcodes greater than 255 are treated as
diff --git a/vswitchd/INTERNALS.rst b/vswitchd/INTERNALS.rst
deleted file mode 100644
index 95c00f2..0000000
--- a/vswitchd/INTERNALS.rst
+++ /dev/null
@@ -1,244 +0,0 @@
-..
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-      Convention for heading levels in Open vSwitch documentation:
-
-      =======  Heading 0 (reserved for the title in a document)
-      -------  Heading 1
-      ~~~~~~~  Heading 2
-      +++++++  Heading 3
-      '''''''  Heading 4
-
-      Avoid deeper levels because they do not render well.
-
-======================
-ovs-vswitchd Internals
-======================
-
-This document describes some of the internals of the ovs-vswitchd process.  It
-is not complete.  It tends to be updated on demand, so if you have questions
-about the vswitchd implementation, ask them and perhaps we'll add some
-appropriate documentation here.
-
-Most of the ovs-vswitchd implementation is in ``vswitchd/bridge.c``, so code
-references below should be assumed to refer to that file except as otherwise
-specified.
-
-Bonding
--------
-
-Bonding allows two or more interfaces (the "slaves") to share network traffic.
-From a high-level point of view, bonded interfaces act like a single port, but
-they have the bandwidth of multiple network devices, e.g. two 1 GB physical
-interfaces act like a single 2 GB interface.  Bonds also increase robustness:
-the bonded port does not go down as long as at least one of its slaves is up.
-
-In vswitchd, a bond always has at least two slaves (and may have more).  If a
-configuration error, etc. would cause a bond to have only one slave, the port
-becomes an ordinary port, not a bonded port, and none of the special features
-of bonded ports described in this section apply.
-
-There are many forms of bonding of which ovs-vswitchd implements only a few.
-The most complex bond ovs-vswitchd implements is called "source load balancing"
-or SLB bonding.  SLB bonding divides traffic among the slaves based on the
-Ethernet source address.  This is useful only if the traffic over the bond has
-multiple Ethernet source addresses, for example if network traffic from
-multiple VMs are multiplexed over the bond.
-
-Enabling and Disabling Slaves
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-When a bond is created, a slave is initially enabled or disabled based on
-whether carrier is detected on the NIC (see ``iface_create()``).  After that, a
-slave is disabled if its carrier goes down for a period of time longer than the
-downdelay, and it is enabled if carrier comes up for longer than the updelay
-(see ``bond_link_status_update()``).  There is one exception where the updelay
-is skipped: if no slaves at all are currently enabled, then the first slave on
-which carrier comes up is enabled immediately.
-
-The updelay should be set to a time longer than the STP forwarding delay of the
-physical switch to which the bond port is connected (if STP is enabled on that
-switch).  Otherwise, the slave will be enabled, and load may be shifted to it,
-before the physical switch starts forwarding packets on that port, which can
-cause some data to be "blackholed" for a time.  The exception that skips the
-updelay does not cause any problem in this regard because when no slaves are
-enabled all output packets are blackholed anyway.
-
-When a slave becomes disabled, the vswitch immediately chooses a new output
-port for traffic that was destined for that slave (see
-``bond_enable_slave()``).  It also sends a "gratuitous learning packet",
-specifically a RARP, on the bond port (on the newly chosen slave) for each MAC
-address that the vswitch has learned on a port other than the bond (see
-``bond_send_learning_packets()``), to teach the physical switch that the new
-slave should be used in place of the one that is now disabled.  (This behavior
-probably makes sense only for a vswitch that has only one port (the bond)
-connected to a physical switch; vswitchd should probably provide a way to
-disable or configure it in other scenarios.)
-
-Bond Packet Input
-~~~~~~~~~~~~~~~~~
-
-Bonding accepts unicast packets on any bond slave.  This can occasionally cause
-packet duplication for the first few packets sent to a given MAC, if the
-physical switch attached to the bond is flooding packets to that MAC because it
-has not yet learned the correct slave for that MAC.
-
-Bonding only accepts multicast (and broadcast) packets on a single bond slave
-(the "active slave") at any given time.  Multicast packets received on other
-slaves are dropped.  Otherwise, every multicast packet would be duplicated,
-once for every bond slave, because the physical switch attached to the bond
-will flood those packets.
-
-Bonding also drops received packets when the vswitch has learned that the
-packet's MAC is on a port other than the bond port itself.  This is because it
-is likely that the vswitch itself sent the packet out the bond port on a
-different slave and is now receiving the packet back.  This occurs when the
-packet is multicast or the physical switch has not yet learned the MAC and is
-flooding it.  However, the vswitch makes an exception to this rule for
-broadcast ARP replies, which indicate that the MAC has moved to another switch,
-probably due to VM migration.  (ARP replies are normally unicast, so this
-exception does not match normal ARP replies.  It will match the learning
-packets sent on bond fail-over.)
-
-The active slave is simply the first slave to be enabled after the bond is
-created (see ``bond_choose_active_iface()``).  If the active slave is disabled,
-then a new active slave is chosen among the slaves that remain enabled.
-Currently, due to the way that configuration works, this tends to be the
-remaining slave whose interface name is first alphabetically, but this is by no
-means guaranteed.
-
-Bond Packet Output
-~~~~~~~~~~~~~~~~~~
-
-When a packet is sent out a bond port, the bond slave actually used is selected
-based on the packet's source MAC and VLAN tag (see ``choose_output_iface()``).
-In particular, the source MAC and VLAN tag are hashed into one of 256 values,
-and that value is looked up in a hash table (the "bond hash") kept in the
-``bond_hash`` member of struct port.  The hash table entry identifies a bond
-slave.  If no bond slave has yet been chosen for that hash table entry,
-vswitchd chooses one arbitrarily.
-
-Every 10 seconds, vswitchd rebalances the bond slaves (see
-``bond_rebalance_port()``).  To rebalance, vswitchd examines the statistics for
-the number of bytes transmitted by each slave over approximately the past
-minute, with data sent more recently weighted more heavily than data sent less
-recently.  It considers each of the slaves in order from most-loaded to
-least-loaded.  If highly loaded slave H is significantly more heavily loaded
-than the least-loaded slave L, and slave H carries at least two hashes, then
-vswitchd shifts one of H's hashes to L.  However, vswitchd will only shift a
-hash from H to L if it will decrease the ratio of the load between H and L by
-at least 0.1.
-
-Currently, "significantly more loaded" means that H must carry at least 1 Mbps
-more traffic, and that traffic must be at least 3% greater than L's.
-
-Bond Balance Modes
-~~~~~~~~~~~~~~~~~~
-
-Each bond balancing mode has different considerations, described below.
-
-LACP Bonding
-++++++++++++
-
-LACP bonding requires the remote switch to implement LACP, but it is otherwise
-very simple in that, after LACP negotiation is complete, there is no need for
-special handling of received packets.
-
-Several of the physical switches that support LACP block all traffic for ports
-that are configured to use LACP until LACP is negotiated with the host.  When
-configuring an LACP bond on an OVS host (e.g. XenServer), this means that there
-will be an interruption of network connectivity between the time the ports on
-the physical switch are configured and the time the bond on the OVS host is
-configured.  The interruption may be relatively long if different people are
-responsible for managing the switches and the OVS host.
-
-Such a network connectivity failure can be avoided by configuring LACP on the
-OVS host before configuring the physical switch, and having the OVS host fall
-back to another bond mode (active-backup) until the physical switch LACP
-configuration is complete.  The "lacp-fallback-ab" option provides this
-behavior in Open vSwitch.
-
-Active Backup Bonding
-+++++++++++++++++++++
-
-Active Backup bonds send all traffic out one "active" slave until that slave
-becomes unavailable.  Since they are significantly less complicated than SLB
-bonds, they are preferred when LACP is not an option.  Additionally, they are
-the only bond mode which supports attaching each slave to a different upstream
-switch.
-
-SLB Bonding
-+++++++++++
-
-SLB bonding allows a limited form of load balancing without the remote switch's
-knowledge or cooperation.  The basics of SLB are simple.  SLB assigns each
-source MAC+VLAN pair to a link and transmits all packets from that MAC+VLAN
-through that link.  Learning in the remote switch causes it to send packets to
-that MAC+VLAN through the same link.
-
-SLB bonding has the following complications:
-
-0. When the remote switch has not learned the MAC for the destination of a
-   unicast packet and hence floods the packet to all of the links on the SLB
-   bond, Open vSwitch will forward duplicate packets, one per link, to each
-   other switch port.
-
-   Open vSwitch does not solve this problem.
-
-1. When the remote switch receives a multicast or broadcast packet from a port
-   not on the SLB bond, it will forward it to all of the links in the SLB bond.
-   This would cause packet duplication if not handled specially.
-
-   Open vSwitch avoids packet duplication by accepting multicast and broadcast
-   packets on only the active slave, and dropping multicast and broadcast
-   packets on all other slaves.
-
-2. When Open vSwitch forwards a multicast or broadcast packet to a link in the
-   SLB bond other than the active slave, the remote switch will forward it to
-   all of the other links in the SLB bond, including the active slave.  Without
-   special handling, this would mean that Open vSwitch would forward a second
-   copy of the packet to each switch port (other than the bond), including the
-   port that originated the packet.
-
-   Open vSwitch deals with this case by dropping packets received on any SLB
-   bonded link that have a source MAC+VLAN that has been learned on any other
-   port.  (This means that SLB as implemented in Open vSwitch relies critically
-   on MAC learning.  Notably, SLB is incompatible with the "flood_vlans"
-   feature.)
-
-3. Suppose that a MAC+VLAN moves to an SLB bond from another port (e.g. when a
-   VM is migrated from this hypervisor to a different one).  Without additional
-   special handling, Open vSwitch will not notice until the MAC learning entry
-   expires, up to 60 seconds later, as a consequence of rule #2.
-
-   Open vSwitch avoids a 60-second delay by listening for gratuitous ARPs,
-   which VMs commonly emit upon migration.  As an exception to rule #2, a
-   gratuitous ARP received on an SLB bond is not dropped and updates the MAC
-   learning table in the usual way.  (If a move does not trigger a gratuitous
-   ARP, or if the gratuitous ARP is lost in the network, then a 60-second delay
-   still occurs.)
-
-4. Suppose that a MAC+VLAN moves from an SLB bond to another port (e.g. when a
-   VM is migrated from a different hypervisor to this one), that the MAC+VLAN
-   emits a gratuitous ARP, and that Open vSwitch forwards that gratuitous ARP
-   to a link in the SLB bond other than the active slave.  The remote switch
-   will forward the gratuitous ARP to all of the other links in the SLB bond,
-   including the active slave.  Without additional special handling, this would
-   mean that Open vSwitch would learn that the MAC+VLAN was located on the SLB
-   bond, as a consequence of rule #3.
-
-   Open vSwitch avoids this problem by "locking" the MAC learning table entry
-   for a MAC+VLAN from which a gratuitous ARP was received on a port other than
-   the SLB bond.  For 5 seconds, a locked MAC learning table entry will not be
-   updated based on a gratuitous ARP received on an SLB bond.
-
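The SLB receive rules listed in the removed text (complications 1 through 4) can be summarized as a small decision procedure. The following is only an illustrative sketch in Python, not the code in ``vswitchd/bridge.c``: the ``Switch`` and ``MacEntry`` classes, the ``admit_on_bond()`` and ``learn()`` names, and the choice to drop a frame whose entry is locked are all assumptions made for the example. The text itself only states that a locked entry is not updated.

```python
from dataclasses import dataclass, field

GRAT_ARP_LOCK_SECS = 5.0  # lock duration stated in the text


@dataclass
class MacEntry:
    port: str                # port on which the MAC+VLAN was learned
    lock_until: float = 0.0  # time at which the gratuitous ARP lock expires


@dataclass
class Switch:
    """Hypothetical model of a vswitch with one SLB bond port."""
    bond_port: str
    active_slave: str
    macs: dict = field(default_factory=dict)  # (mac, vlan) -> MacEntry

    def learn(self, key, port, now, grat_arp=False):
        """Record key on port; lock it if a gratuitous ARP came from a non-bond port."""
        entry = MacEntry(port)
        if grat_arp and port != self.bond_port:
            entry.lock_until = now + GRAT_ARP_LOCK_SECS
        self.macs[key] = entry

    def admit_on_bond(self, slave, key, now, multicast=False, grat_arp=False):
        """Return True if a frame received on bond slave `slave` is accepted."""
        # Complication 1: multicast and broadcast only on the active slave.
        if multicast and slave != self.active_slave:
            return False
        entry = self.macs.get(key)
        if entry is not None and entry.port != self.bond_port:
            # Complication 2: source learned elsewhere, so this is probably
            # our own frame reflected back by the remote switch.
            if not grat_arp:
                return False
            # Complication 4: a locked entry ignores gratuitous ARPs seen on
            # the bond (dropping the frame here is an assumption).
            if now < entry.lock_until:
                return False
        # Complication 3, or the ordinary case: accept and learn on the bond.
        self.learn(key, self.bond_port, now)
        return True
```

For example, with ``sw = Switch(bond_port="bond0", active_slave="eth0")`` and a MAC+VLAN first learned on a hypothetical VM port ``vif1``, a unicast frame from that source arriving on ``eth1`` is rejected, while a broadcast gratuitous ARP arriving on ``eth0`` is accepted and moves the entry to the bond, unless the entry was locked within the previous 5 seconds.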
diff --git a/vswitchd/automake.mk b/vswitchd/automake.mk

index 94a0272161648e71d0c6efd693201c6c413fd848..895625be20cede3078bc685d4a9e4082811dc18d 100644 (file)
--- a/vswitchd/automake.mk
+++ b/vswitchd/automake.mk
@@ -16,7 +16,6 @@ vswitchd_ovs_vswitchd_LDADD = \
         lib/libsflow.la \
         lib/libopenvswitch.la
  vswitchd_ovs_vswitchd_LDFLAGS = $(AM_LDFLAGS) $(DPDK_vswitchd_LDFLAGS)
-EXTRA_DIST += vswitchd/INTERNALS.rst
  MAN_ROOTS += vswitchd/ovs-vswitchd.8.in
  
  # vswitch schema and IDL
DESIGN.rst [deleted file]
Documentation/OVSDB-replication.rst [deleted file]
Documentation/automake.mk
Documentation/howto/openstack-containers.rst
Documentation/intro/install/netbsd.rst
Documentation/topics/bonding.rst [new file with mode: 0644]
Documentation/topics/datapath.rst [new file with mode: 0644]
Documentation/topics/design.rst [new file with mode: 0644]
Documentation/topics/dpdk.rst [new file with mode: 0644]
Documentation/topics/high-availability.rst [new file with mode: 0644]
Documentation/topics/index.rst
Documentation/topics/integration.rst [new file with mode: 0644]
Documentation/topics/openflow.rst [new file with mode: 0644]
Documentation/topics/ovsdb-replication.rst [new file with mode: 0644]
Documentation/topics/porting.rst [new file with mode: 0644]
Documentation/topics/windows.rst [new file with mode: 0644]
FAQ.rst
IntegrationGuide.rst [deleted file]
Makefile.am
OPENFLOW.rst [deleted file]
PORTING.rst [deleted file]
WHY-OVS.rst
datapath-windows/DESIGN.rst [deleted file]
datapath-windows/automake.mk
datapath/Modules.mk
datapath/README.rst [deleted file]
include/openvswitch/ofp-actions.h
include/openvswitch/ofp-util.h
lib/dpif.h
lib/mac-learning.c
lib/mac-learning.h
lib/netdev.h
lib/ofp-util.c
ofproto/connmgr.c
ovn/OVN-GW-HA.rst [deleted file]
ovn/automake.mk
ovn/controller/pinctrl.c
ovn/ovn-architecture.7.xml
rhel/openvswitch-fedora.spec.in
rhel/openvswitch.spec.in
tests/ovs-ofctl.at
utilities/ovs-ofctl.8.in
vswitchd/INTERNALS.rst [deleted file]
vswitchd/automake.mk